posted: November 07, 2021 edited: January 21, 2022

Accessing UniProt via its REST API

Tags: uniprot, proteins, rest api, python

What is UniProt?

The Universal Protein Resource, more commonly referred to as UniProt, is an online resource that contains protein sequences and associated data from a wide range of organisms, including humans. This curated database aims to be a universal resource that ties together disparate data from a wide variety of sources into a single, central location. This data includes structural information, isoform data, subcellular localization, reported variants and associated pathologies, and much more. Most scientists access this data by going directly to the website (https://www.uniprot.org) and using the search function to pull up information about a particular protein. While this approach works well when you are interested in only a few proteins, it quickly becomes tedious and infeasible if you want to perform queries with larger protein sets.

For such queries, a different approach is needed. This is where UniProt’s REST (Representational State Transfer) interface comes in. Essentially, this interface describes how one can programmatically send queries to UniProt’s servers to access their database. Before we get into the details of how to accomplish this, it’s necessary to go through some basic terminology and discuss how computers send information via the internet.

Client-server communication

You (a client) represent a computer that wants to access information stored elsewhere. To get this data, you must send out a request that specifies the location of the desired data and what data you would like to download. On the internet, locations are specified by Internet Protocol (IP) addresses, which are analogous to mailing addresses. Each IP address specifies a unique network location/destination and is required to route messages on the network. Any communication between two or more computers on the network requires such addresses.

In your case, you the client would like to request information from the computers at UniProt (their servers). This requires you to know their IP address, but at this stage all you know is the web address (a string of characters): uniprot.org. Luckily, the internet has another service, the Domain Name System (DNS), that can take any name (web address) and find the IP address of the computers registered under that name. This is called address resolution and is the reason why you can just enter a web address such as cnn.com and magically see their website. Remember that the network has no concept of what cnn.com is; it only understands IP addresses and how to route using these numbers. How is this solved? Behind the scenes, your computer takes what you entered as the web address and sends it to the address service, which then sends back the IP address for cnn.com. Once the IP address is in hand, your computer can establish contact with the computers at CNN and download the data for the webpage you want to look at.
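The name-to-address lookup described above can be tried from Python with the standard library’s socket module. This is only a minimal sketch: resolve is a helper name made up for this example, and the address returned for a real host depends on your network and on when you run it.

```python
import socket


def resolve(hostname):
    """Return one IPv4 address for hostname, as reported by the system resolver.

    `resolve` is a hypothetical helper written for this tutorial.
    """
    return socket.gethostbyname(hostname)


# An address that is already numeric needs no lookup and comes back unchanged.
print(resolve("127.0.0.1"))
```

With a working internet connection, calling resolve("www.uniprot.org") would return whichever address DNS currently reports for UniProt’s servers.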

[Figure: networking diagram]

IP addresses are only a part of the story, in the same way that an addressed envelope without a letter inside is an incomplete correspondence. Once a connection with another computer is established, you next need to tell that computer what you want it to do for you. These communications are generally referred to as requests because you as the client are requesting some data from the server computer. Two main types of requests are used: GET requests and POST requests.

Both are used to request data, but they differ in the format they use to communicate with the server. Let’s turn to another simple example to examine the difference. Imagine you are looking at an online phonebook with many hundreds of pages. Your goal is to get the data on page 100 of this book. How do you ask the phonebook server to provide you with this specific page? In a GET request, you would provide the desired page number as data within a URL. The URL would look something like this: http://phonebookserver.com/book.php?page_number=100. Let’s break down the URL to see what’s going on here.

URLs

A URL is really a list of text instructions that are written consecutively in a single line. The first part (written before the : ) is the protocol, http. This is just a name that represents a universally accepted, defined language that your computer will use to communicate with the phonebook server. This would be equivalent to writing a letter and putting ENGLISH in large letters on the outside, to tell the recipient that you wrote your letter in English and would like the response to be in English as well. Next is phonebookserver.com, which is the web name of the server. This will be converted to an IP address in the background, as discussed, and tells the network which computer to send your request to.

[Figure: URL schema]

So at this point, going left to right, we have established the language of communication and the end points between which data will be sent and received. However, simply sending data to a computer without telling it what to do is not enough. A server computer serves many functions, and most of them are likely unrelated to the specific question you are asking it. More information is needed in your request. You need to tell the computer what to do with the data you are sending it. In our case we want to send the data to book.php, a software program that lives on the server and was specifically designed to get phonebook pages. Again, let’s use an analogy to make this clearer. In many respects a server is like a large office building with many different departments that specialize in different tasks. If you send a letter to the office but don’t specify a department, it won’t end up with the right people and your request can’t be processed. In our case, if we don’t specify book.php, the computer won’t understand what program or file we want to send the data to, and it won’t be able to send back a meaningful response to your request.

Alright, we have now covered the first part of the URL, http://phonebookserver.com/book.php, which has allowed us to contact the server and tell it we want to send data to the book.php program. Now we want the book.php program to give us data for page 100. How do we get it to do this? Here is where the data section of the URL, “page_number=100”, comes in (the ? before it marks where the data section begins). This sets the input parameter “page_number” to 100 so that it can be sent to book.php. The program presumably takes this input variable “page_number” and internally uses it to gather the data associated with page 100 and send it back to you.

In our simple example, the only parameter we set via our GET request was page_number. In many cases you might want to set additional parameters. This is done by appending the additional parameters and their values to the end of the URL, separated by ampersands (&). Let’s say that the program book.php has an additional option to specify what text color you want the returned page to have, and this option is named text_color. To get page 100 with the color red, we would rewrite the URL as http://phonebookserver.com/book.php?page_number=100&text_color=red.
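To make the breakdown concrete, here is a minimal sketch that takes the phonebook URL apart with Python’s standard urllib.parse module. The server and its parameters are the made-up ones from the example above, so nothing is actually contacted.

```python
from urllib.parse import urlsplit, parse_qs

url = "http://phonebookserver.com/book.php?page_number=100&text_color=red"
parts = urlsplit(url)

print(parts.scheme)   # the protocol
print(parts.netloc)   # the web name of the server
print(parts.path)     # the program on the server
print(parts.query)    # the data section after the "?"

# parse_qs turns the data section into a dictionary; each value is a list
# because a parameter name may appear more than once in a URL.
params = parse_qs(parts.query)
print(params)
```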

POST requests

Now let’s move on to explaining POST requests, which are conceptually similar to GET requests but formatted slightly differently. As we saw with the above example, in a GET request all the critical information for routing and data is included in the text of the URL. This works well if you have a small amount of data you wish to send. However, imagine you want to send an image or a list of a thousand UniProt ids to a server. If you formatted this according to the GET request format outlined above, you would end up generating an absurdly long URL with several thousand characters. This goes against what the GET request type was designed for, and in practice many browsers and servers cap URL length (a limit of around 2,000 characters is commonly cited), so very long URLs may be rejected. Going back to our letter analogy, a GET request would be equivalent to writing everything on the envelope. This works well if you want to say one or two things, but it would be ridiculous to write an entire letter on it.

This is where POST requests come in. They have no comparable length restriction and allow you to send much larger amounts of data by placing it in the body of the message rather than in the URL. Additionally, whereas data in a URL is restricted to ASCII characters (anything else has to be percent-encoded), the body of a POST request can carry binary data, which allows a larger variety of content such as images to be sent.
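The difference in where the data travels can be sketched with the standard library’s urllib.request.Request, which only builds a request object and sends nothing. The server, the page_numbers parameter, and the list of 1000 page numbers (standing in for a long list such as UniProt ids) are all hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical server and parameter names, continuing the phonebook analogy.
endpoint = "http://phonebookserver.com/book.php"

# GET: the data travels inside the URL itself.
get_request = Request(endpoint + "?" + urlencode({"page_number": 100}))

# POST: the URL only names the program; the data travels in the message body.
pages = ",".join(str(n) for n in range(1, 1001))
body = urlencode({"page_numbers": pages}).encode("ascii")
post_request = Request(endpoint, data=body, method="POST")

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url)
print(len(post_request.data), "bytes in the POST body")
```

The POST URL stays short no matter how long the list grows, because only the body gets bigger.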

Using Python to request UniProt data

So now that we’ve covered the two basic types of request, you might be wondering how to construct them. While specifics differ between languages such as Python, Java, C++, etc., they all have packages or libraries that you can include in your code that enable requests to be generated, configured, and sent off to a server. This tutorial will use Python because of its prevalence in bioinformatics. There are several modules that could be used within Python, but we will be sticking with the requests library because of its ease of use. You can find information about this library and how to use it here: https://docs.python-requests.org/en/latest/.

First make sure that your local Python environment has the requests module installed. Without it, you won’t be able to use it in your code, and Python will raise an error along the lines of “ModuleNotFoundError: No module named 'requests'”. If you see this type of message, install the module via pip (for example, pip install requests) or another Python package manager.
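If you would rather check before running anything else, a small sketch like the following reports whether a module can be found in the current environment. It relies only on the standard library; is_installed is a helper name invented for this example.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None


# Prints True when requests is available in this environment, False otherwise.
print(is_installed("requests"))
```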

First Python GET request

Once you have the requests library installed, we can move on to a starter example with UniProt. Go to your browser (Chrome, etc.), open the main UniProt page, and search for MCM (minichromosome maintenance proteins). This will bring up a table of results for proteins that have the term “MCM” somewhere in their annotations. In the Entry column you will see that each row has a short string like “P33993” that designates the UniProt id for that entry. Clicking it will bring you to the main page for that entry. In this case, clicking “P33993” leads you to the page at URL https://www.uniprot.org/uniprot/P33993, which has all the information about the human protein MCM7. Now let’s say you want to do the same thing (get information about MCM7), but programmatically in Python. We can do this in 3 lines of code.

import requests
response = requests.get('https://www.uniprot.org/uniprot/P33993')
print(response.text)

Let’s walk through the example top to bottom. The first line imports the module requests so that it is available to the rest of the program. Importing a module loads the code in that module and makes it available to your program, so that you can use what has already been written there. In our case this lets us use the requests.get function with the URL 'https://www.uniprot.org/uniprot/P33993' provided as a single string argument (take note of the quotes that tell Python this is a string). As implied by its name, this function performs a GET request to the desired URL and then returns the response sent back by the server. In the above example, the output of the requests.get function is assigned to the response variable. This response variable is an object that contains many attributes (sub-variables) that we can access to get useful information out. To access the actual response text that was sent back by the server, we read the text attribute of the response object by writing response.text. We then print response.text to the console so we can view what the server sent back after we sent it our GET request. Examining the console, you’ll see a really long blob of text that begins with something like this:

<!DOCTYPE html SYSTEM "about:legacy-compat">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"><head><title>MCM7 - DNA replication licensing factor MCM7 - Homo sapiens (Human) - MCM7 gene &amp; protein</title><meta content="IE=edge" http-equiv="X-UA-Compatible"/><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"/><meta content="width=device-width, initial-scale=1" name="viewport"/>

What is this strange output that was spit out by Python? This is actually the HTML of the webpage that you looked at before in your browser when you clicked the link 'https://www.uniprot.org/uniprot/P33993'. However, rather than seeing the webpage that the browser constructs for you, here you’re seeing the raw HTML code that the server sends back, which the browser invisibly and almost instantaneously converts into a user-friendly visual website. In other words, you have just performed the programmatic equivalent of clicking the link and loading the page. You’ve just completed the first of many Python GET requests!

Let’s take all of this to the next level. Presumably when you clicked on the MCM7 link you did so because you wanted to get some information about the protein, such as its amino acid sequence. In the browser you could just scroll to the sequence section and copy the sequence of a particular isoform. With what we just did, it’s not as simple. The sequence information is contained within the response text we printed; it’s just buried in the text and not easily accessible. One possibility would be to use Python string manipulation functions to search for the HTML section that contains the sequence so we could then extract it. While this can be done, it’s tedious and wasteful. If we really do only want the sequence of MCM7, it doesn’t make sense to load in all that extra information about the protein. It’s more work for the server, it uses more memory, adds unnecessary network traffic, and is harder for us to process in the end because we have to filter through complicated response text.
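To give a feel for what digging a value out of HTML involves, here is a minimal sketch that pulls only the page title out of a shortened copy of the response shown earlier, using the standard library’s html.parser. TitleParser is a name invented for this example, and finding the sequence in the real page would take considerably more code than this.

```python
from html.parser import HTMLParser

# The start of the response shown earlier, shortened to the <title> element.
html = (
    '<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"><head>'
    "<title>MCM7 - DNA replication licensing factor MCM7 - Homo sapiens (Human)"
    " - MCM7 gene &amp; protein</title></head></html>"
)


class TitleParser(HTMLParser):
    """Collect the text found between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so append rather than assign.
        if self.in_title:
            self.title += data


parser = TitleParser()
parser.feed(html)
parser.close()
print(parser.title)
```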

UniProt REST API

This is where UniProt’s REST application programming interface (API) comes in. The intelligent folks at UniProt recognize that ambitious individuals such as yourself might want to access their data in a targeted fashion. So they have made an interface where you can programmatically perform requests to get only the data you need from their databases. To get started with this API, let’s look at their documentation at https://www.uniprot.org/help/api_queries. This page describes how we should format our requests/queries to get back the data that we want.

Let’s continue and see how we would request just the sequence of MCM7’s primary isoform. First observe that the endpoint URL of the program is https://www.uniprot.org/uniprot/, meaning that any requests we send will have to be sent to the uniprot file/program that lives at www.uniprot.org. This special program listens for network requests such as the one we are about to send and returns an output based on the parameters we provide. We want information about human MCM7, which has the UniProt id P33993. This is a unique identifier that has been assigned by UniProt and refers only to this protein. MCM7 in another species such as mouse will have a different id, as it is a different protein. To get information about this specific protein, we must provide the uniprot program with this id. We do this by adding “query=id:P33993” to the URL. This sets the variable called “query” to “id:P33993”. The uniprot program will receive this input and search its database for any protein entries with this id. In our case this should return only one result (human MCM7).

Alright, so this is enough to tell the program what entry we want, but not what we want it to do with that entry. Each entry is associated with many fields (data points) such as the name of the gene that encodes it, known mutations, functions, the length of its sequence, the actual sequence, etc. To specify that we only want the actual amino acid sequence sent back to us, we include an additional “columns” parameter in our GET request and set it to sequence by adding “&columns=sequence”. If we wanted them to send us more information about MCM7, we could include additional columns such as id, comment(FUNCTION), etc., as outlined in the column names document at https://www.uniprot.org/help/uniprotkb_column_names.

Lastly, we must specify the format we want for the text that the server returns to us. There are a number of formats available, including tab, txt, fasta, html, etc. The format you pick will depend on what you plan to do downstream of getting the response. In this case, “tab” is the easiest to deal with, so that’s what we’ll use for the example. All together, our GET request URL now looks like this: https://www.uniprot.org/uniprot/?query=id:P33993&columns=sequence&format=tab. The beauty of GET requests is that because everything is in the URL, this text alone fully specifies our request, and we can copy and paste it into the browser address bar to get a preview of what the server will send us. It should look like a white page with the following text:

Sequence
MALKDYALEKEKVKKFLQEFYQDDELGKKQFKYGNQLVRLAHREQVALYVDLDDVAEDDPELVDSICENARRYAKLFADAVQELLPQYKEREVVNKDVLDVYIEHRLMMEQRSRDPGMVRSPQNQYPAELMRRFELYFQGPSSNKPRVIREVRADSVGKLVTVRGIVTRVSEVKPKMVVATYTCDQCGAETYQPIQSPTFMPLIMCPSQECQTNRSGGRLYLQTRGSRFIKFQEMKMQEHSDQVPVGNIPRSITVLVEGENTRIAQPGDHVSVTGIFLPILRTGFRQVVQGLLSETYLEAHRIVKMNKSEDDESGAGELTREELRQIAEEDFYEKLAASIAPEIYGHEDVKKALLLLLVGGVDQSPRGMKIRGNINICLMGDPGVAKSQLLSYIDRLAPRSQYTTGRGSSGVGLTAAVLRDSVSGELTLEGGALVLADQGVCCIDEFDKMAEADRTAIHEVMEQQTISIAKAGILTTLNARCSILAAANPAYGRYNPRRSLEQNIQLPAALLSRFDLLWLIQDRPDRDNDLRLAQHITYVHQHSRQPPSQFEPLDMKLMRRYIAMCREKQPMVPESLADYITAAYVEMRREAWASKDATYTSARTLLAILRLSTALARLRMVDVVEKEDVNEAIRLMEMSKDSLLGDKGQTARTQRPADVIFATVRELVSGGRSVRFSEAEQRCVSRGFTPAQFQAALDEYEELNVWQVNASRTRITFV

Great success! Now let’s get this working in Python.

import requests
response = requests.get('https://www.uniprot.org/uniprot/?query=id:P33993&columns=sequence&format=tab')
print(response.text)

Again, we first have to import requests to use it. In the next line we provide the GET request URL we have built to the requests.get function. This function then executes the actual network request and returns the response, which is placed in the response variable we’ve provided as a container. Just like before, we want to see what the server has sent us, so we print out the response by using the print command on response.text. The following text should pop up in the Python console:

Sequence
MALKDYALEKEKVKKFLQEFYQDDELGKKQFKYGNQLVRLAHREQVALYVDLDDVAEDDPELVDSICENARRYAKLFADAVQELLPQYKEREVVNKDVLDVYIEHRLMMEQRSRDPGMVRSPQNQYPAELMRRFELYFQGPSSNKPRVIREVRADSVGKLVTVRGIVTRVSEVKPKMVVATYTCDQCGAETYQPIQSPTFMPLIMCPSQECQTNRSGGRLYLQTRGSRFIKFQEMKMQEHSDQVPVGNIPRSITVLVEGENTRIAQPGDHVSVTGIFLPILRTGFRQVVQGLLSETYLEAHRIVKMNKSEDDESGAGELTREELRQIAEEDFYEKLAASIAPEIYGHEDVKKALLLLLVGGVDQSPRGMKIRGNINICLMGDPGVAKSQLLSYIDRLAPRSQYTTGRGSSGVGLTAAVLRDSVSGELTLEGGALVLADQGVCCIDEFDKMAEADRTAIHEVMEQQTISIAKAGILTTLNARCSILAAANPAYGRYNPRRSLEQNIQLPAALLSRFDLLWLIQDRPDRDNDLRLAQHITYVHQHSRQPPSQFEPLDMKLMRRYIAMCREKQPMVPESLADYITAAYVEMRREAWASKDATYTSARTLLAILRLSTALARLRMVDVVEKEDVNEAIRLMEMSKDSLLGDKGQTARTQRPADVIFATVRELVSGGRSVRFSEAEQRCVSRGFTPAQFQAALDEYEELNVWQVNASRTRITFV

Greater success! The fun doesn’t stop there. Building a URL manually as we did is a good exercise, but there comes a point where this is again tedious and repetitive. It turns out the requests library allows us to specify URL parameters in a different way, which is more concise and lets us avoid having to remember pesky URL formatting rules such as putting ampersands (&) between our inputs. Take a look at the code below:

import requests
payload = {'query': 'id:P33993', 'columns': 'sequence', 'format':'tab'}
r = requests.get('https://www.uniprot.org/uniprot', params=payload)
print(r.text)

This accomplishes the exact same thing but in a slightly different way. Rather than writing our parameters/inputs into the URL string directly, we separate them out into a new variable named “payload”. Payload is a dictionary (a collection of key-value pairs) where the 3 keys (query, columns, and format) represent the names of our GET request variables and the 3 values ('id:P33993', 'sequence', 'tab') are what we want to set them to.

Next, we again execute the requests.get function, except we use it in a slightly different way. We provide the address of the uniprot program as the first input and then we set the second input, “params”, equal to payload. Even though they appear separate here, internally the requests.get function will combine them into a single URL before contacting the server. Running this will produce the same output as before: the amino acid sequence of MCM7.
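The combining step can be sketched with the standard library’s urllib.parse.urlencode, used here as a stand-in for what requests does internally (an assumption for illustration, not a look at its source). The sketch rebuilds the hand-written URL from the same three parameters.

```python
from urllib.parse import urlencode

base_url = "https://www.uniprot.org/uniprot/"
payload = {"query": "id:P33993", "columns": "sequence", "format": "tab"}

# safe=":" leaves the colon in "id:P33993" as typed; by default urlencode
# writes it as %3A, which decodes back to the same character.
url = base_url + "?" + urlencode(payload, safe=":")
print(url)
```

When using requests itself, printing r.url after the call shows the exact URL that was sent, which is a handy way to confirm that the payload was combined as expected.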

More advanced queries

Alright, now let’s get real fancy. Let’s say we want to know more about all MCM proteins, not just MCM7. Additionally, let’s say we don’t want to go through the trouble of first identifying the UniProt id for every MCM manually via the website so that we can then perform our query as we did before. How can we do this? Well, in the same way we searched the UniProt website manually to find MCM7, we can do a search via the REST API to get all entries that relate to the term MCM in humans. Let’s construct the query to see how this works.

Alright, so we want all human MCM proteins. We will communicate this to UniProt via the query and fil (short for filter) parameters that we send in our GET request.

So far our payload for the GET request looks like this:

payload = {'query': 'name:mcm'}

Ok, but that’s not enough to specify what we want. Many proteins in the UniProt database probably have the string mcm in their name in some way, and most of them don’t interest us. So now we have to provide more information in our payload to narrow down the results for mcm. Let’s filter for only human results by adding Homo sapiens as the desired organism.

payload = {'query': 'name:mcm', 'fil': 'organism:Homo+sapiens' }

UniProt also has a lot of entries that have been computationally predicted but not reviewed by a person. Let’s add another filter to remove non-reviewed entries.

payload = {'query': 'name:mcm', 'fil': 'organism:Homo+sapiens+AND+reviewed:yes' }

Note the way you must construct the fil parameter for UniProt to understand it: the name of the filter (organism), followed by a colon, followed by the value for the filter (Homo+sapiens, with + in place of a space). Multiple filters are joined with +AND+.
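How the + signs travel depends on who does the encoding. In a URL typed by hand, + is the form-encoded spelling of a space. A library that encodes parameter values for you may instead treat a + in your string as a literal plus sign. The sketch below uses the standard library’s urlencode to show the difference; whether requests and the UniProt server handle the payload the same way is an assumption worth checking, for example by printing r.url after a request.

```python
from urllib.parse import urlencode, parse_qs

with_plus = urlencode({"fil": "organism:Homo+sapiens+AND+reviewed:yes"})
with_space = urlencode({"fil": "organism:Homo sapiens AND reviewed:yes"})

# A literal "+" is written as %2B, while a space is written as "+".
print(with_plus)
print(with_space)

# Decoding shows what a server would read back from each query string.
print(parse_qs(with_plus)["fil"][0])
print(parse_qs(with_space)["fil"][0])
```

If the filter ever seems to be ignored, trying the version with plain spaces in the dictionary is a reasonable first experiment.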

Ok, so now with this payload we should be retrieving a reasonable list of human proteins that contain some text in their name related to the term MCM. Now let’s further limit the results we want to get back from UniProt.

Internally, whenever a search is performed, UniProt scores how well an entry relates to a query term, and this value is stored in a score field. We can take advantage of this score and use it to sort the results so that higher scoring, “better” matches come first. We do this by telling UniProt to sort by score (default is high to low).

payload = {'query': 'name:mcm', 'fil': 'organism:Homo+sapiens+AND+reviewed:yes', 'sort': 'score'}

Alright, so the payload is getting a little thiccc now, but let’s add a few more things. Even if we sort by score, we are still likely to get too many results. Let’s limit the number of results to a reasonable amount like 20, so that we essentially get the top 20 hits for our search term MCM.

payload = {'query': 'name:mcm', 'fil': 'organism:Homo+sapiens+AND+reviewed:yes', 'sort': 'score', 'limit': '20'}

Now it’s time to focus on telling UniProt how we want it to format the results it sends back to us. The columns we want returned are id, entry name, and sequence, and as before we want them to be tab formatted. The updated payload:

payload = {'query': 'name:mcm', 'fil': 'organism:Homo+sapiens+AND+reviewed:yes', 'sort':'score', 'limit':'20', 'columns': 'id,entry+name,sequence', 'format':'tab'}

Let’s put this all together with the actual request call.

import requests
payload = {'query': 'name:mcm', 'fil': 'organism:Homo+sapiens+AND+reviewed:yes', 'sort':'score', 'limit':'20', 'columns': 'id,entry+name,sequence', 'format':'tab'}
r = requests.get('https://www.uniprot.org/uniprot', params=payload)
print(r.text)

Running this code should result in the following output:

[Figure: console output of the UniProt query]

Look at that! We never touched the UniProt site and we managed to get sequences relating to human MCM proteins (and other related proteins).
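Since the payload was built up one piece at a time, it can be handy to wrap those steps in a function so that the search term, organism, and limit are easy to change. This is just one possible sketch; build_payload and its argument names are made up for this tutorial, and it reproduces the parameter names and the + convention used above as-is.

```python
def build_payload(term, organism=None, reviewed_only=False, limit=None,
                  columns=("id", "entry+name", "sequence"), fmt="tab"):
    """Assemble the GET parameters used in this tutorial (hypothetical helper)."""
    payload = {"query": "name:" + term}

    # Collect the individual filters, then join them with +AND+.
    filters = []
    if organism:
        filters.append("organism:" + organism.replace(" ", "+"))
    if reviewed_only:
        filters.append("reviewed:yes")
    if filters:
        payload["fil"] = "+AND+".join(filters)

    payload["sort"] = "score"
    if limit is not None:
        payload["limit"] = str(limit)
    payload["columns"] = ",".join(columns)
    payload["format"] = fmt
    return payload


payload = build_payload("mcm", organism="Homo sapiens", reviewed_only=True, limit=20)
print(payload)
```

The resulting dictionary can be passed to requests.get through params exactly like the hand-written one.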

Putting it all together

As a final cherry on top of this whole enterprise, let’s actually use the sequence data we just retrieved. We might be interested in a number of things, but as an example let’s focus on a simple question: how many methionines does each of these proteins contain? To do this we will take the text of the response, which is a string in tab-separated value format: it contains lines (each ending with the newline character '\n') whose columns are separated by the tab character '\t'. Once the text is split into these elements, we can iterate through every line (each representing one retrieved entry), fetch its sequence, and then use the count function to count how many times the character M, representing the amino acid methionine, occurs in the sequence. We’ll then print the result of this operation to the console via print. See the code below:

import requests

# prepare the payload with our desired parameters for our search
payload = {'query': 'name:mcm', 'fil': 'organism:Homo+sapiens+AND+reviewed:yes', 'sort':'score', 'limit':'20', 'columns': 'id,entry+name,sequence', 'format':'tab'}

# perform the actual request to the uniprot server
r = requests.get('https://www.uniprot.org/uniprot', params=payload)

# get the response as text (string)
results = r.text
# split the string into lines (each line ends with the newline character '\n')
# and drop the first line because it contains the headers (id, entry name, sequence)
lines = results.splitlines()[1:]

# loop through each line containing the entries we requested
for line in lines:
    # skip blank lines (such as a trailing empty line), which have no fields to read
    if not line.strip():
        continue

    # each line is a string which we can split by the tab character '\t' to get the individual values
    fields = line.split("\t")
    # first column = entry at index 0 of the fields list returned by the split function
    uniprot_id = fields[0]
    # second column = entry at index 1 of the fields list returned by the split function
    entry_id = fields[1]
    # third column = entry at index 2 of the fields list returned by the split function
    sequence = fields[2]

    # use the python string count function to count how many M (methionines) appear in the sequence string
    methionine_count = sequence.count("M")

    # print the entry name and the count (an integer, which we must first convert to a string before joining)
    print(entry_id + ":" + str(methionine_count))

which produces this result:

[Figure: console output listing the number of methionines per protein]
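As a variation on the same idea, Python’s standard csv module can do the line and tab splitting for us and lets us refer to columns by name. The sketch below runs on a small hand-made sample rather than a live response: the header names are assumed, and the MCM7 sequence is cut down to its first 29 residues, so the count is for that fragment only. count_residue is a helper name invented here.

```python
import csv
import io

# Shortened, hand-made sample in the same tab-separated layout; the column
# headers and the truncated sequence are assumptions for illustration only.
sample = (
    "Entry\tEntry name\tSequence\n"
    "P33993\tMCM7_HUMAN\tMALKDYALEKEKVKKFLQEFYQDDELGKK\n"
    "\n"
)


def count_residue(tsv_text, residue="M"):
    """Map each entry name to how often `residue` occurs in its sequence."""
    # DictReader uses the first line as column names and skips blank lines.
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["Entry name"]: row["Sequence"].count(residue) for row in reader}


counts = count_residue(sample)
print(counts)
```

With a real response, count_residue(r.text) would work the same way as long as the header names match what the server sends back.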

YOU DID IT! Wow. I know that was a lot but good for you for sticking with it. The key is patience and just trying things out. Try playing with the code above, break it, fix it, and just see what happens.

Thanks for reading and I hope you were able to take something away from this tutorial!