main_hero_img
posted: November 16, 2021 edited: November 17, 2021

Finding the reverse complement in python

pythonsequencingDNAbeginner

Background

It is standard practice that unless otherwise stated, a nucleotide sequence is written left to right beginning with its 5’ (five-prime) phosphate end and then ending with its 3’ (three-prime) hydroxyl end where 5’ and 3’ refer to the position of these groups on the 5-sided sugar ring (ribose or deoxyribose) of the backbone. When working with single stranded oligonucleotides that’s the end of the story, but often double stranded DNA or RNA species are utilized. In those cases, there are two different single stranded sequences hydrogen bonded to each other in the same molecule. Stating only one is enough to define the molecule completely as each base always pairs with exactly one other corresponding base (A to [T or U], C to G). It would be redundant to write both so only one is typically written.

nucleotide orientation diagram

However, often it is useful to interconvert between these two such that when given the sequence of one of the single stranded sides you can get the sequence of the complementary strand that is bound to it. This other complementary sequence is known as the reverse complement. At first glance the problem would seem very straightforward, if we have a sequence AACA that is double stranded, we know that each A is bound by a T, and every C is bound by a G so therefore the sequence of the opposite strand is just TTGT. However, there is an additional complication. Double stranded nucleic acids always consist of two strands that run anti-parallel. This means that if one is oriented 5’ to 3’, the complementary strand will be oriented in the reverse 3’ to 5’ direction and therefore every end of these double stranded molecules will have one side that is 5’ and one that is 3’.

parallel vs antiparallel in nucleotides

Let’s backtrack to our example. We initially said that the opposite strand should TTGT, but remember that this implies that we mean 5’ TTGT 3’. Effectively we said that 5’ AACA 3’ is bound to 5’ TTGT 3’, but is it? No. If both strands ran in the same direction this would be perfectly fine, but since they don’t this is false.

How can we fix this? Let’s go through the procedure again, but slightly differently. We know that if the top strand is written with its 5’ end to the left the bottom one will have its 3’ end on the left. Now working left to right we can say again, A binds to T, G, binds to C and we end up with: 3’ TTGT 5’. Now there’s one final step. Because of convention when we write a single strand, we always write it 5’ to 3’ so if we were to report this sequence we would reverse 3’ TTGT 5’ to make 5’ TGTT 3’ which we could abbreviate to TGTT. That is the correctly written out sequence of the top strand’s reverse complement. This is relatively straightforward when you’re dealing with only 4 nucleotides but quickly becomes tedious with longer sequences. Fortunately, this is something that we can implement pretty easily in python (or any language).

finding the reverse complement

Coding it in python

Let’s now try to write a script that accepts in a nucleotide (DNA only) sequence and outputs the reverse complement. A DNA sequence can very easily be represented by a series of characters(like we've been doing so far). So let's take in the DNA sequence as a string and work with that in our code. Alright time to fire up a text editor or a code editor such as PyCharm or VSCode.  

First we need a way to get the string (DNA sequence) from the user so we can work with it. Let's use use the input function which is is built-into python and prompts a user for text and returns the response when they press enter.

Here's the code that asks for the nucleotide sequence and places it in the variable named input_sequence.

input_sequence = input("Enter your nucleotide sequence: ")

Now that we have a string from the user we need to process it. Let’s begin by going through the procedure we just did, but this time in code. We want a way to work our way backwards (3’ to 5’) on the provided sequence, find the base that should pair with the one we find, and add that to the reverse complement which we will build up left to right (5’ to 3’). Ok so for now we just have input sequence which will contain a string of user input. To go along it character by character we can use a loop construct which will let us go character by character from the end to the beginning. To access characters in python we can use square brackets ‘[]’ and index into the string like so: 'me so happy”[3] which is ‘s’. So really what we need is a number that down one by one from the last character’s index to the first character’s index (0).

string indexing

How will we do this? Using the “for in“ construct provided by python:

for index in range(len(input_sequence) - 1, 0, -1):

This does several things at once, lets break it down and work through an example where we imagine that the user has entered “AACA”. Python’s bulilt-in len() function first finds the length of a string (in our case the input sequence) so it will be 4. And then we subtract 1 from it yielding 3 so the above code will become evaluated and become the below code (behind the scenes in python).

for index in range(3, 0, -1):

Now, we are passing, 3, 0, -1 into python’s built-in range function. The range function takes in a starting number (3), an ending number (0), and a step number (-1). It then returns a list of numbers that is generated based on those inputs. In our case we are asking it to generate a list of numbers that starts at 3, counts down by 1, and ends when it reaches 0. This will be [3, 2, 1, 0]. If you substitute that in for the range function call you get this, which is functionally equivalent to the above code.

for index in [3, 2, 1, 0]:

At this point we’re at the “for in” construct. This construct tells python to loop through a list from start to finish and assign each member in the list to a variable after the keyword for (index). So in our case the first time in the loop, index will be 3, the next time 2, and so on until 0. Great! So now we have a variable that will start at the last character’s index and then work its way down till it reaches 0 (the first index). But we don’t want an index we want the nucleotide base. How do we get it? By indexing into the string using our index variable. Here’s the code:

for index in range(len(input_sequence) - 1, 0, -1):

    base = input_sequence[index]

This will loop 4 times and each time it will put the character from input sequence at index into the variable base. It’s time to use base for something useful. We want to convert it into it’s complement: the base that it binds to. How do we do this? There are many options, but we’ll be using one of python’s built-in data structures called a dictionary. A dictionary lets you store a series of name value pairs so that you can later look up a value by its name. How does this help? Well in our case, for each base we encounter in our input we want to look up the base that it should bind to. Let’s create a dictionary called pairs and put in the rules we already know A binds to T, and C binds to G. Here it is:

pairs = {
    "A":"T",
    "C":"G",
    "G":"C",
    "T":"A",
}

Let’s break it down, there are 4 key value pairs here: “A”: “T” is one such pair where the key “A” is associated with the value “T”. This means that if we ask pairs for that value associated with “A” we’ll get out “T”. Notice here, we have to define 4 rules even though we only stated 2. This is because dictionaries are not 2-way, you can only look up by the keys not the values. In our statement, A binds to T implies that T binds to A. but we need to explicitly define this relationship in our dictionary by adding two key-value pairs: “A”: “T and “T” :”A” so that when we see an “A” we know it binds with “T” and if we see a “T” we know it binds with “A.

Great! Now if we type pairs[“A”] we’ll get “T” which is what we want. Let’s combine this with our loop to convert the input sequence into the complement like so:

pairs = {
    "A":"T",
    "C":"G",
    "G":"C",
    "T":"A",
}

input_sequence = input("Enter your nucleotide sequence: ")
for index in range(len(input_sequence) - 1, 0, -1):

    base = input_sequence[index]
    complement = pairs[base]

There’s just one problem we’re converting each base as we go along, but we’re not storing our output anywhere so it’s lost each time. Let’s change that and add a new variable called “reverse_complement” where we’ll add the corresponding bases to. Now here all together we have this:

The result

pairs = {
    "A":"T",
    "C":"G",
    "G":"C",
    "T":"A",
}

input_sequence = input("Enter your nucleotide sequence: ")
reverse_complement = ""

for index in range(len(input_sequence) - 1, 0, -1):

    base = input_sequence[index]
    complement = pairs[base]
    reverse_complement += complement

print('>: ' + reverse_complement)

First we assign  reverse_complement to a blank string since we have no data for it yet. Then inside the loop we use the += (plus-equals) operator to add the complement base one at a time each time we go through the loop. The += operator a shorthand. It just takes the value that the string already has an adds a new character at the end. We could also have written this as reverse_complement = reverse_complement + complement, but += is shorter and looks nicer. There we have it! Save the script and make sure it has the .py extension so other pograms understand that it is meant to be read as a python file.  

Let’s actually use to do some good ol’ convertin. Run the program in your editor and you should see something like this: 

prompt empty

Then enter in a sequence: 

prompt after entering sequence

There you go you did it! Congrats! Much thrill such excitement. Have fun converting all the sequences now.