main_hero_img
posted: January 20, 2022 edited: January 20, 2022

Find all possible kmers in python

pythondnasequencesrecursion

Setup

Modern DNA synthesis techniques allow for the construction of vast libraries of different DNA sequences that can be used in many high throughput applications. As an example, say you just found that a particular transcription factor binds to a promoter with a defined sequence that is 8 nucleotides long. The natural binding affinity of the factor for the sequence is good, but it’s possible that it could be even better. How could we find the optimal binding sequence for this protein? Ideally, a rational approach would allow us to look at a known structure of the protein bound to DNA to determine what interactions are responsible for binding. However, structures aren’t always available and even then it may be difficult to predict exactly which sequence best interacts with the DNA binding domain of your protein. Hmmmm tricky. Wouldn’t it be nice if we could just try all possible sequences and just see which works best? It’s brute force, but it requires very few previous assumptions and we don’t risk missing the actual best sequence.

So how could we actually do this? One potential option might be to express the purified transcription factor (with a tag added on for purification) and incubate it with a pool of oligonucleotides that contain all possible 8nt long binding sites. Then after letting them bind, we can use beads to pull the protein out of the solution along with any oligonucleotides that it bound. We can the digest away the protein to release the oligonucleotides and perform sequencing to see what oligonucleotides were pulled down. It stands to reason that the sequence that was pulled down most often (will show up most in the sequencing data) has the highest affinity for the transcription factor.  

Woo! Seems like the start of such a great plan! If it works…. Before we get to this part though we have a potential issue. We need to tell the nucleic acid synthesis company what sequences we want them to make for us. Assume that they require you to send them a list of every sequence, in our case that will be all possible variants for your 8nt sequences. You could type it out, but you don’t have that kind of time or patience. Code to the rescue! Let’s explore how we can use python to write a small script that will generate every possible 8 nucleotide combination.

Starting to code

Start off by defining an array that contains the 4 possible nucleotides T,C,G,A which will be the building blocks of our sequence.

nucleotides = ['A', 'T', 'C', 'G']

Next lets define the function which contain the code. Let’s name the function generate_all_sequences. It will take a single input value: an integer that represents the desired length of every sequence we output.

def generate_all_sequences(n):

We can now jump inside the function definition and start getting into the meat of the problem. We want to systemically generate every possible sequence. In the simple case of a 2 nucleotide sequence how would this work? One way would be to nest two for loops, an outer one to iterate through the 4 possibilities for the first base and an inner one to iterate through 4 possibilities for the second base. This would produce all 4^2 = 16 sequences. The code for that would be:

nucleotides = ['A', 'T', 'C', 'G']

sequences = []

for firstbase in nucleotides:

    for secondbase in nucleotides:

        sequences.append(firstbase + secondbase)

Walking through the script top to bottom, we first create the empty list sequences. We then have two loops that use the for in operator to go through all the 4 possible nucleotides. Then in the innermost loop we combine the string firstbase with the string secondbase to create the 2 nucleotide sequence that we append onto the end of sequences.  

What would it would take to extend the current examples so that it can handle the 3 nucleotide sequence case? We could add another nested for loop for a total of 3, but this highlights a problem with the approach we’ve taken here. For every extra nucleotide we want to include we have to nest another for loop and at a certain number this would become unfeasible. We need a better way. Let step back for a second and examine the problem through a different lens.

Say we already had a list of all possible sequences n nucleotides long and we wanted to extend that to all possible (n + 1) sequences. How would that work? Well the n + 1 set can be made by adding an A, C, G, or T to every sequence in n. That’s easily done with two nested for loops. The beauty of this logic is that it’s always true regardless of the length of the sequences and therefore this two for loop approach will always work. We no longer have to contemplate the horrors of having to write some sort of 7 level nested for loop to get to our solution.

Recursion

However, we have a second problem. How do we get the n nucleotides long sequence set in the first place? It’d be fantastic if we already had a function defined that could give us all possible sequences of length n. Hmmm, sounds an awful lot like the function we’re trying to write! This is where we can make use of recursion: a function or process that references itself. Ok lets write some code inside generate_all_sequences to see how something like this might look.

Here’s the code that outlines the procedure we described before:

    sub_sequences = generate_all_sequences(n - 1)
    
    sequences = []

    for sequence in sub_sequences:

        for nucleotide in nucleotides:

                 sequences.append(nucleotide + sequence)
    
    return sequences

First we get sub_sequences a list of all possible sequence of n – 1 length from our function generate_all_sequences. We then iterate through all the sub_sequences and for each of these sequences, we add on a nucleotide at the front. At the end,sequences will be a new list that contains all possible n long nucleotide sequences. It might seem weird or intimidating to see a function call itself, but don’t worry this an actual language feature and is used quite often.

Let’s walk through what happens when we call a function that references itself. As an example imagine we called generate_all_sequences(5). Inside that call, we again call generate_all_sequences, but this time with 4 because we feed it n - 1. Inside generate_all_sequences(4) the same thing happens and generate_all_sequences(3) gets called and so on.

Notice though that as its written now there’s nothing to ever stop this processing from continuing until n is -infinity (other than your computer from crying tears of pain and stopping the program). So we need to define a stopping condition to terminate this otherwise endless string of function calls. Hmm looking at this, how do we know when to stop? Well whatever n we start with, we know that at some point we’ll get reaches a state in the call tree when generate_all_sequences(1) is called. This is an interesting case because we don’t need to call generate_all_sequences(0) to know what generate_all_sequences(1) should be. What are all possible sequences of length 1? It’s just our list of nucleotides!! Amazing. There we have it, a terminating condition. All we have to do is say that if n equals 1, we just return a list of the 4 nucleotides. No more calling generate_all_sequences and thus the endless function calls are brought to an end.

Final code

This brings us to the final code:

nucleotides = ['A', 'T', 'C', 'G']

def generate_all_sequences(n):

    if n < 1:
        return []

    if n == 1:
        return nucleotides

    sub_sequences = generate_all_sequences(n - 1)
    
    sequences = []

    for sequence in sub_sequences:

        for nucleotide in nucleotides:

                 sequences.append(nucleotide + sequence)
    
    return sequences

Now as a sanity check that our function works we can do a quick test. We know that for a given sequence of length n, there are 4^n total possible sequences. So the list we return from generate_all_sequences must therefore have 4^n entries. Indeed if we run a quick test:

print(len(generate_all_sequences(5)));

We get a length of 1024 which equals the expected 4^5.

For a final check, lets make sure each of the 1024 sequences are truly unique. Here we can leverage pythons built in set() function to convert our list into a set. By definition sets cannot have duplicate members so any duplicates in the list we provide to the set will be trimmed in this process. If all members are unique, the number of items in the set should again be 1024.

If we run the following code:

print(len(set(generate_all_sequences(5))));

We see that again, we 1024! Grreeeat success!