Greedy Motif Search

Having spent some time trying to grasp the underlying concept of the Greedy Motif Search problem in chapter 3 of Bioinformatics Algorithms (Part 1) I hoped to cement my understanding and perhaps even make life a little easier for others by attempting to explain the algorithm step by step below.

I will try to provide an overview of the algorithm as well as addressing each section of the pseudo code provided in the course:

Pseudo Code

   GREEDYMOTIFSEARCH(Dna, k, t)
        BestMotifs ← motif matrix formed by first k-mers in each string
                      from Dna
        for each k-mer Motif in the first string from Dna
            Motif1Motif
            for i = 2 to t
                form Profile from motifs Motif1, …, Motifi - 1
                MotifiProfile-most probable k-mer in the i-th string
                          in Dna
            Motifs ← (Motif1, …, Motift)
            if Score(Motifs) < Score(BestMotifs)
                BestMotifsMotifs
        output BestMotifs

I haven’t gone into any detail for creating profiles from k-mers, finding profile-most probable k-mers or scoring as I found these topics were relatively clear in the course text.  Furthermore I haven’t posted my code for obvious reasons.


Overview

The basic idea of the greedy motif search algorithm is to find the set of motifs across a number of DNA sequences that match each other  most closely. To do this we:

  • Run through each possible k-mer in our first dna string (illustrated for the first three 3-mers in orange below),
  • Identify the best matches for this initial k-mer within each of the following dna strings (using our profile-most probable function, illustrated below in green) thus creating a set of  motifs at each step
  • Score each set of motifs to find and return the best scoring set.

For example the first three iterations in the main loop if k = 3 (k being the length of our k-mers, or in this case 3-mers) will give us the following motif sets and scores:

GGCGTTCAGGCA 
AAGAATCAGTCA 
CAAGGAGTTCGC 
CACGTCAATCAC 
CAATAATATTCG
Score = 7
GGCGTTCAGGCA 
AAGAATCAGTCA 
CAAGGAGTTCGC 
CACGTCAATCAC 
CAATAATATTCG
Score = 5
GGCGTTCAGGCA 
AAGAATCAGTCA 
CAAGGAGTTCGC 
CACGTCAATCAC 
CAATAATATTCG
Score = 4

The best motif set in the first three iterations, as illustrated above, is therefore the third which scores 4, the best overall score happens at the seventh iteration:

GGCGTTCAGGCA 
AAGAATCAGTCA 
CAAGGAGTTCGC 
CACGTCAATCAC 
CAATAATATTCG
Score = 2

The Code

BestMotifs ← motif matrix formed by first k-mers in each string from Dna

First we initialise best_motifs as a list featuring the first k-mer from each dna string.

 GGCGTTCAGGCA
 AAGAATCAGTCA
 CAAGGAGTTCGC
 CACGTCAATCAC
 CAATAATATTCG

So using our example DNA list above we get:

best_motifs = [‘GGC‘, ‘AAG‘, ‘CAA‘, ‘CAC‘, ‘CAA‘]

This list will be used to compare against other motif lists that we will build later in the algorithm.


for each k-mer Motif in the first string from Dna

Motif1 ← Motif

In our outer loop we will be iterating through each k-mer in the first dna string (in this case ‘GGCGTTCAGGCA‘ ).

At the beginning of each of these iterations we will first initialise a new list called motif and then add the respective k-mer as its first entry

First iteration:

GGCGTTCAGGCA

So on the first iterations the motif list will be initialised as motif= [‘GGC‘]

Second-iteration

GGCGTTCAGGCA

On the second iteration the motif list will be initialised as motif = [‘GCG‘]

And so on…

Within each iteration of the outer loop we will be repeating the following instructions within our ‘inner loop’ below:


for i = 2 to t

form Profile from motifs Motif1, …, Motifi – 1

MotifiProfile-most probable k-mer in the i-th string in Dna

Motifs ← (Motif1, …, Motift)

Our inner loop effectively iterates through the second dna string the final dna string (or rather dna[1] to dna[t]). At each step we will add the most likely candidate k-mer from the current dna string to our current motif list.

As an example we will run through the first iteration of our main loop for which we initialised our motif list as [‘GGC‘] above:

 GGCGTTCAGGCA
 AAGAATCAGTCA
 CAAGGAGTTCGC
 CACGTCAATCAC
 CAATAATATTCG

To tie things together I have tried to colour code the current k-mer from the first dna string as orange, the current dna string as blue and the best matches from previously analysed dna strings as green.

Please note that we iterate through dna[1] to dna[t] within the inner loop which excludes dna[0]. dna[0] is used to create the first item in our motif list upon each iteration of our outer loop as described above. This ‘seed’ motif is then compared to each of the other dna strings. If we included dna[0] in our inner loop we would be comparing it to itself at each iteration.

Create a profile from all motifs in our list

We only have one item in our motif list when we first start our loop which results in the profile:

G G C
A 0 0 0
C 0 0 1.0
G 1.0 1.0 0
T 0 0 0

And so our first profile might be represented in a form similar to:

profile = [(0.0, 0.0, 1.0, 0.0), (0.0, 0.0, 1.0, 0.0), (0.0, 1.0, 0.0, 0.0)]

Find Profile-most probable motif in dna[i] using the current profile

We can now run our profile-most probable function passing the profile above and dna[i] as arguments. In the first iteration of our inner loop we will be using the second dna string;  ‘AAGAATCAGTCA‘ and the profile; [(0.0, 0.0, 1.0, 0.0), (0.0, 0.0, 1.0, 0.0), (0.0, 1.0, 0.0, 0.0)] which returns the motif  AAG

Add the resulting motif to motif list

We now add the resulting k-mer to our motif list:

motif= [‘GGC‘, ‘AAG‘]

Rinse and repeat using the next dna string and the updated motif list:

 GGCGTTCAGGCA
 AAGAATCAGTCA
 CAAGGAGTTCGC
 CACGTCAATCAC
 CAATAATATTCG

Again we create a profile from all motifs in our list

G G C
A A G
A 0.5 0.5 0
C 0 0 0.5
G 0.5 0.5 0.5
T 0 0 0

Which we might represent in a form similar to:

profile =[(0.5, 0.0, 0.5, 0.0), (0.5, 0.0, 0.5, 0.0), (0.0, 0.5, 0.5, 0.0)]

And again we find the Profile-most probable motif in dna[i] using the current profile

Using dna[i];  ‘CAAGGAGTTCGC‘ and the profile; [(0.5, 0.0, 0.5, 0.0), (0.5, 0.0, 0.5, 0.0), (0.0, 0.5, 0.5, 0.0)] which returns the motif  AAG which results in the updated motif list:

motifs = [‘GGC‘, ‘AAG‘, ‘AAG‘]

And so on…

We carry on this way until we reach the last dna string, dna[t]. In this case our motif list will look something like this:

motif = [‘GGC‘, ‘AAG‘, ‘AAG‘, ‘CAC‘, ‘CAA‘]

GGCGTTCAGGCA 
AAGAATCAGTCA 
CAAGGAGTTCGC 
CACGTCAATCAC 
CAATAATATTCG

ifScore(Motifs) < Score(BestMotifs) BestMotifsMotifs outputBestMotifs

Finally we need to compare our current motif list with our best motif list to see which has the lower score. In the case above motif [‘GGC‘, ‘AAG‘, ‘AAG‘, ‘CAC‘, ‘CAA‘] scores 7 compared to best_motif which scores 6. As such best_motif is not updated on this iteration.

I have duplicated the results for the first three iterations below now that we have some additional context.

GGCGTTCAGGCA 
AAGAATCAGTCA 
CAAGGAGTTCGC 
CACGTCAATCAC 
CAATAATATTCG
Score = 7
GGCGTTCAGGCA 
AAGAATCAGTCA 
CAAGGAGTTCGC 
CACGTCAATCAC 
CAATAATATTCG
Score = 5
GGCGTTCAGGCA 
AAGAATCAGTCA 
CAAGGAGTTCGC 
CACGTCAATCAC 
CAATAATATTCG
Score = 4

This entire process is then repeated with the next k-mer in dna[0] and so on until you have reached the last k-mer in the first dna string.

In each case the motif list is emptied and initialised with the current k-mer from the first dna string. As such the last motif list would be initialised as motif = ”GCA”.


Finishing off

Hopefully the above details will give you a good idea how to implement the algorithm.

If you do have any trouble it might be worth attempting to code the inner loop first to recreate the first motif set (starting with ‘GGC‘). Once you have that in hand it should be fairly straight forward to iterate through the remaining k-mers and keep track of the best scoring motif set.

 

And I think that covers everything… I really hope that this has been helpful!   Please let me know if I have missed anything (literally or figuratively) as this is as much an education for myself as anybody else.