Greedy Motif Search

Having spent some time trying to grasp the underlying concept of the Greedy Motif Search problem in chapter 3 of Bioinformatics Algorithms (Part 1) I hoped to cement my understanding and perhaps even make life a little easier for others by attempting to explain the algorithm step by step below.

I will try to provide an overview of the algorithm as well as addressing each section of the pseudo code provided in the course:

Pseudo Code

   GREEDYMOTIFSEARCH(Dna, k, t)
        BestMotifs ← motif matrix formed by first k-mers in each string
                      from Dna
        for each k-mer Motif in the first string from Dna
            Motif1Motif
            for i = 2 to t
                form Profile from motifs Motif1, …, Motifi - 1
                MotifiProfile-most probable k-mer in the i-th string
                          in Dna
            Motifs ← (Motif1, …, Motift)
            if Score(Motifs) < Score(BestMotifs)
                BestMotifsMotifs
        output BestMotifs

I haven’t gone into any detail for creating profiles from k-mers, finding profile-most probable k-mers or scoring as I found these topics were relatively clear in the course text.  Furthermore I haven’t posted my code for obvious reasons.


Overview

The basic idea of the greedy motif search algorithm is to find the set of motifs across a number of DNA sequences that match each other  most closely. To do this we:

  • Run through each possible k-mer in our first dna string (illustrated for the first three 3-mers in orange below),
  • Identify the best matches for this initial k-mer within each of the following dna strings (using our profile-most probable function, illustrated below in green) thus creating a set of  motifs at each step
  • Score each set of motifs to find and return the best scoring set.

For example the first three iterations in the main loop if k = 3 (k being the length of our k-mers, or in this case 3-mers) will give us the following motif sets and scores:

GGCGTTCAGGCA 
AAGAATCAGTCA 
CAAGGAGTTCGC 
CACGTCAATCAC 
CAATAATATTCG
Score = 7
GGCGTTCAGGCA 
AAGAATCAGTCA 
CAAGGAGTTCGC 
CACGTCAATCAC 
CAATAATATTCG
Score = 5
GGCGTTCAGGCA 
AAGAATCAGTCA 
CAAGGAGTTCGC 
CACGTCAATCAC 
CAATAATATTCG
Score = 4

The best motif set in the first three iterations, as illustrated above, is therefore the third which scores 4, the best overall score happens at the seventh iteration:

GGCGTTCAGGCA 
AAGAATCAGTCA 
CAAGGAGTTCGC 
CACGTCAATCAC 
CAATAATATTCG
Score = 2

The Code

BestMotifs ← motif matrix formed by first k-mers in each string from Dna

First we initialise best_motifs as a list featuring the first k-mer from each dna string.

 GGCGTTCAGGCA
 AAGAATCAGTCA
 CAAGGAGTTCGC
 CACGTCAATCAC
 CAATAATATTCG

So using our example DNA list above we get:

best_motifs = [‘GGC‘, ‘AAG‘, ‘CAA‘, ‘CAC‘, ‘CAA‘]

This list will be used to compare against other motif lists that we will build later in the algorithm.


for each k-mer Motif in the first string from Dna

Motif1 ← Motif

In our outer loop we will be iterating through each k-mer in the first dna string (in this case ‘GGCGTTCAGGCA‘ ).

At the beginning of each of these iterations we will first initialise a new list called motif and then add the respective k-mer as its first entry

First iteration:

GGCGTTCAGGCA

So on the first iterations the motif list will be initialised as motif= [‘GGC‘]

Second-iteration

GGCGTTCAGGCA

On the second iteration the motif list will be initialised as motif = [‘GCG‘]

And so on…

Within each iteration of the outer loop we will be repeating the following instructions within our ‘inner loop’ below:


for i = 2 to t

form Profile from motifs Motif1, …, Motifi – 1

MotifiProfile-most probable k-mer in the i-th string in Dna

Motifs ← (Motif1, …, Motift)

Our inner loop effectively iterates through the second dna string the final dna string (or rather dna[1] to dna[t]). At each step we will add the most likely candidate k-mer from the current dna string to our current motif list.

As an example we will run through the first iteration of our main loop for which we initialised our motif list as [‘GGC‘] above:

 GGCGTTCAGGCA
 AAGAATCAGTCA
 CAAGGAGTTCGC
 CACGTCAATCAC
 CAATAATATTCG

To tie things together I have tried to colour code the current k-mer from the first dna string as orange, the current dna string as blue and the best matches from previously analysed dna strings as green.

Please note that we iterate through dna[1] to dna[t] within the inner loop which excludes dna[0]. dna[0] is used to create the first item in our motif list upon each iteration of our outer loop as described above. This ‘seed’ motif is then compared to each of the other dna strings. If we included dna[0] in our inner loop we would be comparing it to itself at each iteration.

Create a profile from all motifs in our list

We only have one item in our motif list when we first start our loop which results in the profile:

G G C
A 0 0 0
C 0 0 1.0
G 1.0 1.0 0
T 0 0 0

And so our first profile might be represented in a form similar to:

profile = [(0.0, 0.0, 1.0, 0.0), (0.0, 0.0, 1.0, 0.0), (0.0, 1.0, 0.0, 0.0)]

Find Profile-most probable motif in dna[i] using the current profile

We can now run our profile-most probable function passing the profile above and dna[i] as arguments. In the first iteration of our inner loop we will be using the second dna string;  ‘AAGAATCAGTCA‘ and the profile; [(0.0, 0.0, 1.0, 0.0), (0.0, 0.0, 1.0, 0.0), (0.0, 1.0, 0.0, 0.0)] which returns the motif  AAG

Add the resulting motif to motif list

We now add the resulting k-mer to our motif list:

motif= [‘GGC‘, ‘AAG‘]

Rinse and repeat using the next dna string and the updated motif list:

 GGCGTTCAGGCA
 AAGAATCAGTCA
 CAAGGAGTTCGC
 CACGTCAATCAC
 CAATAATATTCG

Again we create a profile from all motifs in our list

G G C
A A G
A 0.5 0.5 0
C 0 0 0.5
G 0.5 0.5 0.5
T 0 0 0

Which we might represent in a form similar to:

profile =[(0.5, 0.0, 0.5, 0.0), (0.5, 0.0, 0.5, 0.0), (0.0, 0.5, 0.5, 0.0)]

And again we find the Profile-most probable motif in dna[i] using the current profile

Using dna[i];  ‘CAAGGAGTTCGC‘ and the profile; [(0.5, 0.0, 0.5, 0.0), (0.5, 0.0, 0.5, 0.0), (0.0, 0.5, 0.5, 0.0)] which returns the motif  AAG which results in the updated motif list:

motifs = [‘GGC‘, ‘AAG‘, ‘AAG‘]

And so on…

We carry on this way until we reach the last dna string, dna[t]. In this case our motif list will look something like this:

motif = [‘GGC‘, ‘AAG‘, ‘AAG‘, ‘CAC‘, ‘CAA‘]

GGCGTTCAGGCA 
AAGAATCAGTCA 
CAAGGAGTTCGC 
CACGTCAATCAC 
CAATAATATTCG

ifScore(Motifs) < Score(BestMotifs) BestMotifsMotifs outputBestMotifs

Finally we need to compare our current motif list with our best motif list to see which has the lower score. In the case above motif [‘GGC‘, ‘AAG‘, ‘AAG‘, ‘CAC‘, ‘CAA‘] scores 7 compared to best_motif which scores 6. As such best_motif is not updated on this iteration.

I have duplicated the results for the first three iterations below now that we have some additional context.

GGCGTTCAGGCA 
AAGAATCAGTCA 
CAAGGAGTTCGC 
CACGTCAATCAC 
CAATAATATTCG
Score = 7
GGCGTTCAGGCA 
AAGAATCAGTCA 
CAAGGAGTTCGC 
CACGTCAATCAC 
CAATAATATTCG
Score = 5
GGCGTTCAGGCA 
AAGAATCAGTCA 
CAAGGAGTTCGC 
CACGTCAATCAC 
CAATAATATTCG
Score = 4

This entire process is then repeated with the next k-mer in dna[0] and so on until you have reached the last k-mer in the first dna string.

In each case the motif list is emptied and initialised with the current k-mer from the first dna string. As such the last motif list would be initialised as motif = ”GCA”.


Finishing off

Hopefully the above details will give you a good idea how to implement the algorithm.

If you do have any trouble it might be worth attempting to code the inner loop first to recreate the first motif set (starting with ‘GGC‘). Once you have that in hand it should be fairly straight forward to iterate through the remaining k-mers and keep track of the best scoring motif set.

 

And I think that covers everything… I really hope that this has been helpful!   Please let me know if I have missed anything (literally or figuratively) as this is as much an education for myself as anybody else.

 

47 thoughts on “Greedy Motif Search

  1. You can’t imagine how helpful this was for me. Turns out I just needed to fix one thing- but I’m not sure how much longer it would’ve taken me to fix that if I hadn’t walked through it with your guidance.

    Much obliged

  2. I concur with Adam, this is a great walkthrough, the pseudocode is a little opaque as written (at least to me), and I have now found two small mistakes I had made, one related to this algorithm, and one to the profile most probable, without this page showing the interim data I would still be struggling, so thank you so much for your guidance, I’m sure there was more information on the fora, but I found your link there and this was just what I needed, a second mind to help me out.

  3. Very helpful, thank you. I think you have an error, though. When demonstrating the scoring (orange and green) in the beginning, you always use AAG on the second line, but I’m pretty sure GGC would be followed by GTC, GCG would be followed by AAG, and CGT would be followed by AGT on the second line.

    • Nevermind. The reason it select AAG is because all the probabilities in the second string are 0, ie none of them exactly match the first motif.

        • I thought that in the class he talked about adding a small value, like one, to everything in order to make more similar strings appear even when they aren’t perfect matches.

  4. “Finally we need to compare our current motif list with our best motif list to see which has the lower score. In the case above motif [‘GGC‘, ‘AAG‘, ‘AAG‘, ‘CAC‘, ‘CAA‘] scores 7 compared to best_motif which scores 6. As such best_motif is not updated on this iteration.”

    I understand your description until the phrase ‘scores 7’.
    How did you determine that score?

  5. A very heartfelt thanks for clearing this mess up the best way possible. Quite unfortunate the instructors did not do such a marvelous job with my being ready to drop the course as a result.

  6. On the second iteration the motif list will be initialised as motif = [‘GGC‘]

    Is there a typo ?

    Should it rather be :-
    On the second iteration the motif list will be initialised as motif = [‘GCG‘]

  7. I’m struggling with how you identify AAG as the best match in string 2 (AAGAATCAGTCA) during the first iteration (i.e. with kmer GGC in string 1). If I start with a profile of (0010)(0010)(0100) for GGC, then I won’t choose AAG as the best match, since not a single base matches. If instead I make a profile matrix using my best motifs, which have been arbitrarily initialized to the first three letters of each DNA string, then I think that, for example, the third kmer in the second string, GAA, should fit the profile better than AAG. My profile matrix from the initial best motifs is:
    (0.2 0.6 0.2 0)(0.8 0 0.2 0)(0.4 0.4 0.2 0)
    So if I compare AAG and GAA, in the first position they’re both 0.2, so equally likely. In the second position they’re the A, so equally likely. But in the third position, G has 0.4 and A has 0.2, so GAA is more likely than AAG.

    Up to this point, I’ve gotten everything correct in the code challenges and quizzes, but this one has really stumped me. Any help would be much appreciated.

    Thanks.

    Eric

    • I figured out my error – AAG is the most likely match to GGC in the first iteration because, although it has probability 0, all the others do too (when you multiply probabilities of the individual bases), so the first one is chosen.

      Thanks so very much, Mr. Graeme, for your step by step explanation of this problem. I definitely wouldn’t have solved it without your explanation (and barely did even with this site!).

      Best wishes,

      Eric

  8. This was really helpful. I’m not sure I would have been able to solve this with just the description in the text. Thanks for the clear explanation.

  9. There’s a mispelling in:
    best_motifs = [‘GCC‘, ‘AAG‘, ‘CAA‘, ‘CAC‘, ‘CAA‘]
    since the first kmer from first string is “GGC” (as stated in other part of the text).

  10. This post was my main guide to solve this problem, definitely without this useful guide I was unable to overcome !
    Thanks a lot.

  11. This is very helpful Thanks.
    However there is still a minor problem here:
    Second-iteration

    GGCGTTCAGGCA
    On the second iteration the motif list will be initialised as motif = [‘GGC‘]
    SHOULD BE
    On the second iteration the motif list will be initialised as motif = [‘GCG‘]

  12. WOW. I commented over on Stepic, but this was SO useful to me. I was feeling pretty dumb at not being able to get it, but your walkthrough made it clear. Thanks so much for taking the time to write this!

  13. Thank you so much for the explanation. As a biochemist with little programming experience I am playing catch-up on reading and applying pseudo code.

    One question: when the first inner loop is complete and the motif list is emptied and initialised with the current k-mer from the first dna string, does the profile also reset to match the current kmer? (for instance, at initialization “GCA” in the second iteration, would it start out with [(0, 0, 1, 0), (0, 1, 0,0),(1,0,0,0)] rather can carrying forward the last profile at the end of the previous iteration?

    Again, thank you.

  14. It seems I am stuck on the simple stuff. How are you arriving at the scores? Aren’t these the sums of Hamming distance between the first piece (in orange) to the subsequent most probable? I would appreciate someone going through the tabulation step by step. Thanks.

    • To get the scores, first get the consensus of the array strings from motifs (and then separately from best_motifs).

      So if the motifs are the following:
      GTTCAGGCA
      AATCAGTCA
      CGAGTTCGC
      GTCAATCAC
      TAATATTCG
      Score = 7

      The consensus string (most common) is AAG and the score is the number of differences from that string. You get AAG from checking each Character from the 5 kmers. So position 0 has G, A, A, C, C [A most common] and position 1 has G, A, A, A, A [A most common] and position 2 has C, G, G, C, A [G most common] … AAG.

      GGC – AAG: 3 differences
      AAG – AAG: 0 differences
      AAG – AAG: 0 differences
      CAC – AAG: 2 differences
      CAA – AAG: 2 differences

      7 differences total

      • It modified my formatting on the first listing. Let me try again.

        So if the motifs are the following:
        -GGC-GTTCAGGCA
        -AAG-AATCAGTCA
        C-AAG-GAGTTCGC
        -CAC-GTCAATCAC
        -CAA-TAATATTCG
        Score = 7

  15. Sorry I haven’t replied to any questions recently! I have just started PhD. and along with looking after my very new son I haven’t had a chance to refresh myself on everything to answer suitably.

    If any one else understands the problem clearly from the post it would be fantastic if you could answer any questions that come up either here or with a link to the stepic discussions page if you want more fame!

    Cheers, Graeme

  16. Thank you, friend. It’s really very clear, understandable post. If there were not you, I’m afraid I’ll never understand this algorithm, at least so good.

  17. Hello ! I´m stucked.
    I don´t understand how is it possible that given: “dna string; ‘AAGAATCAGTCA‘ and the profile; [(0.0, 0.0, 1.0, 0.0), (0.0, 0.0, 1.0, 0.0), (0.0, 1.0, 0.0, 0.0)]” it will return the motif AAG”.
    I think I´m loosing something here. How could that profile return AAG ?

    Thanks for your response!

Leave a Reply

Your email address will not be published. Required fields are marked *