Greedy Motif Search

Having spent some time trying to grasp the underlying concept of the Greedy Motif Search problem in chapter 3 of Bioinformatics Algorithms (Part 1) I hoped to cement my understanding and perhaps even make life a little easier for others by attempting to explain the algorithm step by step below.

I will try to provide an overview of the algorithm as well as addressing each section of the pseudo code provided in the course:

Pseudo Code

   GREEDYMOTIFSEARCH(Dna, k, t)
        BestMotifs ← motif matrix formed by first k-mers in each string
                      from Dna
        for each k-mer Motif in the first string from Dna
            Motif1Motif
            for i = 2 to t
                form Profile from motifs Motif1, …, Motifi - 1
                MotifiProfile-most probable k-mer in the i-th string
                          in Dna
            Motifs ← (Motif1, …, Motift)
            if Score(Motifs) < Score(BestMotifs)
                BestMotifsMotifs
        output BestMotifs

I haven’t gone into any detail for creating profiles from k-mers, finding profile-most probable k-mers or scoring as I found these topics were relatively clear in the course text.  Furthermore I haven’t posted my code for obvious reasons.


Overview

The basic idea of the greedy motif search algorithm is to find the set of motifs across a number of DNA sequences that match each other  most closely. To do this we:

  • Run through each possible k-mer in our first dna string (illustrated for the first three 3-mers in orange below),
  • Identify the best matches for this initial k-mer within each of the following dna strings (using our profile-most probable function, illustrated below in green) thus creating a set of  motifs at each step
  • Score each set of motifs to find and return the best scoring set.

For example the first three iterations in the main loop if k = 3 (k being the length of our k-mers, or in this case 3-mers) will give us the following motif sets and scores:

GGCGTTCAGGCA 
AAGAATCAGTCA 
CAAGGAGTTCGC 
CACGTCAATCAC 
CAATAATATTCG
Score = 7
GGCGTTCAGGCA 
AAGAATCAGTCA 
CAAGGAGTTCGC 
CACGTCAATCAC 
CAATAATATTCG
Score = 5
GGCGTTCAGGCA 
AAGAATCAGTCA 
CAAGGAGTTCGC 
CACGTCAATCAC 
CAATAATATTCG
Score = 4

The best motif set in the first three iterations, as illustrated above, is therefore the third which scores 4, the best overall score happens at the seventh iteration:

GGCGTTCAGGCA 
AAGAATCAGTCA 
CAAGGAGTTCGC 
CACGTCAATCAC 
CAATAATATTCG
Score = 2

The Code

BestMotifs ← motif matrix formed by first k-mers in each string from Dna

First we initialise best_motifs as a list featuring the first k-mer from each dna string.

 GGCGTTCAGGCA
 AAGAATCAGTCA
 CAAGGAGTTCGC
 CACGTCAATCAC
 CAATAATATTCG

So using our example DNA list above we get:

best_motifs = [‘GGC‘, ‘AAG‘, ‘CAA‘, ‘CAC‘, ‘CAA‘]

This list will be used to compare against other motif lists that we will build later in the algorithm.


for each k-mer Motif in the first string from Dna

Motif1 ← Motif

In our outer loop we will be iterating through each k-mer in the first dna string (in this case ‘GGCGTTCAGGCA‘ ).

At the beginning of each of these iterations we will first initialise a new list called motif and then add the respective k-mer as its first entry

First iteration:

GGCGTTCAGGCA

So on the first iterations the motif list will be initialised as motif= [‘GGC‘]

Second-iteration

GGCGTTCAGGCA

On the second iteration the motif list will be initialised as motif = [‘GCG‘]

And so on…

Within each iteration of the outer loop we will be repeating the following instructions within our ‘inner loop’ below:


for i = 2 to t

form Profile from motifs Motif1, …, Motifi – 1

MotifiProfile-most probable k-mer in the i-th string in Dna

Motifs ← (Motif1, …, Motift)

Our inner loop effectively iterates through the second dna string the final dna string (or rather dna[1] to dna[t]). At each step we will add the most likely candidate k-mer from the current dna string to our current motif list.

As an example we will run through the first iteration of our main loop for which we initialised our motif list as [‘GGC‘] above:

 GGCGTTCAGGCA
 AAGAATCAGTCA
 CAAGGAGTTCGC
 CACGTCAATCAC
 CAATAATATTCG

To tie things together I have tried to colour code the current k-mer from the first dna string as orange, the current dna string as blue and the best matches from previously analysed dna strings as green.

Please note that we iterate through dna[1] to dna[t] within the inner loop which excludes dna[0]. dna[0] is used to create the first item in our motif list upon each iteration of our outer loop as described above. This ‘seed’ motif is then compared to each of the other dna strings. If we included dna[0] in our inner loop we would be comparing it to itself at each iteration.

Create a profile from all motifs in our list

We only have one item in our motif list when we first start our loop which results in the profile:

G G C
A 0 0 0
C 0 0 1.0
G 1.0 1.0 0
T 0 0 0

And so our first profile might be represented in a form similar to:

profile = [(0.0, 0.0, 1.0, 0.0), (0.0, 0.0, 1.0, 0.0), (0.0, 1.0, 0.0, 0.0)]

Find Profile-most probable motif in dna[i] using the current profile

We can now run our profile-most probable function passing the profile above and dna[i] as arguments. In the first iteration of our inner loop we will be using the second dna string;  ‘AAGAATCAGTCA‘ and the profile; [(0.0, 0.0, 1.0, 0.0), (0.0, 0.0, 1.0, 0.0), (0.0, 1.0, 0.0, 0.0)] which returns the motif  AAG

Add the resulting motif to motif list

We now add the resulting k-mer to our motif list:

motif= [‘GGC‘, ‘AAG‘]

Rinse and repeat using the next dna string and the updated motif list:

 GGCGTTCAGGCA
 AAGAATCAGTCA
 CAAGGAGTTCGC
 CACGTCAATCAC
 CAATAATATTCG

Again we create a profile from all motifs in our list

G G C
A A G
A 0.5 0.5 0
C 0 0 0.5
G 0.5 0.5 0.5
T 0 0 0

Which we might represent in a form similar to:

profile =[(0.5, 0.0, 0.5, 0.0), (0.5, 0.0, 0.5, 0.0), (0.0, 0.5, 0.5, 0.0)]

And again we find the Profile-most probable motif in dna[i] using the current profile

Using dna[i];  ‘CAAGGAGTTCGC‘ and the profile; [(0.5, 0.0, 0.5, 0.0), (0.5, 0.0, 0.5, 0.0), (0.0, 0.5, 0.5, 0.0)] which returns the motif  AAG which results in the updated motif list:

motifs = [‘GGC‘, ‘AAG‘, ‘AAG‘]

And so on…

We carry on this way until we reach the last dna string, dna[t]. In this case our motif list will look something like this:

motif = [‘GGC‘, ‘AAG‘, ‘AAG‘, ‘CAC‘, ‘CAA‘]

GGCGTTCAGGCA 
AAGAATCAGTCA 
CAAGGAGTTCGC 
CACGTCAATCAC 
CAATAATATTCG

ifScore(Motifs) < Score(BestMotifs) BestMotifsMotifs outputBestMotifs

Finally we need to compare our current motif list with our best motif list to see which has the lower score. In the case above motif [‘GGC‘, ‘AAG‘, ‘AAG‘, ‘CAC‘, ‘CAA‘] scores 7 compared to best_motif which scores 6. As such best_motif is not updated on this iteration.

I have duplicated the results for the first three iterations below now that we have some additional context.

GGCGTTCAGGCA 
AAGAATCAGTCA 
CAAGGAGTTCGC 
CACGTCAATCAC 
CAATAATATTCG
Score = 7
GGCGTTCAGGCA 
AAGAATCAGTCA 
CAAGGAGTTCGC 
CACGTCAATCAC 
CAATAATATTCG
Score = 5
GGCGTTCAGGCA 
AAGAATCAGTCA 
CAAGGAGTTCGC 
CACGTCAATCAC 
CAATAATATTCG
Score = 4

This entire process is then repeated with the next k-mer in dna[0] and so on until you have reached the last k-mer in the first dna string.

In each case the motif list is emptied and initialised with the current k-mer from the first dna string. As such the last motif list would be initialised as motif = ”GCA”.


Finishing off

Hopefully the above details will give you a good idea how to implement the algorithm.

If you do have any trouble it might be worth attempting to code the inner loop first to recreate the first motif set (starting with ‘GGC‘). Once you have that in hand it should be fairly straight forward to iterate through the remaining k-mers and keep track of the best scoring motif set.

And I think that covers everything… I really hope that this has been helpful!   Please let me know if I have missed anything (literally or figuratively) as this is as much an education for myself as anybody else.

Edit: A few people have kindly asked if they can donate a tiny bit towards hosting costs and so I have tentatively added a donate (via paypal) button below. Thanks 🙂

79 thoughts on “Greedy Motif Search

  1. You can’t imagine how helpful this was for me. Turns out I just needed to fix one thing- but I’m not sure how much longer it would’ve taken me to fix that if I hadn’t walked through it with your guidance.

    Much obliged

  2. I concur with Adam, this is a great walkthrough, the pseudocode is a little opaque as written (at least to me), and I have now found two small mistakes I had made, one related to this algorithm, and one to the profile most probable, without this page showing the interim data I would still be struggling, so thank you so much for your guidance, I’m sure there was more information on the fora, but I found your link there and this was just what I needed, a second mind to help me out.

  3. Very helpful, thank you. I think you have an error, though. When demonstrating the scoring (orange and green) in the beginning, you always use AAG on the second line, but I’m pretty sure GGC would be followed by GTC, GCG would be followed by AAG, and CGT would be followed by AGT on the second line.

    • Nevermind. The reason it select AAG is because all the probabilities in the second string are 0, ie none of them exactly match the first motif.

        • I thought that in the class he talked about adding a small value, like one, to everything in order to make more similar strings appear even when they aren’t perfect matches.

  4. “Finally we need to compare our current motif list with our best motif list to see which has the lower score. In the case above motif [‘GGC‘, ‘AAG‘, ‘AAG‘, ‘CAC‘, ‘CAA‘] scores 7 compared to best_motif which scores 6. As such best_motif is not updated on this iteration.”

    I understand your description until the phrase ‘scores 7’.
    How did you determine that score?

  5. A very heartfelt thanks for clearing this mess up the best way possible. Quite unfortunate the instructors did not do such a marvelous job with my being ready to drop the course as a result.

    • Kudos to the Instructors all the same! A big thanks to MrGraeme for complimenting the Instructors! The duo did an excellent job.

  6. On the second iteration the motif list will be initialised as motif = [‘GGC‘]

    Is there a typo ?

    Should it rather be :-
    On the second iteration the motif list will be initialised as motif = [‘GCG‘]

  7. I’m struggling with how you identify AAG as the best match in string 2 (AAGAATCAGTCA) during the first iteration (i.e. with kmer GGC in string 1). If I start with a profile of (0010)(0010)(0100) for GGC, then I won’t choose AAG as the best match, since not a single base matches. If instead I make a profile matrix using my best motifs, which have been arbitrarily initialized to the first three letters of each DNA string, then I think that, for example, the third kmer in the second string, GAA, should fit the profile better than AAG. My profile matrix from the initial best motifs is:
    (0.2 0.6 0.2 0)(0.8 0 0.2 0)(0.4 0.4 0.2 0)
    So if I compare AAG and GAA, in the first position they’re both 0.2, so equally likely. In the second position they’re the A, so equally likely. But in the third position, G has 0.4 and A has 0.2, so GAA is more likely than AAG.

    Up to this point, I’ve gotten everything correct in the code challenges and quizzes, but this one has really stumped me. Any help would be much appreciated.

    Thanks.

    Eric

    • I figured out my error – AAG is the most likely match to GGC in the first iteration because, although it has probability 0, all the others do too (when you multiply probabilities of the individual bases), so the first one is chosen.

      Thanks so very much, Mr. Graeme, for your step by step explanation of this problem. I definitely wouldn’t have solved it without your explanation (and barely did even with this site!).

      Best wishes,

      Eric

  8. This was really helpful. I’m not sure I would have been able to solve this with just the description in the text. Thanks for the clear explanation.

  9. There’s a mispelling in:
    best_motifs = [‘GCC‘, ‘AAG‘, ‘CAA‘, ‘CAC‘, ‘CAA‘]
    since the first kmer from first string is “GGC” (as stated in other part of the text).

  10. This post was my main guide to solve this problem, definitely without this useful guide I was unable to overcome !
    Thanks a lot.

  11. This is very helpful Thanks.
    However there is still a minor problem here:
    Second-iteration

    GGCGTTCAGGCA
    On the second iteration the motif list will be initialised as motif = [‘GGC‘]
    SHOULD BE
    On the second iteration the motif list will be initialised as motif = [‘GCG‘]

  12. WOW. I commented over on Stepic, but this was SO useful to me. I was feeling pretty dumb at not being able to get it, but your walkthrough made it clear. Thanks so much for taking the time to write this!

  13. Thank you so much for the explanation. As a biochemist with little programming experience I am playing catch-up on reading and applying pseudo code.

    One question: when the first inner loop is complete and the motif list is emptied and initialised with the current k-mer from the first dna string, does the profile also reset to match the current kmer? (for instance, at initialization “GCA” in the second iteration, would it start out with [(0, 0, 1, 0), (0, 1, 0,0),(1,0,0,0)] rather can carrying forward the last profile at the end of the previous iteration?

    Again, thank you.

  14. It seems I am stuck on the simple stuff. How are you arriving at the scores? Aren’t these the sums of Hamming distance between the first piece (in orange) to the subsequent most probable? I would appreciate someone going through the tabulation step by step. Thanks.

    • To get the scores, first get the consensus of the array strings from motifs (and then separately from best_motifs).

      So if the motifs are the following:
      GTTCAGGCA
      AATCAGTCA
      CGAGTTCGC
      GTCAATCAC
      TAATATTCG
      Score = 7

      The consensus string (most common) is AAG and the score is the number of differences from that string. You get AAG from checking each Character from the 5 kmers. So position 0 has G, A, A, C, C [A most common] and position 1 has G, A, A, A, A [A most common] and position 2 has C, G, G, C, A [G most common] … AAG.

      GGC – AAG: 3 differences
      AAG – AAG: 0 differences
      AAG – AAG: 0 differences
      CAC – AAG: 2 differences
      CAA – AAG: 2 differences

      7 differences total

      • It modified my formatting on the first listing. Let me try again.

        So if the motifs are the following:
        -GGC-GTTCAGGCA
        -AAG-AATCAGTCA
        C-AAG-GAGTTCGC
        -CAC-GTCAATCAC
        -CAA-TAATATTCG
        Score = 7

  15. Sorry I haven’t replied to any questions recently! I have just started PhD. and along with looking after my very new son I haven’t had a chance to refresh myself on everything to answer suitably.

    If any one else understands the problem clearly from the post it would be fantastic if you could answer any questions that come up either here or with a link to the stepic discussions page if you want more fame!

    Cheers, Graeme

  16. Thank you, friend. It’s really very clear, understandable post. If there were not you, I’m afraid I’ll never understand this algorithm, at least so good.

  17. Hello ! I´m stucked.
    I don´t understand how is it possible that given: “dna string; ‘AAGAATCAGTCA‘ and the profile; [(0.0, 0.0, 1.0, 0.0), (0.0, 0.0, 1.0, 0.0), (0.0, 1.0, 0.0, 0.0)]” it will return the motif AAG”.
    I think I´m loosing something here. How could that profile return AAG ?

    Thanks for your response!

  18. Thanks for your help!

    as soon as I catch up with the learning Ill make sure to keep an eye out for any articles you publish : D

  19. Thank you. I kept going through the text and videos and I could not figure out what ‘t’ was. If it’s passed in as a variable, it must be important, but what was it? You actually mentioned t in the pseudocode and description.

    So, much thank yous.

    Your explanation is beautiful.

  20. Awesome post you made. Great resource to explain that algorithm. A lot of people got stuck there, and this for sure helped a lot!

  21. for the third iteration, there are two modes (‘AAG’ and ‘AAT’) both appearing two times. However using ‘AAT’ gives you a lower score, so this should be indicated in the notes maybe? I am still figuring out how to do this. Maybe I will just go through doing hamming distance by doing all of the unique(strings) in the list and take lowest score which will be fast for the example but not when the list is much longer.

  22. В ваших обзорах содержится большое
    количество нужной информации. Я в любом случае добавлю ваш ресурс в избранное, чтобы
    заходить на него ежедневно!!!
    Я убежден, что таким образом буду узнавать много нового и интересного !

    Спасибо огромное за следующие посты,
    и удачи!

    • * I meant to say, I have no idea how you are scoring your matrix in the first place, where does the score 7 come from in your fist example?

  23. thank you for the very clear explanation, I now understand how to the algorithm works,
    I just have one question, how does the scoring work? Why does the first kmer get a score of 7, and the second a score of 5?

  24. Жаль, что вы не принимаете пожертвования…
    Я бы с
    удовольствием внес определенную сумму на счет этого
    замечательного веб ресурса!!! Я убежден, что добавлю ваш канал
    RSS в свою учетную запись Google … Я с нетерпением жду
    появления новых постов, и отслеживаю их
    в
    моей группе в Facebook! Вскоре будем
    обсуждать!

  25. Under the section:

    “Find Profile-most probable motif in dna[i] using the current profile”

    we have the k-mer “GGC” and apply its profile to the string:

    “AAGAATCAGTCA”

    which is shown to yield AAG as our best match.

    Why is AAG the best match? It matches only 0.0’s with the profile matrix from GGC. Something like “AGA” or “GAA” would at least have one character correctly overlap.

    Correction: I just went through the code myself to verify why AAG is selected.

    What actually happens is that our profile comparison is derived from the Pr function. The Pr function multiplies all the probabilities of each character’s match in the profile matrix together. If any character in the k-mer is matched to a 0.0 in the profile matrix even a single time, the other probabilities get multiplied by this 0.0, and the total probability returned is 0. This means AAG, where all 3 characters match to 0.0, returns a total probability of zero; AGA and GAA, which have one character match to 1.0, nevertheless has its 2 other characters matched to 0.0, resulting in the same total probability of zero. In other words, AAG, AGA, and GAA, are all weighted as the same, a zero, even though AGA and GAA in fact more closely match the input k-mer of GGC. The script only sees zeroes, and so it takes the first string by default: AAG.

    The *only way* to return a non-zero probability, or a “match”, is if all 3 characters match the profile matrix. Then there are no 0.0’s being multiplied into the total probability, and we end up in a non-zero total probability. This is why the later dna strings input will find exactly matching k-mers. For example, the next iteration also finds AAG. It finds AAG because the previous iteration returned AAG, which added this combination of characters to the profile matrix, so the next AAG will exactly match it.

    I don’t know if it’s better this way to throw away partial matches. It reduces false positives, but probably converges more slowly since we are just going to be returning the first few characters in the string until one matches perfectly. Then again, that is our goal, to find perfect matches. The main flaw is that if our seed string doesn’t contain exactly the optimal pattern in nature, then we’ll never find the optimal pattern due to the following strings being doomed to return a zero or low probability when matched against the profile matrix built-up from a seed that doesn’t contain the correct k-mer. For example, if our seed string, dna[0], doesn’t contain the correct pattern, but the next string, dna[1], does contain it, then we will never see this in our script. The script will match dna[1] to the profile matrix from dna[0] and simply return the first few characters or a different false pattern. The correct pattern would actually have a probability of 0.0 here.

    Thanks for the informative post with the time put in for diagrams. It helped me understand the flawed angle the homework set had adopted.

    • Thanks for updating your comment to describe your line of thinking! Too be honest I haven’t looked at the problem for a while and so it was helpful for me too!

    • @John Blaise, thank you so much. You cannot believe the amount of time I spent scratching my head over this. Truly glad I came across your comment.
      Also, thank you too, MrGraeme, for this detailed explanation.

  26. Indeed very well explained . Had it not been for this explanation , I would have taken days to understand this .Thanks a lot .

  27. This should be on Stepik text instead. They have a habit of making things complicated unnecessarily.
    Thanks a lot man!

    • Thanks Suyash!
      I’m glad you found this helpful! I really like Stepik’s content though in this case I think the use of colour in the explaination makes it easier to follow.
      Cheers. Graeme

      • Yeah! Stepit challenges you which is the essence of learning. Determination is key in learning. Some of the questions took me days to solve and when solved I was better off at next challenges. Even MrGraeme alluded to this indirectly from his posting. Thanks Stepit for producing the likes of MrGraeme. Thanks to MrGraeme for the brilliant and excellent explanation!

  28. This was very helpful indeed. One question : If we were to change sequence of dna strings (as an example move GGCGTTCAGGCA to end from first), our set of solution motif will be completely different – am I correct ?

  29. Hi! Since nobody asked this in the comments, i feel like i may be the only one who is missing the point but would still highly appreciate if somebody could clear the doubt..
    How are we writing the code for the profile? Since till now in the course we are being provided with the profile of probabilities every time, so we haven’t had to write a code for the building the profile on our own..
    I am very new to programming and would appreciate if somebody could help!

  30. Hi! Since nobody asked this in the comments, i feel like i may be the only one who is missing the point but would still highly appreciate if somebody could clear the doubt..
    How are we writing the code for the profile? Since till now in the course we are being provided with the profile of probabilities every time, so we haven’t had to write a code for the building the profile on our own..
    I am very new to programming and would appreciate if somebody could help

Leave a Reply to Anders Cancel reply

Your email address will not be published. Required fields are marked *