Greedy Algorithm Implementation Walkthrough

In the previous article, I discussed the design decisions that were made in the greedy lottery wheel generator in order to get around the computational problems inherent in even a simple greedy generator for realistically-sized lotteries. In this article, I will walk through the mechanics of generating all tickets and matches, selecting a ticket for the wheel, and updating ticket meta-data in order to allow the next selection. This is in effect a textual walk-through of the generator code that I wrote, which I will post soon.

Generating All Tickets

Recall that to save space, we are storing each ticket as a 64-bit bitset. The simplistic way to generate all tickets in a t-from-r lottery is to run through combinations of t numbers from the set 1 through r, and set the bits in a bitset to correspond to the numbers in each combination. To be honest, I’m not sure how fast or slow this is, because I used a somewhat more complicated (and probably at least a little faster) method, which came out of trying to optimize this step while improving a previous attempt at a wheel generator.

First, I precompute an array that maps every number 0 through 16 to the list of 16-bit numbers with exactly that many bits set. Then, to generate all tickets for a t-from-r lottery, I iterate over all tuples of four numbers between 0 and 16 that add up to exactly t. For each element of the tuple, I take the list of 16-bit numbers with exactly that many bits set. I then bitwise or together every combination of 16-bit numbers, one from each list, such that the number from the tuple’s first element’s list forms the high-order 16 bits of the 64-bit bitfield, the number from the tuple’s second element’s list forms the next-highest-order 16 bits, and so on.

In pseudocode, it looks like this:

for each tuple <a, b, c, d> do
  if ( a + b + c + d = t ) then
    list_a = 16 bit numbers with exactly a bits set
    list_b = 16 bit numbers with exactly b bits set
    list_c = 16 bit numbers with exactly c bits set
    list_d = 16 bit numbers with exactly d bits set

    for each tuple <n_a, n_b, n_c, n_d>, one from each list, do

      ticket = (n_a << 48) | (n_b << 32) | (n_c << 16) | n_d
    done
  endif
done

Again, I’m not sure how much of an improvement this really is. Any improvement would come from reducing the number of or operations for each ticket from t to a constant four, and from replacing t levels of iteration from 1 to r with a constant four outer levels of iteration over per-quarter bit counts from 0 to 16, plus four inner levels of iteration over the corresponding lists of 16-bit numbers.
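
To make this concrete, here is a minimal C++ sketch of the generation step. This is not the actual generator code (which will be posted later); the helper names, the exact loop bounds, and the crude "skip any bitset that uses a ball above r" filter are all just for illustration.

#include <cstdint>
#include <vector>

// Bucket all 16-bit values by how many bits they have set: buckets[k] holds
// every 16-bit number with exactly k bits set.
std::vector<std::vector<uint16_t>> make_popcount_buckets() {
    std::vector<std::vector<uint16_t>> buckets(17);
    for (uint32_t v = 0; v < (1u << 16); ++v) {
        int k = 0;
        for (uint32_t x = v; x != 0; x >>= 1) k += x & 1;  // count set bits
        buckets[k].push_back(static_cast<uint16_t>(v));
    }
    return buckets;
}

// Generate every t-ball ticket of a t-from-r lottery (r <= 64) as a 64-bit
// bitset, by splitting the t set bits across the four 16-bit quarters of the
// bitfield. The order of generation is deterministic, which the finishing
// step at the end of this article relies on.
std::vector<uint64_t> generate_tickets(int t, int r) {
    const auto buckets = make_popcount_buckets();
    const uint64_t valid = (r >= 64) ? ~0ull : ((1ull << r) - 1);  // balls 1..r -> bits 0..r-1
    std::vector<uint64_t> tickets;
    for (int a = 0; a <= 16; ++a)
      for (int b = 0; b <= 16; ++b)
        for (int c = 0; c <= 16; ++c) {
          const int d = t - a - b - c;  // the fourth bit count is forced by the sum
          if (d < 0 || d > 16) continue;
          for (uint16_t na : buckets[a])
            for (uint16_t nb : buckets[b])
              for (uint16_t nc : buckets[c])
                for (uint16_t nd : buckets[d]) {
                  const uint64_t ticket = (uint64_t(na) << 48) | (uint64_t(nb) << 32)
                                        | (uint64_t(nc) << 16) |  uint64_t(nd);
                  if ((ticket & ~valid) == 0) tickets.push_back(ticket);  // only balls 1..r
                }
        }
    return tickets;
}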

Generating the Relationship Graph

Next, we must create the implicit relationship graph associating tickets and matches by determining which matches are in which tickets and recording that information in both ticket nodes and match nodes. The simplest way to do this is to iterate over all matches, since there are many fewer matches than there are tickets.

Since matches have exactly the same form as tickets, except for being composed of fewer balls, we can generate all matches using exactly the same algorithm described above for generating all tickets. The question is then: given a match, how do we know which tickets it is in?

Well, if a ticket contains t balls, and a match contains m balls, then any match plus any combination of t-m non-overlapping balls makes a ticket. Let’s call a combination of t-m balls a “filling combination”, since it “fills in” a match to create a ticket. Given a list of all matches and all filling combinations, we can try combining a given match with every filling combination, and if the match and filling combination do not overlap, then we have found a ticket that contains that match.

Happily, we can also generate all filling combinations using the ticket generation algorithm (again, they are just like smaller tickets). And it is quick to determine whether a match and a filling combination overlap. This is an advantage of storing the matches, tickets, and filling combinations as bitfields. All we must do is bitwise and together the match and the filling combination; they overlap if and only if the result is nonzero.

So, assuming we have computed all matches and all filling combinations using the ticket generation algorithm described above, we can generate the graph edges like this (in pseudocode):

for each match m do
  for each filling combination f do
    if ( m & f = 0 ) then
      matches[match_idx(m)].ticket_nodes.append(ticket_idx(m | f))
      tickets[ticket_idx(m | f)].match_nodes.append(match_idx(m))
    endif
  done
done

Note the idx calls. Remember that here the variables m and f refer to bitfields describing exactly which balls are involved. The idx functions map from the precise ticket and match descriptions to the index in the ticket node and match node arrays that correspond to that particular ticket or match. Once these relationships have been stored, we can forget about the index mappings, and thereby free up a considerable amount of memory, without consequence.
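
As a concrete illustration, the edge-building step might look like the following C++ sketch. The node structs and the match_idx/ticket_idx hash maps here are hypothetical (any bitfield-to-index mapping built while generating the matches and tickets would do), and the maps can be discarded as soon as this function returns.

#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Simplified node types holding just the adjacency lists; the real nodes
// also carry coverage meta-data.
struct MatchNode  { std::vector<size_t> ticket_nodes; };
struct TicketNode { std::vector<size_t> match_nodes;  };

// Record a ticket<->match edge for every non-overlapping (match, filling
// combination) pair. Each such pair ORs together into exactly one ticket.
void build_edges(const std::vector<uint64_t>& matches,
                 const std::vector<uint64_t>& fillings,
                 const std::unordered_map<uint64_t, size_t>& match_idx,
                 const std::unordered_map<uint64_t, size_t>& ticket_idx,
                 std::vector<MatchNode>& match_nodes,
                 std::vector<TicketNode>& ticket_nodes) {
    for (uint64_t m : matches) {
        for (uint64_t f : fillings) {
            if ((m & f) != 0) continue;              // overlapping balls: not a valid filling
            const size_t mi = match_idx.at(m);
            const size_t ti = ticket_idx.at(m | f);  // the ticket this pair forms
            match_nodes[mi].ticket_nodes.push_back(ti);
            ticket_nodes[ti].match_nodes.push_back(mi);
        }
    }
}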

Selecting the Next Ticket for the Wheel

The very first thing we must do is select a ticket to start with. It actually does not matter at all which ticket we start with, since the problem is completely symmetric. As was discussed in the previous article, every ticket node starts with the same coverage potential. Furthermore, since the relationships between tickets and matches are symmetric, and all matches are initially meaningful, the ticket chosen truly makes no difference to the ultimate size of the generated wheel. It only matters insofar as it determines exactly which particular numbers will appear on the wheel. Therefore, it’s perfectly fine to skip any and all computation here and just pick the first ticket in the ticket nodes array, which will probably be something like 1 2 3 4 5 6.

Selecting the next ticket for the wheel is trivial given an accurate value for coverage potential in each ticket node. Since this is a greedy algorithm, we want to select the ticket that has the largest remaining coverage potential. Therefore, all we really need to do is perform a linear scan over all ticket nodes, and find the ticket node with the largest coverage potential. This is an O(n) operation, which is quite fast compared to all of the other computation going on in this generator, so there is little need to spend time optimizing this further.

The only issue is what to do when there are multiple tickets with the same remaining coverage potential. Once we are past the very first selection, it is no longer clear that the choice does not matter. In particular, while any ticket in such a set would cover the same number of other tickets, they would not necessarily cover the same number of other matches. Similarly, some uncovered tickets may have more potential covering tickets than others at this point in the generation, and this could make some tickets in the maximum-coverage set more useful than others.

The question of how to select an optimal ticket from among a set of tickets with the same, maximum remaining coverage potential is itself interesting, and will be addressed in a later article. For now, we will address this issue by simply selecting at random from the set.

As for storing the selected tickets, we can do the simplest possible thing: add the index of the selected ticket node to a set or vector of selected tickets. The index will be turned back into a series of ball numbers at the end.
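
Here is a minimal C++ sketch of the selection step, assuming the coverage potentials are kept in a plain array indexed by ticket node; select_next_ticket and the two-pass tie-breaking here are just one easy way to do it, not the only one.

#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Return the index of a ticket node with the maximum remaining coverage
// potential, chosen uniformly at random among ties. The caller stops adding
// tickets once the maximum coverage potential drops to zero.
size_t select_next_ticket(const std::vector<long>& remaining_coverage,
                          std::mt19937& rng) {
    const long best = *std::max_element(remaining_coverage.begin(),
                                        remaining_coverage.end());
    std::vector<size_t> tied;                  // every ticket achieving the maximum
    for (size_t i = 0; i < remaining_coverage.size(); ++i)
        if (remaining_coverage[i] == best) tied.push_back(i);
    std::uniform_int_distribution<size_t> pick(0, tied.size() - 1);
    return tied[pick(rng)];
}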

Updating Coverage Potentials

And now we come to the meat of the algorithm. The above description of how the ticket selection code works was predicated on the (rather significant) assumption that the coverage potential meta-data stored in the ticket nodes was accurate and up-to-date. While all of the ticket nodes start with the same coverage potential, after a ticket is selected, this is no longer true. Unfortunately, there is no easy, implicit way to know how the coverage potential of each ticket changes when another ticket is selected. We need to compute it.

In some way, we need to take the selected ticket, determine which tickets cover the same tickets that it does (and how many times over), and update the coverage potentials to reflect that. In other words, if ticket A covers currently-uncovered tickets B, C, and E, and ticket F covers B and ticket G covers C and E, then when we select A, we must find F, determine that it shares one coverage with A and reduce its coverage potential by one. We must also find G and determine that it shares two coverages with A and reduce its coverage potential by two.

The naive way to approach this is, of course, to iterate over all tickets, determine how many matches they share with the selected ticket, and then determine how many currently-uncovered tickets correspond to each of those matches (again, by iteration). This is a great way to solve a realistically-sized problem if you have a supercomputer and decades of free time, but not really feasible for the constraints I imposed on myself here.

Instead, we are going to exploit the structural graph that I referred to earlier when discussing how the storage requirements could be reduced by forgetting the actual ball numbers involved in each ticket. If you recall, the graph for the toy example I’ve been using looks like this:

A visualization of 2-match relationships in a 3-from-6 lottery, with intermediate rectangular nodes representing each 2-match. The nodes in this graph have been relabeled sequentially and have lost information about which tickets and matches they correspond to.

We are going to perform a depth-limited depth-first traversal of this graph, whose edges are stored in adjacency lists within the ticket and match nodes. Let’s show what this gives us.

First, consider limiting the depth to two. The search tree that is generated by starting at a selected ticket (ticket 0, for example) and proceeding down two levels from the root looks like this:

A two-level relationship tree starting from selected ticket 0 in a 3-from-6 lottery with 2-matches.

The top level, the root, is the ticket we selected for the wheel. The first level down consists of the square match nodes, which represent the matches contained in this ticket. The second level down consists of the oval ticket nodes, which represent the tickets that also contain these matches. In other words, the nodes on the second level are for the tickets that are covered by the selected ticket at the root. This is a much, much quicker way to find the covered tickets than iterating over all tickets and making comparisons.

Note that we are excluding the selected ticket itself from appearing in the second level of the tree. There’s really no reason to do that except to make the diagrams less cluttered. In reality, since the underlying graph is not directed, the node for ticket 0 would actually appear on the second level as well, and you would have to ignore that loop in your code.

Now consider extending this search graph for two more levels.

A four-level relationship tree starting from selected ticket 0 in a 3-from-6 lottery with 2-matches.

(Again, I’ve taken liberties to reduce the size of the graph by not repeating match nodes in the third level.)

Now what we have on the third level is the matches contained in the tickets covered by the selected ticket. And what we have on the fourth level is the tickets covered by those covered tickets. But, since coverage is symmetric, the fourth level of the graph can also be described as the tickets that cover the same tickets that the selected ticket does. These fourth-level tickets are the tickets whose coverage potentials are reduced as a result of this selection.

If it helps you see things more clearly, here is the same search tree but with the ball numbers in the ticket nodes and match nodes shown explicitly:

A four-level relationship tree starting from selected ticket 0 in a 3-from-6 lottery with 2-matches. Here, ticket and match indexes have been replaced with full descriptions of the tickets and matches in question.

For example (following the right-most lines), selected ticket 1 2 3 contains match 2 3 which covers ticket 2 3 6. Ticket 2 3 6 contains match 3 6, so it would also have been covered by ticket 3 5 6. Therefore ticket 3 5 6 has had its coverage potential reduced by the selection of ticket 1 2 3.

There are still two problems remaining with this approach, both involving “over-charging” (reducing the coverage potential of a ticket by too much). First, we need to ignore second-level ticket nodes corresponding to tickets that are already covered. Obviously, if ticket 1 2 3 is selected but ticket 2 3 6 is already covered, we don’t want to reduce the coverage potential of ticket 3 5 6 because of that. This can be accomplished easily by simply checking to see if a level two node’s ticket index is in the set of covered ticket indices before continuing to traverse the tree. Since the set of covered ticket indices is going to be very small compared to the size of almost everything else, this check is in practice almost negligible, but in theory is at worst O(n) (but only for a very, very bad wheel).

Secondly, we want to avoid reducing the coverage potential of a ticket more than once due to the same overlapping ticket between it and the selected ticket. This is in fact not possible in the toy example I’ve been using, but consider a 4-from-6 lottery with 2-matches. Then if we select ticket 1 2 3 4, we should reduce the coverage potential of ticket 3 4 5 6 because they share coverage of ticket 1 3 4 5. But ticket 3 4 5 6 covers 1 3 4 5 through three different 2-matches: 3 4, 4 5, and 3 5. We need to make sure we only reduce the coverage potential of 3 4 5 6 due to this single shared coverage.

The way to avoid this is to simply not allow duplicates in the second level. Once we have visited the children of a covered ticket node in level two, we know we have reduced the coverage potential of every ticket sharing coverage of this one with the selected ticket exactly once. We do not want to do that again. Therefore, we must keep track of which ticket indices have been used as level-two nodes on a particular traversal of the graph, and not traverse their children again. The maximum size of level two of the search graph is in fact known: it is exactly the initial coverage potential of every ticket node. This is, comparatively speaking, a small number. Therefore, we can use a simple, memory-efficient bitset to remember which ticket indices we have seen so far in level two and avoid repeating their traversal.

A helpful effect of traversing the graph in the manner described above is that the number of times you encounter a ticket in the fourth level is exactly equal to the size of the coverage reduction that needs to be applied to that ticket. For example, when ticket 1 2 3 is selected, ticket 3 5 6 needs to have its coverage potential reduced by four, since it shares coverage of four different tickets with the selected ticket 1 2 3. If you review the search graph above, you will note that there are precisely four unique paths between the root and ticket 3 5 6 on level four.

In summary, what we have to do after selecting a ticket for the wheel, in order to keep coverage potentials up to date, is to perform a depth-first traversal of the relationship graph starting from the selected ticket. We limit the depth to four levels beyond the root, and we ensure that no node corresponding to a covered ticket appears at the second level. Further, we make sure that no ticket node appears in the second level more than once. Finally, we reduce the coverage potential of a node each time we encounter it in the fourth level, and mark a ticket as covered each time we encounter its node in the second level.

As pseudo-code, this might look like:

while ( maximum coverage potential > 0 ) do

  let t_max = ticket with max coverage potential

  t_max.covered = true
  t_max.remaining_coverage = 0   // a selected ticket cannot contribute any new coverage

  add t_max to wheel

  // Seed the stack with the level-1 match nodes of the selected ticket
  for each match m in t_max.match_nodes do
    dfs_stack.push(m)
  done

  while ( dfs_stack not empty ) do

    let n = dfs_stack.pop()

    // Remember, at levels 1 and 3 n is a match node,
    // and at levels 2 and 4 n is a ticket node

    if ( dfs_level = 1 or dfs_level = 3 ) then

      for each ticket t in n.ticket_nodes do
        dfs_stack.push(t)
      done

    endif

    if ( dfs_level = 2 and n.covered = false and
         n has not been visited at level 2 ) then
        
      for each match m in n.match_nodes do
        dfs_stack.push(m)
      done

      n.covered = true
 
    endif

    if ( dfs_level = 4 ) then
      n.remaining_coverage -= 1
    endif

  done
done

This intentionally leaves out implementation details such as how we deal with having both ticket nodes and match nodes in the dfs_stack container at the same time, and how we keep track of dfs_level, et cetera. But this, as with most pseudocode, is in the interest of making the important part of the algorithm as clear as possible. Actual code for this will be linked in a later article.
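
One way to handle those details is to push (node index, level) pairs onto the stack, since the level alone tells you whether an index refers to a match node or a ticket node. The following C++ sketch does exactly that, repeating the illustrative node structs from the earlier sketch with the coverage meta-data added; it is a sketch of the traversal described above, not the real implementation.

#include <cstddef>
#include <stack>
#include <utility>
#include <vector>

struct MatchNode  { std::vector<size_t> ticket_nodes; };
struct TicketNode {
    std::vector<size_t> match_nodes;
    long remaining_coverage = 0;   // coverage potential
    bool covered = false;
};

// Update coverage potentials after ticket `selected` has been added to the
// wheel, using a depth-limited (four levels below the root) DFS of the
// ticket/match relationship graph.
void apply_selection(size_t selected,
                     std::vector<TicketNode>& tickets,
                     const std::vector<MatchNode>& matches) {
    tickets[selected].covered = true;
    tickets[selected].remaining_coverage = 0;  // never pick the same ticket twice

    // Tickets already expanded at level two on this traversal, so that a
    // shared coverage is only charged once.
    std::vector<bool> visited_lvl2(tickets.size(), false);

    // Odd levels hold match-node indices, even levels hold ticket-node
    // indices. Seed the stack with the selected ticket's matches (level 1).
    std::stack<std::pair<size_t, int>> dfs;
    for (size_t m : tickets[selected].match_nodes) dfs.push({m, 1});

    while (!dfs.empty()) {
        const auto [n, level] = dfs.top();
        dfs.pop();

        if (level == 1 || level == 3) {
            // n is a match: descend into the tickets that contain it.
            for (size_t t : matches[n].ticket_nodes) dfs.push({t, level + 1});
        } else if (level == 2) {
            // n is a ticket covered by this selection: descend only if it was
            // not already covered and has not been expanded on this traversal.
            if (tickets[n].covered || visited_lvl2[n]) continue;
            visited_lvl2[n] = true;
            tickets[n].covered = true;
            for (size_t m : tickets[n].match_nodes) dfs.push({m, level + 1});
        } else {  // level == 4
            // n shares a coverage with the selected ticket: charge it once.
            tickets[n].remaining_coverage -= 1;
        }
    }
}

The main loop of the generator then just alternates picking the ticket with the maximum remaining coverage potential and calling something like apply_selection, until that maximum reaches zero.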

Finishing

Once we have assembled the wheel by ticket index, there remains only one thing to do, and that is to convert the ticket indices in the wheel back into tickets described by a series of ball numbers, so that we know which tickets are actually in the wheel after all. As I laid out in the design article, the key that makes the ticket index scheme reversible is that tickets are generated in a deterministic order.

Therefore, if our wheel consists of tickets 2, 12, 105, and 4096, we simply run the same (relatively fast) procedure that we used to generate the tickets in the first place, and we print out the second, twelfth, 105th and 4096th tickets that we generate. Since the generation procedure is deterministic, the 105th ticket generated at the end must be identical to the 105th ticket generated at the beginning, and we need not keep an explicit correspondence table in memory.
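
A minimal sketch of this finishing step, assuming the wheel stores zero-based indices into the deterministic generation order of the generate_tickets sketch from earlier, and that bit i of a ticket bitset stands for ball i + 1:

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

std::vector<uint64_t> generate_tickets(int t, int r);  // from the generation sketch earlier

// Turn a ticket bitset back into ball numbers (bit i set means ball i + 1).
std::vector<int> balls_from_bitset(uint64_t ticket) {
    std::vector<int> balls;
    for (int i = 0; i < 64; ++i)
        if (ticket & (1ull << i)) balls.push_back(i + 1);
    return balls;
}

// Print the wheel. Because generate_tickets() produces tickets in a fixed,
// deterministic order, re-running it recovers the ticket that each stored
// index refers to without keeping a correspondence table around during the
// main computation.
void print_wheel(const std::vector<size_t>& wheel, int t, int r) {
    const std::vector<uint64_t> all = generate_tickets(t, r);
    for (size_t idx : wheel) {
        for (int ball : balls_from_bitset(all[idx]))
            std::printf("%d ", ball);
        std::printf("\n");
    }
}

A streaming variant could print tickets as they are generated instead of materializing the full list a second time, which matters if memory is tight at this point.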

In the next article I will address the asymptotic complexity of this algorithm, and describe the results of running an implementation of it.

5 comments

  1. Interesting to see someone else’s take on this problem. I’ve been messing around with it for some time with little success.

  2. I implemented this greedy algorithm, but my implementation has a problem: the (20, 6, 5) lottery is very, very fast, but (20, 5, 3) is very, very slow. Maybe it is because of the coverage potential of tickets in (20, 5, 3).

  3. In the last picture (A four-level relationship tree …), I add the ticket “1 2 3” at level 2, because if you don’t add it, the coverage potentials of all tickets never reach 0.

  4. Good catch. My implementation (which I hope to post in this coming week) sets coverage potential of a selected ticket to 0 when selecting it. That’s a pretty obvious thing to do, but I didn’t state it explicitly in this article.

    As to why (20, 6, 5) is faster than (20, 5, 3) for you I can’t say… larger numbers in every variable should lead to a longer running time, since nothing is inversely proportional. It may be an artifact of your implementation. My implementation takes 18 seconds to run on (20, 6, 5) and 10 seconds to run on (20, 5, 3).
