Greedy Algorithm Design Decisions

Part of a series of articles on

In the previous article, I discussed the challenges inherent in implementing even the straightforward greedy algorithm on realistically-sized lottery problems. This article describes the design decisions and implementation choices that I made when writing my greedy algorithm based wheel generator in order to overcome the challenges described in the previous article. The implementation was done in C++, so some C++ jargon is found in this article, but it should be intelligible to most programmers.

Expressing Tickets and Matches

As mentioned in the previous article, the most obvious way to express a ticket or match is as an array (or vector, or set) of the numbers of each of the balls that it contains. This requires up to t bytes per ticket, and requires O(t) time to see if two tickets share a match.

For my implementation, I chose to instead express a ticket or match as a r-bit bit array, where bit b is set to 1 if and only if ball b exists in the ticket or match. This drastically cuts down the comparison time from O(t) to O(1), as a single bitwise AND operation will detect all balls in common between two tickets, and from there lookup tables can be used to count the number of bits set in the result.

This does, however change the storage requirement for each match or ticket from t bytes to r bits, which for some lotteries will be an increase. However, below I will describe a method of reducing overall memory requirements that renders this concern irrelevant.

Key Insights

Once we have all tickets and matches generated, the following insights are key to fitting the rest of the wheel generation algorithm into the resource constraints:

Once the relationships between tickets have been completely expressed, the actual numbers that constitute the tickets are of no significance and can be discarded.
It is only necessary to keep track of how many uncovered tickets an unselected ticket could potentially cover if selected. It is unnecessary to know exactly which tickets those are.
Every ticket initially covers the same number of other tickets, and this number can be computed through combinatorics alone.

Insight 1 saves a lot of memory. Even when tickets are expressed as bit arrays, 8 bytes are required to store the exact combination of numbers that a ticket contains. However, if the number of tickets is limited to being fewer than 2³¹, which is true of the 6-from-49 problem, then it would require only 4 bytes to store a sequential integer that would uniquely identify each ticket (but not describe its contents).

Since the tickets are generated in a deterministic order, it is possible to reverse this mapping once the wheel has been selected by simply re-generating all of the tickets in the same order and printing the contents of the tickets corresponding to the selected indexes.

This reduces the problem’s memory footprint by about half. Without this insight, the Set Cover implementation described here would require nearly 4 GB of memory – more than the limitation I imposed. The reason for restricting to 2³¹-1 tickets and not 2³²-1 will be explained later.

Insights 2 and 3 together save time. If we know immediately how many uncovered tickets a given ticket would cover, then it is simple to select the next ticket for the wheel in O(t) time. Because of insight 3, we know that we can initialize this information quickly and independently of the number of tickets. Then, only task remaining is to keep this information up-to-date after each selection. That procedure comprises the bulk of the algorithm.

Expressing Relationships between Tickets

Recall the earlier example of the covering relationship map in a 3-from-6 lottery:

A visualization of 2-match relationships in a 3-from-6 lottery. Click for full size.

The memory footprint of the relationship expression is proportional to the number of lines in the graph above. We can reduce the number of lines but keep the same relationship information by adding nodes that represent 2-matches. The convention will be that ticket nodes will be ovals and match nodes will be rectangles. Lines will then be drawn from each ticket node to each of the match nodes for 2-matches that it contains, like so:

A visualization of 2-match relationships in a 3-from-6 lottery with intermediate rectangular nodes representing each 2-match. Click for full size.

The same covering relationships still exist. In this graph, a ticket covers another ticket if there is a path between them that passes through a single match node. But notice that the number of lines have been greatly reduced. The degree of (number of lines touching) each ticket node has been reduced from 9 to 3, so the number of edges (lines) has decreased from 90 to 30 – a 66% reduction in memory requirement.

We can reduce the memory requirement yet further by applying insight 1 from above – that once the relationship structure is expressed, we need not remember the actual ticket numbers any longer. Therefore, we can replace the labels on the ticket nodes and match nodes by separate series of sequential numbers, as below:

A visualization of 2-match relationships in a 3-from-6 lottery with intermediate rectangular nodes representing each 2-match. This nodes in this graph have been relabeled sequentially and have lost information about which tickets and matches they correspond to. Click for full size.

By doing this, we have effectively compressed our 64-bit bit array representation of node labels to 32-bit integers (for ticket nodes) and 16-bit integers (for match nodes). Recall that the reverse mapping between sequential labels and bit array representations need not be stored, since the nodes and their labels were generated in a deterministic order. We can always re-generate this mapping when we need it, and we won’t need it again except for once at the very end.

But a graph is just a picture. How is this held in memory? My solution was to use a compact struct for each node, both ticket and match types, which stored an adjacency list for the node, as well as any other necessary meta-information for the node. The struct for ticket nodes is:

struct TicketNode {  
  unsigned is_covered         :1;
  unsigned remaining_coverage :31;
  std::vector<uint16_t> match_indexes;
};

Or, logically speaking for the non-C++ programmers among us:

Field	Type	Size	Description
is_covered	boolean	1 bit	True if the ticket described is covered by the current wheel.
remaining_coverage	unsigned integer	31 bits	Gives the number of currently uncovered tickets that would become covered if this ticket were selected for the wheel.
match_indexes	vector of 16-bit unsigned integers	12 + 2 * C(t,m) bytes	List of match indexes that correspond to m-matches contained in the ticket corresponding to this node.

The extra 12 bytes in the match_indexes field is overhead for using a std::vector. This could be eliminated by using an array instead, but this would be inconvenient for a program that aimed to support a range of problem sizes.

The struct for the match nodes is much simpler. It contains no information other than the adjacency list:

struct MatchNode {
  std::vector ticket_indexes;
};

Field	Type	Size	Description
ticket_indexes	vector of 32-bit unsigned integers	12 + 4 * C(r-m,t-m) bytes	List of ticket indexes that correspond to tickets that contain the m-matches corresponding to this node.

Notice that match nodes clearly require more space than ticket nodes, since ticket indexes must be larger, and since match nodes will always have a much higher degree than ticket nodes. Fortunately, there are many, many fewer match nodes than ticket nodes in realistic instances of Lottery Wheel Generation. The larger the match size is, the larger the minimal wheel would necessarily be, and so the less useful the wheel would be.

It is necessary to consider for a moment whether 31 bits and 16 bits is always going to be enough space for ticket and match nodes respectively. Recall that we are limiting ourselves to 64-ball lotteries, at most. In a 64-ball lottery, the largest number of possible tickets occurs when tickets contain 32 balls, and number of possible tickets is C(64,32). This turns out to be around 10¹⁸, and so would require at least 62 bits to sequentially number all tickets. So the answer is no, 31 bits will not always be enough. This is why I mentioned above that we are placing an arbitrary restriction of 2³¹-1 on the number of tickets in the lottery. The reason why the power is 31 and not 32 should now be apparent – I needed one extra bit for a boolean flag. Similarly, since the match size can be anything up to and including the ticket size, 16 bits will also not necessarily be enough for the match indexes, so we need an analogous restriction on the number of matches. Still, 32 and 16 bits are enough for realistically-sized lottery problems.

Storage Requirements

For the case of 3-matches in a 6-from-49 lottery, the struct definitions above lead us to conclude that we require 56 bytes for each ticket node, and 60,732 bytes (about 50 KiB) for each match node.

In such a case, there are C(49,6) = 13,983,816 tickets and C(49,3) = 18,424 matches. Therefore the memory requirement for this problem expression is 1,902,020,064 bytes, or about 1.77 GiB. This fits in under the self-imposed limit limit of 2 GB, even when it exists simultaneously with the roughly 107 MiB of ticket and match enumerations that are used to construct it.

However, this does not leave much room for temporary state data during algorithm execution, which is why we must delete the ticket and match enumerations after we have constructed the relationship graph. In doing that we allow about 234 MiB of free memory for temporary data during the algorithm’s execution, which turns out to be more than enough.

Computing Initial Coverage Potential

The “coverage potential” of a ticket, as defined in the previous section, is the number of currently uncovered drawn tickets which would become covered if we decided to include the ticket in question in our wheel. Due to the symmetry of the ticket relationships, each ticket starts out with an identical coverage potential.

The initial coverage potential of each ticket is simply the number of unique tickets that share a match with a given ticket. We known that for m-matches in a t-from-r lottery, the number of unique m-matches found on a ticket is C(m,t), and each m-match can be found on C(r-m,t-m) tickets, but some tickets will contain more than one of the m-matches found on another ticket. So we have to be careful not to count duplicates.

The formula for initial coverage, excluding duplicates, is:

$\sum_{k = m}^{t} \binom{t}{k} \binom{r-t}{t-k}$

This computes sum of the number of tickets having exactly k balls in common with a given ticket, as k goes from the match size to the ticket size. This particular form comes from a paper by Jans and Degraeve.

Selecting the Next Ticket for the Wheel

Selecting the first ticket is trivial. Due to the symmetry of the problem, it simply does not matter which ticket is chosen first. Any chosen ticket will yield the same structural state. Any choice is equivalent to any other choice except for a permutation in the ticket and match label numbers.

After one ticket has been selected, selecting the next ticket for the wheel is almost trivial given an accurate coverage potential value in each ticket node. The greedy algorithm, by definition, selects the ticket with the highest remaining coverage potential. We can determine which ticket has the highest coverage potential by doing a simple scan of all tickets in O(t) time, which is negligible compared to the time required by other parts of the algorithm.

What happens, though, if more than one ticket has the same, largest coverage potential? I chose, at least initially, to select randomly among these options. I will revisit this decision in a later article.

Updating Coverage Potentials

The simplicity of selecting the next ticket for the wheel was, however, predicated on the assumption that ticket nodes contain accurate, up-to-date values for their coverage potential fields. Initially this is easy to guarantee because all tickets have identical initial coverage potentials. But how do we efficiently keep these values up to date when a new ticket is selected for the wheel?

The approach to this will be to perform a depth-limited depth-first search through the problem graph, starting from the selected ticket to find all other tickets whose coverage potentials would be reduced by this selection. The exact implementation of this mechanism, and a discussion of its efficiency, will be the primary topic of the next article.

« An Introduction to the Lottery Problem

Greedy Algorithm Implementation Walkthrough »

Share this content on: