It’s easy enough to choose an item at random from a collection, if they all have the same probability of being chosen. When each of them has its own probability though, things get a little more complicated.
For example, if you wanted to choose an item from the following list, with the odds given, what’s the best way to approach it?
Item | Odds |
---|---|
A | 5% |
B | 5% |
C | 10% |
D | 15% |
E | 30% |
F | 12% |
G | 23% |
This might seem a strange thing to want to do, but looking into Markov Chains (like these: Simulating The Future – Part 1: A Brief Introduction to Markov Chains - what a coincidence!) and other simulations, it’s actually quite a handy thing to be able to do. There are several ways to do it, but this method seems to be fairly straightforward.
First calculate the cumulative odds for the items in the collection:
Item | Odds | Cumulative |
---|---|---|
A | 5% | 0.05 |
B | 5% | 0.10 |
C | 10% | 0.20 |
D | 15% | 0.35 |
E | 30% | 0.65 |
F | 12% | 0.72 |
G | 23% | 1.00 |
Generate a random number between 0 and the total of all of the odds, search the collection to find the first item where the cumulative odds are greater than the random number.
Examples
- If the random number is 0.351, the first cumulative value greater than this is 0.65, which gives us item E.
- If the random number is 0.0498, the first cumulative value greater than this is 0.05, which gives us item A.
- If the random number is 0.733, the first cumulative value greater than this is 1.00, which gives us item G.
We can work this out using a simple loop to iterate through the collection, and then break out of the loop when we find the number we’re looking for. However that’s not a very efficient way to do this, or anything else for that matter, as it’s an O(n) operation. There is a better way to find the position we want within the collection, and that’s to use the bisect.bisect_left
method in Python, and BinarySearch
methods in C#. Both of these are O(log n) operations, so they’re almost certainly the best way to go – using them isn’t going to cost you anything so there’s no disadvantage.
Notes:
There’s no need to sort anything to make it work with the bisect.bisect_left
or BinarySearch
– because they’re working on the cumulative odds, and those are already in ascending order.
If you’re working with occurrences rather than explicit probabilities, that’s not a problem. Just use the cumulative number of occurrences, and then generate a random number between 0 and the total number of occurrences. For example:
Item | Occurrences | Cumulative |
---|---|---|
A | 6 | 6 |
B | 56 | 62 |
C | 17 | 79 |
D | 6 | 85 |
E | 23 | 108 |
F | 46 | 154 |
G | 6 | 161 |
In this case, the random number would be in the range 0 to 161. The rest of the method is the same, using bisect.bisect_left
or BinarySearch
to calculate the selected item from the collection.
Code Examples
As a demonstration of this method here are two examples of its implementation.
Python
from bisect import bisect_left
from itertools import accumulate
from random import random
# Our original set of items, and the odds of each of them occurring
items_with_odds = {'a': 0.05, 'b': 0.05, 'c': 0.10, 'd': 0.15, 'e': 0.30, 'f': 0.12, 'g': 0.23}
# Calculate the cumulative odds of the items in the items_with_odds collection
cumulative_odds = list(accumulate([odds for odds in list(items_with_odds.values())]))
# The upper limit of the random number to choose an individual item
total_odds = cumulative_odds[-1]
# Initialize a dictionary where we can store the results. Quicker and faster to
# create it all now, rather than check each key's existence every time we update it.
results = dict.fromkeys(items_with_odds.keys(), 0)
# Carry out a statistically valid number of tests
for count in range(0, 1000000):
# Choose a value in [0, total_odds)
value = random() * total_odds
# Find the index of the smallest cumulative odds value >= the chosen value
i = bisect_left(cumulative_odds, value)
# Find the key at the selected index
selected_key = list(items_with_odds.keys())[i]
# Update the number of selections made for the item
results[selected_key] = results[selected_key] + 1
# Display the items, and the number of times each were selected
for key in sorted(results.keys()):
print('{}: {}'.format(key, results[key]))
C# In .Net Core
Just in case you want a fast implementation...
using System;
using System.Collections.Generic;
using System.Linq;
namespace ChoosingRandomItemsByOddsCS {
public static class Extensions {
public static IEnumerable<double> CumulativeSum(this IEnumerable<double> sequence) {
double total = 0.0;
foreach (var item in sequence) {
total += item;
yield return total;
}
}
}
class Program {
static void Main(string[] args) {
// Our original set of items, and the odds of each of them occurring
IReadOnlyDictionary<string, double> itemsWithOdds = new Dictionary<string, double> { { "a", 0.05 },
{ "b", 0.05 },
{ "c", 0.10 },
{ "d", 0.15 },
{ "e", 0.30 },
{ "f", 0.12 },
{ "g", 0.23 },
};
// Calculate the cumulative odds of the items in the itemsWithOdds collection
List<double> cumulativeOdds = new List<double>(itemsWithOdds.Values.CumulativeSum());
// The upper limit of the random number to choose an individual item
var totalOdds = cumulativeOdds.Last();
// Initialize a dictionary where we can store the results. Quicker and faster to
// create it all now, rather than check each key's existence every time we update it.
IDictionary<string, int> results = itemsWithOdds.Keys.ToDictionary(k => k, k => 0);
Random random = new Random();
// Carry out a statistically valid number of tests
foreach (int i in Enumerable.Range(0, 1000000)) {
// Choose a value in [0, total_odds)
var value = random.NextDouble() * totalOdds;
// Find the index of the smallest cumulative odds value >= the chosen value
var index = cumulativeOdds.BinarySearch(value);
index = index < 0 ? ~index : index;
// Find the key at the selected index
var selectedKey = itemsWithOdds.Keys.ToList() [index];
// Update the number of selections made for the item
results[selectedKey] += 1;
}
Console.WriteLine(string.Join(", ", results));
}
}
}
This All Looks a Little Familiar Somehow...
Yes, it does. That's because this functionality is implemented in the Python method:
random.choices()
from Python 3.6 onwards. Alternatively the same functionality is available in NumPy using the
numpy.random.choice()
method.