Nov 28, 2025
10 min
I've been programming for a decent amount of time. I started building mini games in middle school, learned fundamentals in high school, and now I'm about to graduate college. Throughout this journey, my beliefs on certain matters like OOP have changed due to personal experience and education. However, one core belief that has stuck with me is that deterministic programs are the ideal. Bugs are reproducible, answers are absolute, and my sanity stays intact. Living in a world of race conditions and side effects where my program might work was unimaginable.
You can then imagine my absolute disgust when I learned about probabilistic data structures. I remember scoffing at the idea of a skip list and thinking that I would never use it in practice. After all, it sounded like a linked list with extra steps. Also, I was never incentivized to learn it since Leetcode doesn't exactly have a section on probabilistic data structures.
Hopefully that was enough foreshadowing for you to guess that I've since changed my beliefs. After working at large scales, I've found a deep appreciation for how powerful probability is, and I'll be using bloom filters as an example.
Here's a one sentence summary of what a bloom filter is designed to do: check if a value is in a set. Specifically, it's designed to filter out values not in the set. Ok I know that was two sentences, but just shut up.
To give a more concrete example, let's say I'm a networking company and I want to prevent a user from going to a blacklisted web URL. Let's say that we have all blacklisted URLs stored in a file. Then every time a user visits a URL, we have to check that it's not in the file of blacklisted URLs. You could linearly search through each URL in the blacklist file, but that would be kind of inefficient. That's where bloom filters come in handy (kind of).
The way bloom filters work can be illustrated with the following table:
| Return Value | Actuality |
|---|---|
| FALSE | The value is 100% not in the set |
| TRUE | The value MIGHT be in the set (the probability of it actually being in the set depends on the bloom filter parameters) |
For the discrete math people out there, that means that bloom filters can NEVER have false negatives, but they CAN have false positives. You might be wondering how this is even useful, since a truthy result doesn't guarantee that the value exists in the set. And you would be right! In practice, bloom filters only serve as a starting point to very efficiently filter out values not in the set; if a lookup returns true, the value goes through a more expensive operation for confirmation. Without bloom filters, you would essentially have to run the expensive querying operation for every request, which can be much slower, as I'll explore later on.
Now that we have a high level idea of what bloom filters do, let's clear up how its operations work. A bloom filter is just an array of m bits (all initially set to 0) with two operations: add, which runs a value through k hash functions and sets the corresponding bits, and lookup, which returns true only if all k of those bits are set.
Here's a basic implementation of a bloom filter using the URL example from earlier. Note that real-life implementations typically use more advanced hashing.
#include <vector>
#include <string>
#include <functional>

class BloomFilter {
private:
    std::vector<bool> bitArr; // m bits, all initially false
    size_t m;                 // size of the bit array
    size_t k;                 // number of hash functions

    // Simulates k hash functions via double hashing: h_i(x) = h1(x) + i * h2(x)
    size_t hashFunc(const std::string& url, size_t i, size_t mod) const {
        size_t h1 = std::hash<std::string>{}(url);
        size_t h2 = std::hash<std::string>{}(url + "salt");
        return (h1 + i * h2) % mod;
    }

public:
    BloomFilter(const std::vector<std::string>& urls, size_t m = 5000, size_t k = 3)
        : bitArr(m), m(m), k(k) {
        for (const auto& url : urls) {
            add(url);
        }
    }

    // Sets the k bits corresponding to the url
    void add(const std::string& url) {
        for (size_t i = 0; i < k; i++) {
            bitArr[hashFunc(url, i, m)] = true;
        }
    }

    // Returns false if the url is definitely not in the set,
    // true if it MIGHT be in the set
    bool lookup(const std::string& url) const {
        for (size_t i = 0; i < k; i++) {
            if (!bitArr[hashFunc(url, i, m)]) {
                return false;
            }
        }
        return true;
    }
};
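To make the "filter first, confirm after" pattern from earlier concrete, here's a minimal sketch of how the class might be wired up. The expensiveBlacklistCheck fallback and the example URLs are hypothetical stand-ins for whatever expensive source of truth (a file scan, a database) you actually have:

#include <iostream>
#include <unordered_set>

// Hypothetical expensive source of truth (imagine a database hit or file scan)
bool expensiveBlacklistCheck(const std::unordered_set<std::string>& blacklist,
                             const std::string& url) {
    return blacklist.count(url) > 0;
}

int main() {
    std::vector<std::string> blacklistedUrls = {"bad.example.com", "evil.example.com"};
    std::unordered_set<std::string> blacklist(blacklistedUrls.begin(), blacklistedUrls.end());
    BloomFilter filter(blacklistedUrls);

    std::string url = "good.example.com";
    if (!filter.lookup(url)) {
        // No false negatives, so we can safely skip the expensive check
        std::cout << "allowed\n";
    } else if (expensiveBlacklistCheck(blacklist, url)) {
        std::cout << "blocked\n"; // confirmed blacklisted
    } else {
        std::cout << "allowed\n"; // the filter gave a false positive
    }
}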
If you're still unsure about bloom filters, here's a great article explaining it with visuals. Also, a couple of additional notes on bloom filters:
- Bloom filters are highly efficient. If you look at the operations, add and lookup each run in O(k) time no matter how many items have been added, and the whole structure occupies just m bits. Thus in theory, it should significantly outperform other query methods at scale. One big aspect of bloom filters is the probability of having a false positive. Ideally the lower the better, but oftentimes it's a balancing act that requires consideration of the project constraints.
- The accuracy of the bloom filter depends on a couple of parameters: the number of hash functions k, the length of the bit array m, and the number of items to be stored n. You can use this calculator to find the optimal parameters for the accuracy you're looking for, or compute them with the standard formulas sketched below.
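The calculator is built on the usual closed-form sizing math, which is easy enough to compute yourself. Here's a small sketch (optimalM and optimalK are my own helper names; n and p come from your use case):

#include <cmath>
#include <cstddef>

// Optimal bit array length for n items at false positive probability p:
//   m = -n * ln(p) / (ln 2)^2
size_t optimalM(size_t n, double p) {
    return static_cast<size_t>(std::ceil(
        -static_cast<double>(n) * std::log(p) / (std::log(2.0) * std::log(2.0))));
}

// Optimal number of hash functions for that m and n:
//   k = (m / n) * ln 2
size_t optimalK(size_t m, size_t n) {
    return static_cast<size_t>(std::round(
        static_cast<double>(m) / static_cast<double>(n) * std::log(2.0)));
}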
Ok, now onto some quick benchmarks that I made. In this make believe scenario, I'll be generating n URLs and measuring the speed of different querying techniques. For the bloom filter specifically, I tuned the probability of a false positive to be a 1 in 1,000,000 chance for each value of n. For each test, the times are the result of performing 500 queries against the n URLs. For each query, I chose a uniform distribution where there's a 50% probability of the query containing a URL in the list. In practice, the distribution of hits and misses is an important consideration in which method to use. Also for fun, I'll be tracking the number of false positives.
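For reference, each timed trial looks roughly like the sketch below. The makeQuery and timeQueries helpers are hypothetical names of mine, and the 50/50 hit/miss split matches the setup described above:

#include <chrono>
#include <random>

// Hypothetical helper: returns a URL from `urls` half the time,
// and a URL that definitely isn't in the set the other half
std::string makeQuery(const std::vector<std::string>& urls, std::mt19937& rng) {
    std::uniform_int_distribution<size_t> pick(0, urls.size() - 1);
    if (rng() % 2 == 0) {
        return urls[pick(rng)]; // 50%: a URL that is in the set
    }
    return "absent-" + std::to_string(rng()); // 50%: a URL that isn't
}

// Times numQueries lookups against the filter, in milliseconds
double timeQueries(const BloomFilter& filter, const std::vector<std::string>& urls,
                   size_t numQueries) {
    std::mt19937 rng(42); // fixed seed so runs are comparable
    auto start = std::chrono::steady_clock::now();
    for (size_t i = 0; i < numQueries; i++) {
        volatile bool mightBeInSet = filter.lookup(makeQuery(urls, rng));
        (void)mightBeInSet; // keep the compiler from optimizing the lookup away
    }
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}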
| n value | linear test (ms) | binary test (ms) | hash set test (ms) | bloom filter test (ms) | bloom filter accuracy | bloom filter m value |
|---|---|---|---|---|---|---|
| 100,000 | 2142.781 | 20.683 | 37.136 | 1.128 | 0/500 wrong | 29849749 |
| 1,000,000 | 22111.762 | 223.311 | 398.971 | 3.473 | 0/500 wrong | 298497488 |
| 10,000,000 | 230906.698 | 2334.339 | 4206.678 | 28.278 | 0/500 wrong | 2984974875 |
Note: Adding values to the bloom filter and hash set, and sorting the URLs for binary search, is not accounted for. I am timing just the retrieval part.
Note: For the bloom filter, p (probability of a false positive) and k (number of hash functions) were held constant; the only value that changed is m (size of the bit array).
As you can see, the bloom filter is orders of magnitude faster than the rest of the methods, especially as the number of URLs to search through increases.
Another benchmark that I did was looking at the number of queries that each method can handle. For these tests, I fixed n (the number of URLs to search through) at 1,000,000. For the bloom filter specifically, I set the same p value (0.000001) with all the other parameters being the same. Similar to the previous test, I used a uniform distribution where about 50% of queries contained a URL in the set.
| num_queries | binary_test (ms) | hash_set_test (ms) | bloom_filter_test (ms) | bloom filter accuracy |
|---|---|---|---|---|
| 10,000 | 264.498 | 409.608 | 21.85 | 0 / 10,000 |
| 1,000,000 | 2109.36 | 784.292 | 1060.813 | 2 / 1,000,000 |
| 100,000,000 | 185467.185 | 37705.957 | 102710.91 | 54 / 100,000,000 |
From the above table, it seems like the bloom filter is faster for a smaller number of queries. Interestingly though, as the queries scale, it becomes the second best option, with the hash set being faster. The results sort of make sense: a hash set lookup is a single hash plus a bucket probe, and since n is held constant, collisions stay manageable. The bloom filter, on the other hand, has to compute k hashes per lookup, so there's a little more overhead per query, which explains the difference. Also, a small note on the accuracy: with p = 0.000001 and roughly half of the queries being misses, you'd expect around 50 false positives out of 100,000,000 queries, so the observed counts look pretty normal.
I updated the bloom filter to use a more efficient hash function called murmur3 and retested the query numbers to see if it made a meaningful difference. I used Peter Scott's murmur3 implementation in C and wrapped it in a separate hash function call:
// Computes all k bit indices for a url in one call, deriving them from the
// two 64-bit halves of a single 128-bit murmur3 hash via double hashing
std::vector<size_t> hashFunc2(const std::string& url) const {
    uint64_t hashes[2];
    MurmurHash3_x64_128(url.data(), static_cast<int>(url.size()), 40, hashes);
    const uint64_t h1 = hashes[0];
    const uint64_t h2 = hashes[1];
    std::vector<size_t> indices;
    indices.reserve(k);
    for (size_t i = 0; i < k; ++i) {
        uint64_t nextHash = h1 + static_cast<uint64_t>(i) * h2;
        indices.push_back(nextHash % m);
    }
    return indices;
}
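Slotting this into the class, lookup barely changes; it just walks the precomputed indices (a sketch matching the class from earlier):

bool lookup(const std::string& url) const {
    for (size_t idx : hashFunc2(url)) {
        if (!bitArr[idx]) {
            return false; // an unset bit means definitely not in the set
        }
    }
    return true; // all bits set, so the URL might be in the set
}

As a side note, deriving all k indices as h1 + i * h2 from one 128-bit hash is the classic double hashing trick from Kirsch and Mitzenmacher, which is why a single murmur3 call can stand in for k independent hash functions.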
| num_queries | binary_test (ms) | hash_set_test (ms) | bloom_filter_test (ms) | bloom filter accuracy |
|---|---|---|---|---|
| 10,000 | 267.557 | 418.424 | 12.378 | 0 / 10,000 |
| 1,000,000 | 2103.351 | 797.02 | 540.278 | 1 / 1,000,000 |
| 100,000,000 | 182500.903 | 37481.711 | 52915.78 | 47 / 100,000,000 |
Based on the table above, the new hash function sped the bloom filter up significantly, roughly a 2x improvement, making it faster than the hash set at 1,000,000 queries. However, as the queries continue to scale, it faces the same problem of having to compute multiple hashes per lookup, and the hash set pulls ahead again at 100,000,000 queries.
Ultimately though, bloom filters are probably still a more practical choice than hash sets in large systems since the memory footprint is far lower: by the standard sizing formula, p = 0.000001 works out to roughly 29 bits (under 4 bytes) per item, versus storing entire URL strings plus hash table overhead. They balance large sets with efficient runtimes.
Thanks for reading. Cya.