Nov 28, 2025
10 min
I've been programming for a decent amount of time. I started building mini games in middle school, learned fundamentals in high school, and now I'm about to graduate college. Throughout this journey, my beliefs on certain matters like OOP have changed due to personal experience and education. However, one core belief that has stuck with me is that deterministic programs are the ideal. Bugs are reproducible, answers are absolute, and my sanity stays intact. Living in a world of race conditions and side effects where my program might work was unimaginable.
You can then imagine my absolute disgust when I learned about probabilistic data structures. I remember scoffing at the idea of a skip list and thinking that I would never use it in practice. After all, it sounded like a linked list with extra steps. Also, I was never incentivized to learn it since Leetcode doesn't exactly have a section on probabilistic data structures.
Hopefully that was enough foreshadowing for you to guess that I've since changed my beliefs. After working at large scales, I've found a deep appreciation for how powerful probability is, and I'll be using bloom filters as an example.
Here's a one sentence summary of what a bloom filter is designed to do: check if a value is in a set. Specifically, it's designed to filter out values not in the set. Ok I know that was two sentences, but just shut up.
To give a more concrete example, let's say I'm a networking company and I want to prevent a user from going to a blacklisted web URL. Let's say that we have all blacklisted URLs stored in a file. Then every time a user visits a URL, we have to check that it's not in the file of blacklisted URLs. You could linearly search through each URL in the blacklist file, but that would be kind of inefficient. That's where bloom filters come in handy (kind of).
The way bloom filters work can be illustrated with the following table:
| Return Value | Actuality |
|---|---|
| FALSE | The value is 100% not in the set |
| TRUE | The value MIGHT be in the set (the probability of it actually being in the set depends on the bloom filter parameters) |
For the discrete math people out there, that means that bloom filters can NEVER have false negatives, but they CAN have false positives. You might be wondering how this is even useful, since a truthy result doesn't guarantee that the value exists in the set. And you would be right! In practice, bloom filters only serve as a starting point to very efficiently filter out values not in the set; if a lookup returns true, the value goes through a more expensive operation for confirmation. Without bloom filters, you would essentially have to run the expensive querying operation for every request, which can be much slower, as I'll explore later on.
Now that we have a high level idea of what bloom filters do, let's clear up how its operations work. A bloom filter is just an array of m bits (all initially set to 0) with two operations: add, which runs a value through k hash functions and sets the corresponding bits, and lookup, which returns true only if all k of those bits are set.
Here's a basic implementation of a bloom filter using the URL example from earlier. Note that real-life implementations typically use more advanced hashing.
#include <vector>
#include <string>
#include <functional>

class BloomFilter {
private:
    std::vector<bool> bitArr; // m bits, all initially false
    size_t m;                 // size of the bit array
    size_t k;                 // number of hash functions

    // Simulates k hash functions via double hashing: h_i(x) = h1(x) + i * h2(x)
    size_t hashFunc(const std::string& url, size_t i, size_t mod) const {
        size_t h1 = std::hash<std::string>{}(url);
        size_t h2 = std::hash<std::string>{}(url + "salt");
        return (h1 + i * h2) % mod;
    }

public:
    BloomFilter(const std::vector<std::string>& urls, size_t m = 5000, size_t k = 3)
        : bitArr(m), m(m), k(k) {
        for (const auto& url : urls) {
            add(url);
        }
    }

    // Sets the k bits corresponding to the url
    void add(const std::string& url) {
        for (size_t i = 0; i < k; i++) {
            bitArr[hashFunc(url, i, m)] = true;
        }
    }

    // Returns false if the url is definitely not in the set,
    // true if it MIGHT be in the set
    bool lookup(const std::string& url) const {
        for (size_t i = 0; i < k; i++) {
            if (!bitArr[hashFunc(url, i, m)]) {
                return false;
            }
        }
        return true;
    }
};
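To make the "filter first, confirm after" pattern from earlier concrete, here's a minimal sketch of how the class might be wired up. The expensiveBlacklistCheck fallback and the example URLs are hypothetical stand-ins for whatever expensive source of truth (a file scan, a database) you actually have:

#include <iostream>
#include <unordered_set>

// Hypothetical expensive source of truth (imagine a database hit or file scan)
bool expensiveBlacklistCheck(const std::unordered_set<std::string>& blacklist,
                             const std::string& url) {
    return blacklist.count(url) > 0;
}

int main() {
    std::vector<std::string> blacklistedUrls = {"bad.example.com", "evil.example.com"};
    std::unordered_set<std::string> blacklist(blacklistedUrls.begin(), blacklistedUrls.end());
    BloomFilter filter(blacklistedUrls);

    std::string url = "good.example.com";
    if (!filter.lookup(url)) {
        // No false negatives, so we can safely skip the expensive check
        std::cout << "allowed\n";
    } else if (expensiveBlacklistCheck(blacklist, url)) {
        std::cout << "blocked\n"; // confirmed blacklisted
    } else {
        std::cout << "allowed\n"; // the filter gave a false positive
    }
}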
If you're still unsure about bloom filters, here's a great article explaining it with visuals. Also, a couple of additional notes on bloom filters:
- Bloom filters are highly efficient. If you look at the operations, add and lookup each run in O(k) time no matter how many items have been added, and the whole structure occupies just m bits. Thus in theory, it should significantly outperform other query methods at scale. One big aspect of bloom filters is the probability of having a false positive. Ideally the lower the better, but oftentimes it's a balancing act that requires consideration of the project constraints.
- The accuracy of the bloom filter depends on a couple of parameters: the number of hash functions k, the length of the bit array m, and the number of items to be stored n. You can use this calculator to find the optimal parameters for the accuracy you're looking for, or compute them with the standard formulas sketched below.
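The calculator is built on the usual closed-form sizing math, which is easy enough to compute yourself. Here's a small sketch (optimalM and optimalK are my own helper names; n and p come from your use case):

#include <cmath>
#include <cstddef>

// Optimal bit array length for n items at false positive probability p:
//   m = -n * ln(p) / (ln 2)^2
size_t optimalM(size_t n, double p) {
    return static_cast<size_t>(std::ceil(
        -static_cast<double>(n) * std::log(p) / (std::log(2.0) * std::log(2.0))));
}

// Optimal number of hash functions for that m and n:
//   k = (m / n) * ln 2
size_t optimalK(size_t m, size_t n) {
    return static_cast<size_t>(std::round(
        static_cast<double>(m) / static_cast<double>(n) * std::log(2.0)));
}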
Ok, now onto some quick benchmarks that I made. In this make believe scenario, I'll be generating n URLs and measuring the speed of different querying techniques. For the bloom filter specifically, I tuned the probability of a false positive to be a 1 in 1,000,000 chance for each value of n. For each test, the times are the result of performing 500 queries against the n URLs. For each query, I chose a uniform distribution where there's a 50% probability of the query containing a URL in the list. In practice, the distribution of hits and misses is an important consideration in which method to use. Also for fun, I'll be tracking the number of false positives.
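For reference, each timed trial looks roughly like the sketch below. The makeQuery and timeQueries helpers are hypothetical names of mine, and the 50/50 hit/miss split matches the setup described above:

#include <chrono>
#include <random>

// Hypothetical helper: returns a URL from `urls` half the time,
// and a URL that definitely isn't in the set the other half
std::string makeQuery(const std::vector<std::string>& urls, std::mt19937& rng) {
    std::uniform_int_distribution<size_t> pick(0, urls.size() - 1);
    if (rng() % 2 == 0) {
        return urls[pick(rng)]; // 50%: a URL that is in the set
    }
    return "absent-" + std::to_string(rng()); // 50%: a URL that isn't
}

// Times numQueries lookups against the filter, in milliseconds
double timeQueries(const BloomFilter& filter, const std::vector<std::string>& urls,
                   size_t numQueries) {
    std::mt19937 rng(42); // fixed seed so runs are comparable
    auto start = std::chrono::steady_clock::now();
    for (size_t i = 0; i < numQueries; i++) {
        volatile bool mightBeInSet = filter.lookup(makeQuery(urls, rng));
        (void)mightBeInSet; // keep the compiler from optimizing the lookup away
    }
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}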
| n value | linear test (ms) | binary test (ms) | hash set test (ms) | bloom filter test (ms) | bloom filter accuracy | bloom filter m value |
|---|---|---|---|---|---|---|
| 100,000 | 2142.781 | 20.683 | 37.136 | 1.128 | 0/500 wrong | 29849749 |
| 1,000,000 | 22111.762 | 223.311 | 398.971 | 3.473 | 0/500 wrong | 298497488 |
| 10,000,000 | 230906.698 | 2334.339 | 4206.678 | 28.278 | 0/500 wrong | 2984974875 |
Note: Adding values to the bloom filter and hash set, and sorting the URLs for binary search, is not accounted for. I am timing just the retrieval part.
Note: For the bloom filter, p (probability of a false positive) and k (number of hash functions) were held constant; the only value that changed is m (size of the bit array).
As you can see, the bloom filter is orders of magnitude faster than the rest of the methods, especially as the number of URLs to search through increases.
Another benchmark that I did was looking at the number of queries that each method can handle. For these tests, I fixed n (the number of URLs to search through) at 1,000,000. For the bloom filter specifically, I set the same p value (0.000001) with all the other parameters being the same. Similar to the previous test, I used a uniform distribution where about 50% of queries contained a URL in the set.
| num_queries | binary_test (ms) | hash_set_test (ms) | bloom_filter_test (ms) | bloom filter accuracy |
|---|---|---|---|---|
| 10,000 | 264.498 | 409.608 | 21.85 | 0 / 10,000 |
| 1,000,000 | 2109.36 | 784.292 | 1060.813 | 2 / 1,000,000 |
| 100,000,000 | 185467.185 | 37705.957 | 102710.91 | 54 / 100,000,000 |
From the above table, it seems like the bloom filter is faster for a smaller number of queries. Interestingly though, as the queries scale, it becomes the second best option, with the hash set being faster. The results sort of make sense: a hash set lookup is a single hash plus a bucket probe, and since n is held constant, collisions stay manageable. The bloom filter, on the other hand, has to compute k hashes per lookup, so there's a little more overhead per query, which explains the difference. Also, a small note on the accuracy: with p = 0.000001 and roughly half of the queries being misses, you'd expect around 50 false positives out of 100,000,000 queries, so the observed counts look pretty normal.
I updated the bloom filter to use a more efficient hash function called murmur3 and retested the query numbers to see if it made a meaningful difference. I used Peter Scott's murmur3 implementation in C and wrapped it in a separate hash function call:
// Computes all k bit indices for a url in one call, deriving them from the
// two 64-bit halves of a single 128-bit murmur3 hash via double hashing
std::vector<size_t> hashFunc2(const std::string& url) const {
    uint64_t hashes[2];
    MurmurHash3_x64_128(url.data(), static_cast<int>(url.size()), 40, hashes);
    const uint64_t h1 = hashes[0];
    const uint64_t h2 = hashes[1];
    std::vector<size_t> indices;
    indices.reserve(k);
    for (size_t i = 0; i < k; ++i) {
        uint64_t nextHash = h1 + static_cast<uint64_t>(i) * h2;
        indices.push_back(nextHash % m);
    }
    return indices;
}
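Slotting this into the class, lookup barely changes; it just walks the precomputed indices (a sketch matching the class from earlier):

bool lookup(const std::string& url) const {
    for (size_t idx : hashFunc2(url)) {
        if (!bitArr[idx]) {
            return false; // an unset bit means definitely not in the set
        }
    }
    return true; // all bits set, so the URL might be in the set
}

As a side note, deriving all k indices as h1 + i * h2 from one 128-bit hash is the classic double hashing trick from Kirsch and Mitzenmacher, which is why a single murmur3 call can stand in for k independent hash functions.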
| num_queries | binary_test (ms) | hash_set_test (ms) | bloom_filter_test (ms) | bloom filter accuracy |
|---|---|---|---|---|
| 10,000 | 267.557 | 418.424 | 12.378 | 0 / 10,000 |
| 1,000,000 | 2103.351 | 797.02 | 540.278 | 1 / 1,000,000 |
| 100,000,000 | 182500.903 | 37481.711 | 52915.78 | 47 / 100,000,000 |
Based on the table above, the new hash function sped the bloom filter up significantly, roughly a 2x improvement, making it faster than the hash set at 1,000,000 queries. However, as the queries continue to scale, it faces the same problem of having to compute multiple hashes per lookup, and the hash set pulls ahead again at 100,000,000 queries.
Ultimately though, bloom filters are probably still a more practical choice than hash sets in large systems since the memory footprint is far lower: by the standard sizing formula, p = 0.000001 works out to roughly 29 bits (under 4 bytes) per item, versus storing entire URL strings plus hash table overhead. They balance large sets with efficient runtimes.
Thanks for reading. Cya.