What are your real-world applications of this versatile data structure?
They are useful for optimization in databases like SQLite and query engines like Apache Spark. Application developers can use them as compact representations of user data, for example to filter out items a user has already seen.
The linked site gives a short introduction to Bloom filters along with some links to further reading:
A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set. The price paid for this efficiency is that a Bloom filter is a probabilistic data structure: it tells us that the element either definitely is not in the set or may be in the set.
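To make that definition concrete, here is a minimal sketch in Python. The class name, the default sizes, and the trick of salting SHA-256 with an index to get several hash functions are all assumptions for illustration, not something the linked site prescribes.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: num_bits bits, num_hashes hash functions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions by salting one hash (SHA-256) with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # All k bits set: "may be in the set". Any bit clear: "definitely not".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


seen = BloomFilter()
seen.add("hello")
print("hello" in seen)  # True: an added item is never reported as missing
```

A lookup that returns False is always correct. A lookup that returns True only means "possibly present", because other items may have set the same bits.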
A classic use for them is spam filtering.
Suppose you have a set of spam detection systems or rules that are somewhat expensive to execute, e.g. an ML model or a keyword blocklist. Spam tends to come in waves, and often it is as simple as the same message being reposted dozens of times.
Once your systems determine a piece of content is spam (or you manually flag content), it’s a good idea to insert the content into a Bloom filter. Future posts of the identical content will then be flagged without executing the expensive checks, which helps most when a surge of content is stressing your systems.
Since it’s probabilistic, you can’t use this unless you have some sort of manual review queue or system, because false positives will occasionally be flagged. However, you can also run more intensive checks once you’ve flagged content, to detect false positives.
The false positives can also be a feature, not a bug: with careful choice of hash functions, your Bloom filter can actually detect slightly modified content, since most of the hashes may still be the same.
I’ve worked at companies that use this strategy, so it’s very much real-world.
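A rough sketch of that flow, in Python. Everything here is hypothetical: `expensive_spam_check` stands in for whatever ML model or blocklist you actually run, and the filter size and hash count are arbitrary assumptions.

```python
import hashlib

NUM_BITS = 1 << 16   # assumed filter size in bits
NUM_HASHES = 4       # assumed number of hash functions
spam_bits = bytearray(NUM_BITS // 8)
stats = {"expensive_calls": 0}  # counts how often the slow path runs


def _positions(text):
    # k positions from one hash algorithm, salted with the hash index.
    for i in range(NUM_HASHES):
        digest = hashlib.sha256(f"{i}:{text}".encode("utf-8")).digest()
        yield int.from_bytes(digest[:8], "big") % NUM_BITS


def remember_spam(text):
    for pos in _positions(text):
        spam_bits[pos // 8] |= 1 << (pos % 8)


def maybe_known_spam(text):
    return all(spam_bits[pos // 8] & (1 << (pos % 8)) for pos in _positions(text))


def expensive_spam_check(text):
    """Hypothetical stand-in for an ML model or keyword blocklist."""
    stats["expensive_calls"] += 1
    return "free money" in text.lower()


def classify(text):
    if maybe_known_spam(text):
        return "flagged-by-filter"   # fast path: skip the expensive check
    if expensive_spam_check(text):
        remember_spam(text)          # later reposts will hit the fast path
        return "flagged-by-check"
    return "ok"


results = [classify("FREE MONEY, click here!") for _ in range(3)]
print(results)                   # ['flagged-by-check', 'flagged-by-filter', 'flagged-by-filter']
print(stats["expensive_calls"])  # 1
```

Only the first copy of the message pays for the expensive check. The reposts are caught by the filter alone.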
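To get a feel for how often those false positives show up, you can use the standard approximation p ≈ (1 - e^(-kn/m))^k for a filter with m bits, k hash functions, and n inserted items. The sketch below just evaluates that formula. The function names and example sizes are my own assumptions.

```python
import math


def false_positive_rate(num_bits, num_hashes, num_items):
    """Approximate false-positive probability: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


def optimal_num_hashes(num_bits, num_items):
    """Hash count that minimizes the approximation: k = (m/n) * ln 2."""
    return max(1, round(num_bits / num_items * math.log(2)))


n = 100_000                    # assumed number of spam messages stored
m = 10 * n                     # assumed budget of 10 bits per message
k = optimal_num_hashes(m, n)   # 7 for this ratio
rate = false_positive_rate(m, k, n)
print(k, f"{rate:.2%}")        # roughly 0.8% of clean posts would be flagged
```

With about 10 bits per stored item, under 1% of lookups for unseen content come back as "maybe", which gives you a rough idea of how much load the review queue has to absorb.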
Cool, so in this case your filter is basically a classifier ML model. How would you choose the hash functions, though?
Usually it’s a bunch of different string hashes of the text content. They could be different hashing algorithms, but it’s more common to take a single hash algorithm and simply create a bunch of hash functions that operate on different parts of the data.
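Here is one way that could look, as a sketch only: split the text into k chunks and run the same algorithm (SHA-256 here) over each chunk to get k bit positions. The chunking scheme, the function name, and the sizes are assumptions for illustration, not the exact setup from any system I worked on.

```python
import hashlib


def chunk_positions(text, num_hashes=4, num_bits=1 << 16):
    """One hash algorithm, applied to different parts of the data."""
    data = text.encode("utf-8")
    size = max(1, -(-len(data) // num_hashes))  # ceiling division
    positions = []
    for i in range(num_hashes):
        chunk = data[i * size:(i + 1) * size]
        # Prefix the chunk index so equal chunks in different slots differ.
        digest = hashlib.sha256(i.to_bytes(4, "big") + chunk).digest()
        positions.append(int.from_bytes(digest[:8], "big") % num_bits)
    return positions


original = chunk_positions("Buy cheap watches now at example dot com today!")
modified = chunk_positions("Buy cheap watches now at example dot com today?")
matching = sum(a == b for a, b in zip(original, modified))
print(matching)  # at least 3 of the 4 positions are unchanged
```

Note that a plain Bloom filter lookup still requires every position to be set. Treating "most positions match" as "probably the same spam" is an extra rule you would have to add on top, and this simple chunking only tolerates edits that keep the length the same.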
If it’s not text data, there are a whole bunch of other hashing strategies, but I only ever saw Bloom filters used with text.