Bloom filters: real-world applications

Reader9@programming.dev · 1 year ago

the_sisko@startrek.website · 1 year ago

A classic use for them is spam filtering.

Suppose you have a set of spam detection systems or rules that are somewhat expensive to execute, e.g. an ML model or a keyword blocklist. Spam tends to come in waves, and often the attack is as simple as reposting the same message dozens of times.

Once your systems determine a piece of content is spam (or you manually flag content), it’s a good idea to insert the content into a bloom filter. Future posts of identical content are then flagged without running the expensive checks, which matters most when a surge of content is stressing your systems.
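As a rough illustration of that flow, here is a minimal Python sketch. Everything in it is an assumption for the example: the `BloomFilter` class, its sizes, the use of salted BLAKE2b as the hash, and the `slow_check` stand-in for the expensive systems. It is not the setup any particular company uses.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; the sizes are illustrative defaults, not tuned values."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, text):
        data = text.encode("utf-8")
        for i in range(self.num_hashes):
            # Same algorithm each time, with a different salt per hash function.
            digest = hashlib.blake2b(data, digest_size=8, salt=bytes([i])).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, text):
        for pos in self._positions(text):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, text):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(text))


def is_spam(text, known_spam, expensive_check):
    """Return (verdict, source); the filter is consulted before the slow check."""
    if text in known_spam:
        return True, "bloom"          # may be a false positive
    if expensive_check(text):
        known_spam.add(text)          # remember it for the next repost
        return True, "expensive"
    return False, "expensive"


calls = []

def slow_check(text):
    """Hypothetical stand-in for an ML model or keyword blocklist."""
    calls.append(text)
    return "free crypto" in text.lower()

seen = BloomFilter()
print(is_spam("FREE CRYPTO, click here", seen, slow_check))  # (True, 'expensive')
print(is_spam("FREE CRYPTO, click here", seen, slow_check))  # (True, 'bloom')
print(len(calls))                                            # 1
```

The second call never reaches `slow_check`, which is the whole point during a wave of identical reposts.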

Since it’s probabilistic, you can’t use this unless you have some sort of manual reviewing queue or system, as it’s possible for false positives to be flagged. However, you can also run more intensive checks once you’ve flagged content, to detect false positives.
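To get a feel for how many false positives a review queue would have to absorb, the standard approximation for a bloom filter's false-positive rate can be computed directly. This is a sketch that assumes the hash functions behave like independent, uniform hashes; the function name and the example sizes are made up for illustration.

```python
import math


def false_positive_rate(num_bits, num_hashes, num_items):
    """Approximate false-positive rate: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


# Illustrative sizing: about a million bits, 4 hashes, 100,000 stored spam messages.
rate = false_positive_rate(1 << 20, 4, 100_000)
print(f"{rate:.2%}")  # roughly 1%
```

The rate climbs as more content is inserted, so a filter that is never reset or resized will send a growing share of legitimate posts to review.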

The false positives can also be a feature, not a bug: with careful choice of hash functions, your bloom filter can actually detect slightly modified content if you count a message as a match when most of its hashes hit, since a small edit leaves most of the hashes the same.
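One way to read that idea, sketched in Python: give each hash function its own slice of the text, then flag a message when most of its bit positions are already set. Note that this is a relaxed variant, not the standard bloom filter lookup (which requires every position to be set), and the slicing scheme, `NUM_PARTS`, and the 0.75 threshold are all assumptions chosen for the example.

```python
import hashlib

NUM_BITS = 1 << 16   # illustrative filter size
NUM_PARTS = 8        # one hash function per slice of the text

bits = bytearray(NUM_BITS // 8)


def part_positions(text):
    """One bit position per slice: same algorithm, different part of the data."""
    data = text.encode("utf-8")
    step = max(1, -(-len(data) // NUM_PARTS))  # ceiling division
    positions = []
    for i in range(NUM_PARTS):
        chunk = data[i * step:(i + 1) * step]
        digest = hashlib.blake2b(chunk, digest_size=8, salt=bytes([i])).digest()
        positions.append(int.from_bytes(digest, "big") % NUM_BITS)
    return positions


def add(text):
    for pos in part_positions(text):
        bits[pos // 8] |= 1 << (pos % 8)


def match_score(text):
    """Fraction of this text's bit positions that are already set."""
    positions = part_positions(text)
    hits = sum(1 for pos in positions if bits[pos // 8] & (1 << (pos % 8)))
    return hits / len(positions)


def looks_like_known_spam(text, threshold=0.75):
    return match_score(text) >= threshold


original = "Buy cheap watches now at example.com, limited offer!!!"
edited = "Buy cheap w4tches now at example.com, limited offer!!!"  # one character swapped

add(original)
print(match_score(original))          # 1.0
print(looks_like_known_spam(edited))  # True: at least 7 of the 8 slices are unchanged
```

The catch with position-based slices is that this only survives same-length substitutions. Inserting or deleting a character shifts every later slice, so a real system would need a more robust way of splitting the content.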

I’ve worked at companies that use this strategy, so it’s very much real world.

noli@programming.dev · 1 year ago

Cool, so in this case your filter is basically an ML classifier. How would you choose the hash functions, though?

the_sisko@startrek.website · 1 year ago

Usually it’s a bunch of different string hashes of the text content. They could be different hashing algorithms, but it’s more common to take a single hash algorithm and simply create a bunch of hash functions that operate on different parts of the data.
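Here is a small Python sketch of what "one algorithm, many hash functions" can look like. `sliced_positions` follows the description above (function i only reads the i-th slice of the text). `salted_positions` is added purely for contrast and is my own assumption: it hashes the whole text with k different salts, which is the usual exact-match construction. The names, the choice of BLAKE2b, and the sizes are all illustrative.

```python
import hashlib


def _position(data, index, num_bits):
    """Hash `data` with BLAKE2b, salted by the function index, into a bit position."""
    digest = hashlib.blake2b(data, digest_size=8, salt=bytes([index])).digest()
    return int.from_bytes(digest, "big") % num_bits


def salted_positions(text, k=8, num_bits=1 << 16):
    """k functions that all read the whole text, differing only in their salt."""
    data = text.encode("utf-8")
    return [_position(data, i, num_bits) for i in range(k)]


def sliced_positions(text, k=8, num_bits=1 << 16):
    """k functions where function i reads only the i-th slice of the text."""
    data = text.encode("utf-8")
    step = max(1, -(-len(data) // k))  # ceiling division
    return [_position(data[i * step:(i + 1) * step], i, num_bits) for i in range(k)]


def shared(a, b):
    """Count hash functions that produced the same position for both texts."""
    return sum(x == y for x, y in zip(a, b))


original = "Buy cheap watches now at example.com, limited offer!!!"
edited = "Buy cheap w4tches now at example.com, limited offer!!!"  # one character swapped

print(shared(sliced_positions(original), sliced_positions(edited)))  # at least 7 of 8
print(shared(salted_positions(original), salted_positions(edited)))  # almost certainly 0
```

With whole-text hashing, any edit scrambles every position, so only exact reposts match. With per-slice hashing, a one-character swap changes just one position, which is what makes partial matches possible.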

If it’s not text data, there are a whole bunch of other hashing strategies, but I only ever saw bloom filters used with text.