Bloom filter


Problem. If we have a large set of structured data (identified by record IDs) stored in a set of data files, what is the most efficient way to know which file might contain our required data? We don’t want to read each file, as that would be slow, and we have to read a lot of data from the disk. One solution can be to build an index on each data file and store it in a separate index file. This index can map each record ID to its offset in the data file. Each index file will be sorted on the record ID. Now, if we want to search an ID in this index, the best we can do is a Binary Search. Can we do better than that?

Solution. The Bloom filter data structure tells whether an element may be in a set, or definitely is not. The only possible errors are false positives, i.e., a search for a nonexistent element might give an incorrect answer. With more elements in the filter, the error rate increases. An empty Bloom filter is a bit-array of m bits, all set to 0. There are also k different hash functions, each of which maps a set element to one of the m bit positions.

The hash functions used in a Bloom filter should be independent and uniformly distributed. They should also be as fast as possible (cryptographic hashes such as sha1, though widely used therefore are not very good choices). Examples of fast, simple hashes that are independent enough include murmur, the fnv series of hashes, and HashMix.

Applications. Bloom filters find application in network routers, in web browsers(to detect the malicious urls), in password checkers(to not a set a weak or guessable or list of forbidden passwords), etc.
Google Bigtable, Apache HBase and Apache Cassandra and PostgreSQL use Bloom filters to reduce the disk lookups for non-existent rows or columns. Avoiding costly disk lookups considerably increases the performance of a database query operation.

Bloom filter vs Hash Table. Bloom filter is more memory efficient and more faster but does not accurate (possible false alarms) and does not support remove operation.

from math import log


class BloomFilter:
    """Bloom filter"""

    def __init__(self, items_count, fp_prob):
        assert 0 <= fp_prob <= 1, "Probability should be in range [0,1]"
        assert items_count > 0, "Items count must be positive"

        self._size = self._evaluate_size(items_count, fp_prob)
        self._hash_count = self._evaluate_hash_count(self._size, items_count)
        self._bit_array = [0] * self._size

    def add(self, item):
        """Add an element to bloom filter"""
        for i in range(self._hash_count):
            digest = self._evaluate_hash(item, i)
            self._bit_array[digest] = True

    def __contains__(self, item):
        """ x.__contains__(y) <==> y in x (possible false positive). """
        for i in range(self._hash_count):
            digest = self._evaluate_hash(item, i)
            if not self._bit_array[digest]:
                return False
        return True

    @classmethod
    def _evaluate_size(cls, items_count, fp_prob):
        m = -(items_count * log(fp_prob)) / (log(2) ** 2)
        return int(m)

    @classmethod
    def _evaluate_hash_count(cls, size, items_count):
        k = (size / items_count) * log(2)
        return int(k)

    def _evaluate_hash(self, item, k):
        return k * hash(item) % self._size


def test_contains():
    f = BloomFilter(items_count=100, fp_prob=0.01)
    f.add('x')
    assert 'x' in f


def test_not_contains():
    f = BloomFilter(items_count=100, fp_prob=0.01)
    f.add('x')
    assert 'y' not in f