**Problem**. If we have a large set of structured data (identified by record IDs) stored in a set of data
files, what is
the most efficient way to know which file might contain our required data? We donâ€™t want to read each file,
as that would be slow, and we have to read a lot of data from the disk. One solution can be to build an
index on each data file and store it in a separate index file. This index can map each record ID to its
offset in the data file. Each index file will be sorted on the record ID. Now, if we want to search an ID in
this index, the best we can do is a Binary Search. Can we do better than that?

**Solution**. The Bloom filter data structure tells whether an element may be in a set, or definitely is
not. The only possible errors are false positives, i.e., a search for a nonexistent element might give an
incorrect answer. With more elements in the filter, the error rate increases. An empty Bloom filter is a
bit-array of m bits, all set to 0. There are also k different hash functions, each of which maps a set
element to one of the m bit positions.

The hash functions used in a Bloom filter should be independent and uniformly distributed. They should also be as fast as possible (cryptographic hashes such as sha1, though widely used therefore are not very good choices). Examples of fast, simple hashes that are independent enough include murmur, the fnv series of hashes, and HashMix.

**Applications**. Bloom filters find application in network routers, in web browsers(to detect the
malicious urls), in password checkers(to not a set a weak or guessable or list of forbidden passwords), etc.

Google Bigtable, Apache HBase and Apache Cassandra and PostgreSQL use Bloom filters to reduce the disk
lookups for non-existent rows or columns. Avoiding costly disk lookups considerably increases the
performance of a database query operation.

**Bloom filter vs Hash Table**. Bloom filter is more memory efficient and more faster but does not
accurate (possible false alarms) and does not support remove operation.

```
from math import log
class BloomFilter:
"""Bloom filter"""
def __init__(self, items_count, fp_prob):
assert 0 <= fp_prob <= 1, "Probability should be in range [0,1]"
assert items_count > 0, "Items count must be positive"
self._size = self._evaluate_size(items_count, fp_prob)
self._hash_count = self._evaluate_hash_count(self._size, items_count)
self._bit_array = [0] * self._size
def add(self, item):
"""Add an element to bloom filter"""
for i in range(self._hash_count):
digest = self._evaluate_hash(item, i)
self._bit_array[digest] = True
def __contains__(self, item):
""" x.__contains__(y) <==> y in x (possible false positive). """
for i in range(self._hash_count):
digest = self._evaluate_hash(item, i)
if not self._bit_array[digest]:
return False
return True
@classmethod
def _evaluate_size(cls, items_count, fp_prob):
m = -(items_count * log(fp_prob)) / (log(2) ** 2)
return int(m)
@classmethod
def _evaluate_hash_count(cls, size, items_count):
k = (size / items_count) * log(2)
return int(k)
def _evaluate_hash(self, item, k):
return k * hash(item) % self._size
def test_contains():
f = BloomFilter(items_count=100, fp_prob=0.01)
f.add('x')
assert 'x' in f
def test_not_contains():
f = BloomFilter(items_count=100, fp_prob=0.01)
f.add('x')
assert 'y' not in f
```