From Vectors to Buckets: How LSH Actually Works Under the Hood


February 19, 2026


machine-learning
algorithms
search

Why Brute-Force Search Fails at Scale

You've got a growing pile of BERT embeddings. A query comes in. You compute cosine similarity against every vector. It works beautifully. Until it doesn't.

At a few thousand vectors, brute-force search is fine. At a few million? It's a bottleneck. CPU usage spikes, latency creeps up, and suddenly your "simple" similarity search becomes your biggest infrastructure cost.

This is the scaling wall that motivates approximate nearest neighbor (ANN) search. One of the earliest, cleanest solutions is Locality Sensitive Hashing (LSH). Let's break it down, keeping the visuals that matter and cutting the noise.

Here's what brute-force looks like in Python:

import numpy as np

def brute_force_search(query, vectors, top_k=5):
    # Cosine similarity of the query against every vector: O(N * d)
    scores = np.dot(vectors, query) / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
    )
    # Indices of the top-k highest scores, best first
    return np.argsort(scores)[-top_k:][::-1]

Every query compares against every vector. Time complexity is O(N × d), where N is the number of vectors and d is the dimensionality. As N grows, latency grows linearly.


Diagram 1: Brute-force search, query vector compared against every vector in the dataset

And here's the deeper issue: traditional indexing structures like KD-Trees degrade in high dimensions (the curse of dimensionality). Once you're in 100+ dimensions, they behave almost like brute-force. So we need a strategy that avoids scanning everything. That strategy is Locality Sensitive Hashing.
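One way to make the curse of dimensionality concrete: as dimensionality grows, the nearest and farthest points from a query become almost equally far away, which is exactly what breaks space-partitioning structures like KD-Trees. A small illustrative sketch (the helper name and parameters are mine, not from any library):

```python
import numpy as np

def distance_contrast(dim, n=2000, seed=0):
    """Ratio of the farthest to the nearest distance from a random query.

    As dimensionality grows, this ratio approaches 1: all points look
    roughly equidistant, so "nearest neighbor" loses discriminative power.
    """
    rng = np.random.default_rng(seed)
    points = rng.standard_normal((n, dim))
    query = rng.standard_normal(dim)
    dists = np.linalg.norm(points - query, axis=1)
    return dists.max() / dists.min()

print(distance_contrast(2))     # large ratio: neighbors are meaningful
print(distance_contrast(1000))  # ratio close to 1: distances concentrate
```

In 2 dimensions the nearest point is dramatically closer than the farthest; in 1000 dimensions the two distances are nearly the same.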


The Core Idea Behind Locality Sensitive Hashing

Traditional hash functions try to prevent collisions. LSH tries to cause collisions, but only for similar vectors.

The central guarantee:

The probability that two vectors hash to the same bucket increases as their similarity increases.

Instead of searching the entire dataset, we:

  1. Hash all vectors into buckets
  2. Hash the query
  3. Search only matching buckets
  4. Compute exact similarity on those candidates

Diagram 2: LSH pipeline, embeddings to random hyperplanes to binary signatures to hash buckets to filtered search

That's the shift: from global search to probabilistic filtering.


The Geometry: Random Hyperplane LSH

For cosine similarity (common with BERT embeddings), we use Random Hyperplane LSH. Here's the intuition.

A hyperplane divides space into two halves. For each hyperplane, compute dot(vector, plane). If the result is >= 0, the bit is 1. If < 0, the bit is 0. Each hyperplane gives one bit. K hyperplanes give a K-bit signature.

def generate_hyperplanes(num_planes, dim):
    # Each row is the normal vector of one random hyperplane
    return np.random.randn(num_planes, dim)

def hash_vector(vector, hyperplanes):
    # One bit per hyperplane: 1 on the positive side, 0 on the negative
    return tuple((np.dot(hyperplanes, vector) >= 0).astype(int))

Diagram 3: Hyperplane projection, a vector is projected against K hyperplanes to produce a K-bit binary signature

If a vector falls on the same side of many hyperplanes as another vector, it shares many bits, and it likely lands in the same bucket.

For a single random hyperplane, the collision probability between two vectors with angle θ is P = 1 − θ/π. Smaller angle means higher collision probability. That's why this scheme aligns naturally with cosine similarity.
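You can sanity-check that formula empirically: fix two vectors at a known angle, draw many random hyperplanes, and count how often the pair lands on the same side. A small sketch (all names here are illustrative, not from the post's code):

```python
import numpy as np

rng = np.random.default_rng(42)
dim, trials = 64, 20_000

# Two unit vectors at a known angle theta = 45 degrees
theta = np.pi / 4
a = np.zeros(dim); a[0] = 1.0
b = np.zeros(dim); b[0] = np.cos(theta); b[1] = np.sin(theta)

# For each random hyperplane, check whether a and b get the same bit
planes = rng.standard_normal((trials, dim))
same_side = ((planes @ a >= 0) == (planes @ b >= 0)).mean()

print(f"empirical: {same_side:.3f}  theory: {1 - theta / np.pi:.3f}")
```

For θ = π/4 the theory predicts 0.75, and the empirical fraction lands very close to it.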


Why One Hash Table Isn't Enough

With just one table, similar vectors might still land in different buckets. To improve recall, we use multiple independent hash tables. Each table has its own random hyperplanes and produces independent signatures.

A vector is considered a candidate if it matches in any table. This dramatically increases the probability of finding true neighbors without making buckets too large. This multi-table trick is the real practical upgrade from the textbook version of LSH.
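The effect of multiple tables is easy to quantify. If two vectors agree on each individual bit with probability p, a K-bit signature matches with probability p^K, and at least one of L independent tables matches with probability 1 − (1 − p^K)^L. A quick illustrative calculation (the helper name is mine):

```python
def collision_prob(p, num_planes, num_tables):
    # p: per-hyperplane agreement probability (p = 1 - theta/pi)
    # One K-bit table matches only if all K bits match: p**K
    single = p ** num_planes
    # At least one of L independent tables matches:
    return 1 - (1 - single) ** num_tables

p = 0.9  # vectors at a small angle: each bit agrees 90% of the time
print(collision_prob(p, num_planes=10, num_tables=1))   # one table: ~0.35
print(collision_prob(p, num_planes=10, num_tables=10))  # ten tables: ~0.99
```

With a single 10-bit table, even quite similar vectors (p = 0.9) only collide about a third of the time; ten independent tables push that close to certainty while dissimilar pairs (small p) stay filtered out.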


The Full LSH Query Pipeline

Here's a minimal LSH implementation that ties it all together:

from collections import defaultdict

import numpy as np

class LSH:
    def __init__(self, num_tables, num_planes, dim):
        # One bucket map and one set of random hyperplanes per table
        self.tables = [defaultdict(list) for _ in range(num_tables)]
        self.hyperplanes = [
            np.random.randn(num_planes, dim) for _ in range(num_tables)
        ]

    def _hash(self, vector, table_idx):
        # K-bit signature: one bit per hyperplane
        return tuple(
            (np.dot(self.hyperplanes[table_idx], vector) >= 0).astype(int)
        )

    def index(self, vectors):
        for i, vec in enumerate(vectors):
            for t in range(len(self.tables)):
                key = self._hash(vec, t)
                self.tables[t][key].append(i)

    def query(self, vector, vectors, top_k=5):
        # Candidate = any vector that collides with the query in ANY table
        candidates = set()
        for t in range(len(self.tables)):
            key = self._hash(vector, t)
            # .get avoids creating empty buckets on lookup misses
            candidates.update(self.tables[t].get(key, []))
        # exact re-ranking on candidates only
        scores = {
            i: np.dot(vectors[i], vector) / (
                np.linalg.norm(vectors[i]) * np.linalg.norm(vector)
            )
            for i in candidates
        }
        return sorted(scores, key=scores.get, reverse=True)[:top_k]

Diagram 4: Full query pipeline, query vector is normalized, hashed across L tables, candidates collected and deduplicated, exact cosine computed, top-K returned

Important detail: LSH does not eliminate exact similarity. It reduces how many vectors you compare exactly. That hybrid design, probabilistic filtering + exact re-ranking, is what makes it effective.


Why LSH Scales

Brute-force search always scans all N vectors, so latency grows linearly. LSH scans only the contents of matching buckets; since the candidate set is usually a small fraction of N, query time is sublinear in practice.

A quick benchmark to see this in action:

import time

import numpy as np

dim, n_vectors = 128, 500_000
vectors = np.random.randn(n_vectors, dim)
query = np.random.randn(dim)

# perf_counter is the appropriate clock for short benchmarks
start = time.perf_counter()
brute_force_search(query, vectors)
print(f"Brute-force: {time.perf_counter() - start:.4f}s")

lsh = LSH(num_tables=5, num_planes=10, dim=dim)
lsh.index(vectors)

start = time.perf_counter()
lsh.query(query, vectors)
print(f"LSH:         {time.perf_counter() - start:.4f}s")

Benchmarks like this often show speedups of one to two orders of magnitude once datasets reach hundreds of thousands of vectors, because the candidate set stays small while N keeps growing. The larger N grows, the more dramatic the difference becomes.


Where LSH Fits Today

Modern vector databases often prefer graph-based ANN methods like HNSW because they achieve better recall-speed tradeoffs. But Locality Sensitive Hashing still matters because:

  • It's conceptually simple
  • It's easy to implement
  • It works well for cosine similarity
  • It's effective for deduplication
  • It scales cleanly in distributed systems

And perhaps most importantly, understanding LSH makes understanding modern ANN indexing much easier. It's foundational.
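As a concrete example of the deduplication use case above, identical signatures can serve as a cheap first-pass grouping before any exact comparison. A minimal sketch, assuming embeddings as NumPy rows (the helper name and parameters are mine):

```python
import numpy as np
from collections import defaultdict

def near_duplicate_groups(vectors, num_planes=16, seed=0):
    # Hash every vector once; identical K-bit signatures are a cheap
    # approximate signal of near-duplicates. Exact similarity can then
    # be computed within each (small) group.
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((num_planes, vectors.shape[1]))
    groups = defaultdict(list)
    for i, v in enumerate(vectors):
        key = tuple((planes @ v >= 0).astype(int))
        groups[key].append(i)
    # Keep only buckets that actually contain more than one vector
    return [idxs for idxs in groups.values() if len(idxs) > 1]

# Two copies of the same embedding land in the same group;
# unrelated random vectors almost never share a 16-bit signature
rng = np.random.default_rng(1)
base = rng.standard_normal(32)
vecs = np.vstack([base, base, rng.standard_normal((5, 32))])
print(near_duplicate_groups(vecs))
```

Because hashing is a single pass over the data, this turns an O(N²) all-pairs comparison into "compare only within buckets", which is why LSH remains popular for large-scale dedup.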


When Not to Use LSH

Avoid LSH when your dataset is small (<10K vectors), you need exact nearest neighbors, or you require ultra-high recall.

In small systems, brute-force with optimized NumPy is often simpler and faster. In high-precision production systems, graph-based methods usually win. But in large-scale similarity pipelines where speed matters more than perfect recall, LSH is still a powerful tool.


Final Thoughts

Locality Sensitive Hashing flips the usual idea of hashing on its head. Instead of spreading items apart, it brings similar items together.

By using random hyperplanes, probabilistic collision guarantees, and multiple hash tables, LSH transforms similarity search from "compare against everything" to "compare where similarity is likely." From vectors to buckets.

I recently built SuperBit with ❤️: an LSH implementation in Rust with bindings for both Rust (superbit_lsh on crates.io) and Python (superbit-lsh on PyPI). Check it out if you want to try LSH in your own projects.

