Avoid the pitfall! If you are not careful, the vector database in your RAG system will burn hundreds of thousands of dollars every year!

Written by
Caleb Hayes
Updated on: June 28, 2025
Recommendation

When managing a RAG system, the high cost of vector databases cannot be ignored.

Core content:
1. Analysis of storage and indexing costs of billion-level vector databases
2. Impact of dimensionality curse on costs and its solution
3. Vector "under-quantization" problem and its optimization strategy

A few days ago, I was chatting with a former colleague whose team is responsible for a vector database holding a billion vectors. To keep from blowing up the disk, they used HNSW indexes and a 768-dimensional embedding model, but the boss was still at his wits' end: the annual cloud bill ran into the hundreds of thousands of dollars!

The problem wasn’t that they chose the wrong embedding model, or even the wrong vector database. The problem was simple: their vectors were too big.

Why a billion embedding vectors cost so much

Taking 768 dimensions as an example, each float32 vector is 768 × 4 bytes = 3072 bytes, so one billion of them is 3.07 TB of raw data. Add roughly 15% for HNSW index overhead, metadata, and so on, and the total comes to about 3.5 TB.

Using gp3 disks on AWS, the base rate is $0.08/GB·month, or roughly $80/TB·month, so storage for one node runs about $280 per month. But to hit sub-10-millisecond vector search latency you usually need to provision additional IOPS and throughput, which pushes the effective rate closer to $400–450/TB·month. At $450/TB·month, 3.5 TB of storage costs about $1,575 per node per month. Add replicas, development/test clusters, data ingestion pipelines, and compute nodes, and the monthly search bill can easily reach five figures.
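If you want to sanity-check these numbers yourself, here is a minimal back-of-the-envelope sketch in Python. The vector count, dimensions, 15% overhead, and $/TB·month rates are just the assumptions from the paragraph above, not quoted prices.

```python
def storage_tb(num_vectors, dims, bytes_per_dim=4, index_overhead=0.15):
    """Raw vector bytes plus HNSW/metadata overhead, in TB."""
    return num_vectors * dims * bytes_per_dim * (1 + index_overhead) / 1e12

def monthly_storage_cost(tb, usd_per_tb_month):
    return tb * usd_per_tb_month

tb = storage_tb(1_000_000_000, 768)      # ≈ 3.53 TB
print(monthly_storage_cost(tb, 80))      # base gp3 rate            -> ≈ $283
print(monthly_storage_cost(tb, 450))     # with extra IOPS/throughput -> ≈ $1,590
```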

The curse of dimensionality makes the situation worse. The higher the dimensionality, the sparser the data. To maintain high recall, the index needs to retrieve more candidate vectors, resulting in higher CPU and I/O costs. In short, every additional dimension is another brick in the cost wall.

But the real cost of a large index isn’t even the infrastructure cost, it’s the maintenance cost — you need a large search team to build and maintain it. If you can reduce the size of your index by 10x, you might be able to save 2–6 dedicated search engineers!

The cost of vector “underquantization”

"Underquantization" means that you retain more embedding dimensions than you actually need. In 2024, many people directly use the default output of the model - such as 768 or 1536 dimensions - because everyone thinks "the more the better". More importantly, when these search processes were built, quantization technology was not mature enough, so the final system was more bloated than the state-of-the-art technology (SOTA) at the time. And vector databases have no incentive to educate developers to use quantization correctly - because not quantizing can make them more money!

Modern benchmarks show that, with an appropriate embedding model, 64-dimensional binary vectors can match the performance of 1024-dimensional vectors on many search tasks. And while 1-bit-per-dimension binary quantization helps a lot, it cannot rescue a vector whose dimensionality is already far too high.
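To make "1 bit per dimension" concrete, here is a minimal NumPy sketch of binary quantization with a brute-force Hamming-distance search. The sign threshold and the brute-force scan are illustrative simplifications, not any particular vector database's implementation.

```python
import numpy as np

def binarize(vectors):
    """Quantize float vectors to packed bits: 1 where the coordinate is positive."""
    bits = (vectors > 0).astype(np.uint8)
    return np.packbits(bits, axis=1)            # 64 dims -> 8 bytes per vector

def hamming_search(packed_query, packed_index, k=10):
    """Brute-force Hamming distance over the packed index (illustration only)."""
    dist = np.unpackbits(packed_index ^ packed_query, axis=1).sum(axis=1)
    return np.argsort(dist)[:k]

vecs = np.random.randn(100_000, 64).astype(np.float32)
packed = binarize(vecs)                          # 100k vectors in ~0.8 MB
top_ids = hamming_search(binarize(vecs[:1]), packed)
```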

To see the huge savings that quantization brings at internet scale, let's do the math for 2 billion vectors:

768-dimensional, float32:

  • Raw data: 2e9 × 768 × 4 bytes ≈ 6.14 TB

  • After indexing (~25% overhead): ≈ 7.68 TB

128-dimensional, float32:

  • Raw data: 2e9 × 128 × 4 bytes = 1.024 TB

  • After indexing: ≈ 1.28 TB

64-dimensional, float32:

  • Raw data: 2e9 × 64 × 4 bytes = 0.512 TB

  • After indexing: ≈ 0.64 TB

64-dimensional, binary (1 bit/dimension):

  • Raw data: 2e9 × 64 bits ÷ 8 = 16 GB (0.016 TB)

  • HNSW pointer overhead: 2e9 × 256 B ≈ 0.512 TB

  • Total index size: ≈ 0.016 + 0.512 = 0.528 TB
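A small script that reproduces these estimates (the 25% HNSW overhead and 256 bytes of graph pointers per vector are the assumptions used above):

```python
N = 2_000_000_000  # 2 billion vectors

def float32_index_tb(dims, overhead=0.25):
    return N * dims * 4 * (1 + overhead) / 1e12

def binary_index_tb(dims, pointer_bytes=256):
    return (N * dims / 8 + N * pointer_bytes) / 1e12

print(float32_index_tb(768))   # ≈ 7.68 TB
print(float32_index_tb(128))   # ≈ 1.28 TB
print(float32_index_tb(64))    # ≈ 0.64 TB
print(binary_index_tb(64))     # ≈ 0.53 TB (0.016 data + 0.512 pointers)
```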

This is an order of magnitude reduction in data volume, which can significantly reduce costs, speed up searches, and more importantly, make it easier to manage for a small search team.

New progress in quantization and disk indexing in 2025

In 2025, we have new tools that allow us to achieve these compressions more safely. Models trained with nested embeddings (Matryoshka representation learning) can have the first 64 or 128 dimensions carry almost all the semantic information. You can simply truncate the rest without retraining a lower-dimensional model. Such models are also usually trained to accommodate binary or scalar quantization.
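If your model was trained this way, truncation is nearly a one-liner. A minimal sketch, assuming a Matryoshka-style model whose leading dimensions carry most of the signal (the 768 → 64 numbers are illustrative):

```python
import numpy as np

def truncate_and_normalize(embeddings, dims=64):
    """Keep the first `dims` coordinates and re-normalize for cosine search."""
    cut = embeddings[:, :dims]
    norms = np.linalg.norm(cut, axis=1, keepdims=True)
    return cut / np.clip(norms, 1e-12, None)

full = np.random.randn(1_000, 768).astype(np.float32)   # stand-in for model output
small = truncate_and_normalize(full, 64)                 # 12x smaller, same interface
```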

On the indexing side, KX's qHNSW disk engine stores most of the vector data on SSD and keeps only minimal metadata in memory. Millions of 64-dimensional vectors take up only a few GB, query times stay within 200 milliseconds, and CPU usage drops sharply.

The key is to use a two-stage retrieval:

1. First select the top K candidates using a compact and cheap index;

2. Then rerank the candidates with a cross-encoder or BM25.

This way, even if you lose a little precision during quantization, you can win it back through reranking. In real benchmarks, the recall drop is usually within 5%. It makes more sense to quickly build an aggressively quantized index first and then focus on fine-tuning the reranker: after all, it is much easier to modify a rerank service than to re-embed a billion data points!
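In code, the pattern looks roughly like this. `coarse_index.search` and `cross_encoder.score` are placeholders for whatever ANN index and reranker you actually run; the candidate counts are typical but arbitrary choices.

```python
def two_stage_search(query_text, query_vec, coarse_index, cross_encoder,
                     documents, k_coarse=200, k_final=10):
    # Stage 1: cheap candidate generation on the compact, quantized index.
    candidate_ids = coarse_index.search(query_vec, k=k_coarse)

    # Stage 2: rerank only those candidates with a cross-encoder (or BM25).
    scored = [(doc_id, cross_encoder.score(query_text, documents[doc_id]))
              for doc_id in candidate_ids]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in scored[:k_final]]
```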

Simple steps to reduce costs

  • Start with small dimensions. Try 64 or 128 dimensions instead of the default 768 or 1536.

  • Quantize aggressively. Converting from float32 to int8 or binary is table stakes; the real cost savings come from reducing the dimensionality itself.

  • Use a two-stage search: first a rough screening on a compact index, then a reranking of a small number of candidates.

  • If possible, benchmark rigorously. Measure accuracy, latency, and cost before and after each change (see the sketch after this list). But if you don't plan to test, don't use high-dimensional embeddings: you simply have no evidence that they perform better on your data.

  • Don't rely on defaults. Many vector databases default to high-dimensional models and full-memory indexes. Use qHNSW on disk instead, and set the dimensions explicitly.
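For the benchmarking step, a minimal recall-and-latency harness is enough to start with. `compact_search` and `exact_search` are placeholders for your quantized index and a full-precision exact baseline over the same data.

```python
import time

def recall_and_latency(queries, compact_search, exact_search, k=10):
    """Recall@k of the compact index against exact search, plus its avg latency."""
    hits, elapsed = 0, 0.0
    for q in queries:
        truth = set(exact_search(q, k))        # ground truth from exact search
        t0 = time.perf_counter()
        approx = compact_search(q, k)          # the index under test
        elapsed += time.perf_counter() - t0
        hits += len(truth & set(approx))
    recall = hits / (len(queries) * k)
    return recall, elapsed / len(queries) * 1000   # (recall@k, avg ms per query)
```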

Vector search is not a game of “the more dimensions the better” or “whoever fills memory fastest wins”: it is an engineering optimization problem. In 2025, with mature quantization techniques and disk-based indexing, you can run large-scale search on one or two machines. Prune your vectors, optimize your pipeline, and infrastructure costs will drop significantly.

https://medium.com/kx-systems/the-most-common-vector-search-mistake-is-costing-enterprises-hundreds-of-thousands-dd1ffd0b976d