
Building a Semantic Search Engine from Scratch: Embeddings, ChromaDB, and the Metadata Trap

Tags: python, chromadb, vector-database, semantic-search, machine-learning

When I set out to build a semantic search engine for 210 environmental datasets from the UK Centre for Ecology & Hydrology (CEH) Environmental Information Data Centre (EIDC) catalogue, I thought the hard part would be the machine learning. Generate some embeddings, throw them in a vector database, query with cosine similarity. Simple, right?

I was wrong. The real challenges lurked in places I never expected: metadata that silently disappeared, similarity scores that made no sense, and first requests that took 30 seconds while users stared at a loading spinner.

This is the first part of a three-part series of deep dives into building production-ready semantic search: the parts that tutorials skip.

The Architecture

Before diving into the problems, here’s what we’re building:

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  User Query     │────▶│ Embedding       │────▶│ Vector Store    │
│  "soil data"    │     │ Service         │     │ (ChromaDB)      │
└─────────────────┘     └─────────────────┘     └────────┬────────┘

                        ┌─────────────────┐              │
                        │ Search Results  │◀─────────────┘
                        │ with Metadata   │
                        └─────────────────┘

I used sentence-transformers with the all-MiniLM-L6-v2 model (384 dimensions, fast, good quality) and ChromaDB for persistent vector storage. Sounds straightforward. It isn’t.
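Before digging into why, a quick sanity check that the model pulls its weight: two related phrases with no keyword overlap should land close together. The phrases here are just illustrative:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Two related phrases that share no keywords
a = model.encode("soil moisture measurements", normalize_embeddings=True)
b = model.encode("water content in the ground", normalize_embeddings=True)

print(util.cos_sim(a, b))  # noticeably higher than an unrelated pair would score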

Problem 1: The 30-Second First Request

Here’s a pattern I see in tutorials everywhere:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')  # Loaded at import time

def embed(text):
    return model.encode(text)

This works fine locally. In production, it’s a major pain. The model loads when your module imports—which happens when your web server starts. But if you’re using a WSGI server with lazy worker spawning, the model loads on the first request. Your user waits 30 seconds while PyTorch initializes.

The fix is lazy loading with explicit pre-warming:

import logging
from typing import Optional

from sentence_transformers import SentenceTransformer

logger = logging.getLogger(__name__)

class EmbeddingService:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model_name = model_name
        self._model: Optional[SentenceTransformer] = None

    @property
    def model(self) -> SentenceTransformer:
        """Lazy load the model on first access."""
        if self._model is None:
            logger.info(f"Loading embedding model: {self.model_name}")
            self._model = SentenceTransformer(self.model_name)
            logger.info(f"Model loaded ({self._model.get_sentence_embedding_dimension()} dims)")
        return self._model

Then in your FastAPI lifespan:

from contextlib import asynccontextmanager

from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Pre-load on startup, not on first request
    logger.info("Pre-loading embedding model...")
    embedding_service = get_embedding_service()
    _ = embedding_service.model  # Trigger the load
    logger.info("Model ready")
    yield
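One detail that's easy to miss: the lifespan handler only runs if you register it when constructing the app.

app = FastAPI(lifespan=lifespan)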

Now the 30-second wait happens at server startup, not when a user is waiting.

Problem 2: The Metadata Trap

ChromaDB lets you store metadata with each document. Perfect for filtering: “find me soil datasets from 2020”, or “land cover data from the UK”. I built my indexer like this:

def index_dataset(self, dataset_id: str, title: str, description: str, metadata: dict):
    embedding = self.embedding_service.embed_document(f"{title}\n\n{description}")

    self.vector_store.add_documents(
        collection="datasets",
        ids=[dataset_id],
        documents=[f"{title}\n\n{description}"],
        embeddings=[embedding],
        metadatas=[metadata],  # Just pass it through, right?
    )

This worked in tests. In production, search results came back with empty metadata. No titles. No dataset IDs. Nothing.

The culprit? ChromaDB only accepts str, int, float, and bool as metadata values. Pass in a None, an empty string, a list, or a datetime? Silent failure. The document is stored, but your metadata vanishes.

Here’s what proper sanitization looks like:

# A method on VectorStore; needs: from datetime import datetime
def _sanitize_metadata(self, meta: dict, doc_id: str) -> dict:
    """
    ChromaDB metadata sanitization.

    This is more complex than it should be because:
    1. ChromaDB silently drops None values
    2. Empty strings cause issues
    3. Lists need conversion to strings
    4. Datetimes need ISO format
    5. Missing required fields break search results
    """
    clean = {}

    # ALWAYS ensure required fields first
    dataset_id = meta.get("dataset_id")
    if dataset_id and str(dataset_id).strip():
        clean["dataset_id"] = str(dataset_id).strip()
    else:
        clean["dataset_id"] = str(doc_id)  # Fallback to document ID

    title = meta.get("title")
    if title and isinstance(title, str) and title.strip():
        clean["title"] = title.strip()
    else:
        clean["title"] = "Untitled"  # Never leave empty

    # Process remaining fields
    for key, value in meta.items():
        if key in ("dataset_id", "title"):
            continue

        # Skip empties
        if value is None:
            continue
        if isinstance(value, str) and not value.strip():
            continue
        if isinstance(value, (list, tuple)) and len(value) == 0:
            continue

        # Convert to allowed types
        if isinstance(value, (str, int, float, bool)):
            clean[key] = value
        elif isinstance(value, datetime):
            clean[key] = value.isoformat()
        elif isinstance(value, (list, tuple)):
            clean[key] = ", ".join(str(v) for v in value)
        else:
            clean[key] = str(value)

    return clean

The key insight: you must guarantee that required fields like dataset_id and title are always present and non-empty. If they’re missing from the input, use fallbacks. If they’re empty strings, use defaults. Never trust incoming data.
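To make that concrete, here's the kind of before-and-after the sanitizer produces. The raw values are invented for illustration, and store is whatever object holds the method:

from datetime import datetime

raw = {
    "dataset_id": None,                 # missing -> falls back to doc_id
    "title": "   ",                     # whitespace-only -> "Untitled"
    "keywords": ["soil", "moisture"],   # list -> "soil, moisture"
    "published": datetime(2020, 5, 1),  # datetime -> ISO string
    "doi": "",                          # empty string -> dropped entirely
}

clean = store._sanitize_metadata(raw, doc_id="eidc-123")
# {'dataset_id': 'eidc-123', 'title': 'Untitled',
#  'keywords': 'soil, moisture', 'published': '2020-05-01T00:00:00'}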

I learned this the hard way after debugging why my search results showed “Unknown” for every title.

Problem 3: Distances vs. Similarity Scores

ChromaDB returns distances, not similarity scores. With the default L2 (Euclidean) distance, lower is better. A distance of 0 means identical vectors.

But users expect similarity scores where higher is better. “This result is 95% relevant” makes sense. “This result has distance 0.47” doesn’t.

The conversion is simple but not obvious:

def search(self, query_embedding: List[float], n_results: int = 10) -> List[SearchResult]:
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results,
    )

    search_results = []
    for i, doc_id in enumerate(results["ids"][0]):
        distance = results["distances"][0][i]

        # Convert distance to similarity (0-1 range)
        score = 1.0 / (1.0 + distance)

        search_results.append(SearchResult(
            id=doc_id,
            distance=distance,
            score=score,  # Now higher is better
        ))

    return search_results

The formula 1 / (1 + distance) gives you:

  • Distance 0 → Score 1.0 (identical)
  • Distance 1 → Score 0.5
  • Distance 9 → Score 0.1
  • Distance ∞ → Score 0.0

For display, multiply by 100 and show as a percentage: “87% match”.
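In code, that's a one-liner:

label = f"{score * 100:.0f}% match"  # 0.87 -> "87% match"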

Problem 4: Embedding Cache Strategy

Generating embeddings isn't expensive in absolute terms (we're talking milliseconds), but it adds up. If users search for "land cover" ten times, why regenerate the embedding each time?

A simple in-memory cache keyed by MD5 hashes is a good solution; it's fast, easy, and memory-efficient:

import hashlib

# EmbeddingResult here is a small dataclass (text, embedding, model, dimensions)

class EmbeddingService:
    def __init__(self, cache_size: int = 1000):
        self._cache: dict = {}
        self.cache_size = cache_size

    def _get_cache_key(self, text: str) -> str:
        return hashlib.md5(text.encode()).hexdigest()

    def embed_text(self, text: str, use_cache: bool = True) -> EmbeddingResult:
        if use_cache:
            cache_key = self._get_cache_key(text)
            if cache_key in self._cache:
                return self._cache[cache_key]

        embedding = self.model.encode(text, normalize_embeddings=True)

        result = EmbeddingResult(
            text=text,
            embedding=embedding.tolist(),
            model=self.model_name,
            dimensions=len(embedding),
        )

        if use_cache and len(self._cache) < self.cache_size:
            self._cache[cache_key] = result

        return result

A few notes:

  • Use normalize_embeddings=True when encoding. Normalized vectors make cosine similarity equivalent to dot product, which is faster.
  • Bound your cache size. Unbounded caches are memory leaks. One catch: the simple version above stops caching entirely once full; see the LRU sketch after this list.
  • MD5 is fine here. We’re not doing cryptography; we just need fast, consistent hashing.
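If your query distribution shifts over time, an LRU cache is a small upgrade over the stop-when-full dict. A minimal sketch using collections.OrderedDict, not the exact code from my service:

from collections import OrderedDict

class LRUCache:
    def __init__(self, max_size: int = 1000):
        self._data: OrderedDict = OrderedDict()
        self.max_size = max_size

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict the least recently used entry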

Problem 5: The Empty Collection Error

What happens when a user searches before any data is indexed? ChromaDB throws an error. The API returns a 500, and the user thinks our app is broken.

Defensive coding:

from fastapi import APIRouter, Depends, HTTPException

router = APIRouter()

@router.get("/search")
async def search_datasets(query: str, pipeline: RAGPipeline = Depends(get_rag_pipeline)):
    try:
        # Check if collection exists and has data
        # (vector_store is a module-level VectorStore instance; the class is shown below)
        collections = vector_store.list_collections()
        if "datasets" not in collections:
            raise HTTPException(
                status_code=503,
                detail="No datasets indexed. Please run: python scripts/index_datasets.py"
            )

        stats = vector_store.get_collection_stats("datasets")
        if stats.get("count", 0) == 0:
            raise HTTPException(
                status_code=503,
                detail="No datasets indexed. Please run: python scripts/index_datasets.py"
            )

        # Now safe to search
        results = pipeline.retrieve(query=query, top_k=10)
        return format_results(results)

    except HTTPException:
        raise
    except Exception as e:
        logger.exception("Search error")
        raise HTTPException(status_code=500, detail=str(e))

The key is returning 503 Service Unavailable with a helpful message, not 500 Internal Server Error. The user knows what to do: run the indexing script.

The Complete Picture

Here’s how all these pieces fit together in the VectorStore class:

import chromadb
from chromadb.config import Settings as ChromaSettings
from typing import List, Optional

class VectorStore:
    DATASETS_COLLECTION = "datasets"

    def __init__(self, persist_directory: Optional[str] = None):
        if persist_directory:
            self.client = chromadb.PersistentClient(
                path=persist_directory,
                settings=ChromaSettings(anonymized_telemetry=False),
            )
        else:
            self.client = chromadb.Client()

    def add_documents(
        self,
        collection: str,
        ids: List[str],
        documents: List[str],
        embeddings: List[List[float]],
        metadatas: Optional[List[dict]] = None,
    ) -> int:
        coll = self.client.get_or_create_collection(name=collection)

        # Sanitize all metadata
        clean_metadatas = [
            self._sanitize_metadata(m or {}, ids[i])
            for i, m in enumerate(metadatas or [{}] * len(ids))
        ]

        coll.add(
            ids=ids,
            documents=documents,
            embeddings=embeddings,
            metadatas=clean_metadatas,
        )

        return len(ids)

    def search(
        self,
        collection: str,
        query_embedding: List[float],
        n_results: int = 10,
        where: Optional[dict] = None,
    ) -> SearchResults:
        coll = self.client.get_or_create_collection(name=collection)

        query_params = {
            "query_embeddings": [query_embedding],
            "n_results": n_results,
        }
        if where:
            query_params["where"] = where

        results = coll.query(**query_params)

        search_results = []
        if results["ids"] and results["ids"][0]:
            for i, doc_id in enumerate(results["ids"][0]):
                distance = results["distances"][0][i]
                score = 1.0 / (1.0 + distance)

                search_results.append(SearchResult(
                    id=doc_id,
                    content=results["documents"][0][i],
                    metadata=results["metadatas"][0][i],
                    distance=distance,
                    score=score,
                ))

        return SearchResults(results=search_results, total=len(search_results))
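And here's where the where parameter pays off: the metadata filtering promised back in Problem 2 works reliably, because the sanitized fields are guaranteed to be present. A usage sketch, assuming your indexer stored a numeric year field:

store = VectorStore(persist_directory="./chroma_data")
query_emb = embedding_service.embed_text("soil data").embedding

results = store.search(
    collection="datasets",
    query_embedding=query_emb,
    n_results=10,
    where={"year": 2020},  # ChromaDB equality filter on the sanitized metadata
)

for r in results.results:
    print(f"{r.metadata['title']} ({r.score * 100:.0f}% match)")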

Lessons Learned

  1. Lazy load, but pre-warm. Don’t load ML models at import time, but do trigger the load at server startup.

  2. Never trust metadata. Sanitize everything. Provide fallbacks. Log warnings when data is weird.

  3. Convert distances to scores. Users understand “95% match” better than “distance 0.1”.

  4. Cache embeddings. It’s easy, it’s fast, and it saves compute.

  5. Fail gracefully. Empty collections, missing models, network errors—handle them all with helpful messages.

The ML part of semantic search is the easy part. Building a system that works reliably in production? That’s where the real engineering happens.


Part 2 of this series will cover Multi-Format ETL Parsing and Metadata Extraction: How to extract unified data from ISO 19115 XML, JSON, RDF Turtle, and Schema.org—using the Template Method pattern, XPath navigation, and aggressive normalization.


Written by Abisoye Alli-Balogun

Full Stack Product Engineer building scalable distributed systems and high-performance applications. Passionate about microservices, cloud architecture, and creating delightful user experiences.
