
Enhancing Query Retrieval Precision Through Optimized Embedding Text Selection

What we learned from testing text-embedding-3-large on real user queries

By Alex P. Wang | November 16, 2024 | Tags: Retrieval, Embeddings, Semantic Search

Semantic search matches user queries by meaning, not by keyword. But the model is only half the system. The other half is the text you choose to embed. Get that wrong and even the best embedding model returns weak matches.

This study tests four factors that change retrieval accuracy in practice: normalization, synonyms, typos, and conversational phrasing. The model is text-embedding-3-large. The vector store is ChromaDB. The results are concrete numbers from real queries, not theory.

The model captures meaning. The embedding text decides what meaning is in scope.

The Setup

text-embedding-3-large produces 3072-dimensional vectors and captures nuanced semantic relationships. ChromaDB stores those vectors and runs nearest-neighbor search by cosine distance. A lower distance score means a closer semantic match, and 0.0 means the query and the stored text are identical.

Each test below uses the same pattern. A user query is embedded. ChromaDB returns the nearest stored embedding texts with their distance scores. The scores are what reveal whether the embedding text was chosen well.
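
The pattern above can be sketched without any external service. This is a minimal illustration only, assuming plain Python lists stand in for the 3072-dimensional vectors and a dictionary stands in for the ChromaDB collection. The helper names `cosine_distance` and `nearest` are hypothetical and are not part of either library, and the toy vectors are invented.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Assumes neither vector is all zeros; a real pipeline should guard this.
    return 1.0 - dot / (norm_a * norm_b)

def nearest(query_vec, stored, k=4):
    """Rank stored embedding texts by distance to the query, closest first."""
    scored = [(cosine_distance(query_vec, vec), text) for text, vec in stored.items()]
    return sorted(scored)[:k]

# Toy 2-dimensional vectors, invented for illustration only.
stored = {
    "breakfast time": [1.0, 0.0],
    "wifi access": [0.0, 1.0],
}
for distance, text in nearest([1.0, 0.0], stored):
    print(f"{distance:.3f} for '{text}'")
```

In the real setup the vectors would come from the embedding model and the ranking from the vector store; the sketch only shows how the distance scores in the listings below should be read.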

The Four Factors

Text Normalization

Consistent casing and punctuation across queries and embeddings.

Synonyms and Alternate Phrasing

Coverage for the different ways users say the same thing.

Common Typos

Tolerance for misspellings and mobile keyboard slips.

Conversational Language

Phrasing that mirrors how real users actually ask.

1. Text Normalization

The question is simple. Does lowercasing and stripping punctuation actually move the needle? The results are blunt.

Query: "Wi-Fi"

0.000 for 'Wi-Fi'
0.134 for 'wi-fi'
0.240 for 'WiFi'
0.312 for 'wifi'

Query: "breakfast hours"

0.000 for 'breakfast hours'
0.052 for 'Breakfast Hours'

Query: "breakfast time?"

0.08813 for 'breakfast time'
0.18912 for 'what time is breakfast'
0.20969 for 'breakfast serving time'
0.24073 for 'what are breakfast times'

Query: "breakfast time" (no punctuation)

0.00000 for 'breakfast time'
0.15923 for 'breakfast serving time'
0.17118 for 'what time is breakfast'
0.18837 for 'breakfast hours'

Casing and hyphenation differences moved the distance by up to 0.31 for the same word. Removing the trailing question mark cut the top distance from 0.088 to 0.0. The embedding model treats "Wi-Fi", "WiFi", and "wifi" as different signals. That is noise the system does not need.

Normalize the query. Normalize the stored text. Apply both rules to both sides.
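
One way to apply the same rules to both sides is a single function that every query and every stored text passes through. The sketch below is an assumption about what "normalize" could mean here, not the exact rules used in the tests: it lowercases, drops hyphens so that "Wi-Fi" and "wifi" collapse, replaces other punctuation with a space, and squeezes whitespace. The function name `normalize` is hypothetical.

```python
import re

# Anything that is not a word character or whitespace counts as punctuation.
_PUNCTUATION = re.compile(r"[^\w\s]")

def normalize(text: str) -> str:
    """Lowercase, drop hyphens, strip other punctuation, collapse whitespace."""
    text = text.lower()
    # Assumption: hyphenated and unhyphenated spellings should be identical.
    text = text.replace("-", "")
    text = _PUNCTUATION.sub(" ", text)
    return " ".join(text.split())

for raw in ["Wi-Fi", "WiFi", "Breakfast Hours", "breakfast time?"]:
    print(repr(raw), "->", repr(normalize(raw)))
```

Dropping hyphens is a judgment call. It merges "Wi-Fi" with "wifi", but it also turns "check-in" into "checkin", so the rule is worth testing against the actual vocabulary before adopting it.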

2. Synonyms and Alternate Phrasing

Users do not phrase things one way. They say "breakfast time", "what time is breakfast", "breakfast hours", "when is breakfast". The model handles these well only if the stored embedding texts cover the variation.

Query: "breakfast time"

0.000 for 'breakfast time'
0.242 for 'what time is breakfast'
0.253 for 'breakfast hours'
0.280 for 'when is breakfast'

Query: "internet access"

0.000 for 'internet access'
0.072 for 'internet connection'
0.143 for 'how to get online'

Adding 'breakfast hours' and 'what time is breakfast' alongside 'breakfast time' gave a much tighter score band for the same intent. The trade-off is small: a few extra rows in the vector database and no measurable cost from the redundancy.

Embed the question. Embed the rephrasings of the question. The model rewards coverage.
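
Coverage is easier to maintain when every phrasing points back to one answer. The following is a rough sketch of that bookkeeping, assuming a hypothetical `Row` record and `build_rows` helper; neither is taken from ChromaDB, and the `answer_id` field is an illustrative stand-in for whatever metadata links a stored text to its response.

```python
from dataclasses import dataclass

@dataclass
class Row:
    id: str          # unique id for the vector store row
    text: str        # the embedding text that will be embedded
    answer_id: str   # the answer this phrasing should resolve to

def build_rows(variations: dict[str, list[str]]) -> list[Row]:
    """Expand each answer into one row per distinct phrasing."""
    rows = []
    for answer_id, phrasings in variations.items():
        seen = set()
        for phrasing in phrasings:
            key = phrasing.strip().lower()
            if not key or key in seen:
                continue  # skip blanks and case-only duplicates
            seen.add(key)
            rows.append(Row(id=f"{answer_id}-{len(seen)}", text=key, answer_id=answer_id))
    return rows

rows = build_rows({
    "breakfast": ["breakfast time", "what time is breakfast",
                  "breakfast hours", "when is breakfast"],
})
for row in rows:
    print(row.id, "|", row.text)
```

Whichever phrasing wins the nearest-neighbor search, the shared `answer_id` resolves it to the same response, so the extra rows add coverage without adding extra answers to maintain.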

3. Common Typos

Typos are normal. The question is whether to store them as embedding texts or fix them before search.

Query: "breakfat time" (missing s)

0.073 for 'breakfast time'
0.253 for 'breakfast hours'

Query: "interent access" (transposed letters)

0.110 for 'internet access'
0.297 for 'wifi access'

A typo introduces a small distance penalty. In both tests the correctly spelled text still ranked first (0.073 and 0.110), well ahead of the runner-up (0.253 and 0.297). Storing the misspelled version separately adds rows without improving the match.

Do not embed typos. Preprocess them out of the query instead.
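
A preprocessing step of this kind can be sketched with the Python standard library. Note that this is a stand-in: it uses `difflib.get_close_matches` rather than a dedicated spelling library, and the vocabulary, the 0.8 cutoff, and the `correct_query` name are all assumptions made for illustration.

```python
import difflib

def correct_query(query, vocabulary, cutoff=0.8):
    """Replace each unknown word with its closest vocabulary word, if any."""
    corrected = []
    for word in query.lower().split():
        if word in vocabulary:
            corrected.append(word)
            continue
        # n=1 keeps only the best candidate; cutoff rejects weak matches.
        matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)

# Hypothetical vocabulary, built from the words used in the stored texts.
VOCABULARY = ["breakfast", "time", "hours", "internet", "access", "wifi"]

print(correct_query("breakfat time", VOCABULARY))
print(correct_query("interent access", VOCABULARY))
```

Words with no close match pass through unchanged, so the step never blocks a query. The cutoff is a trade-off: a lower value fixes more typos but also risks rewriting legitimate words that happen to resemble vocabulary entries.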

4. Concise, Conversational Language

Hospitality queries do not look like database keywords. They look like questions a guest would ask out loud.

Query: "how do I connect to wifi?"

0.000 for 'how do I connect to wifi'
0.128 for 'wifi access'

Query: "is there free wifi"

0.00000 for 'is there free wifi'
0.09686 for 'is there wifi'
0.13923 for 'is there wi-fi'
0.17553 for 'is wifi free here'

Embedding texts written as natural questions match natural queries more tightly than technical descriptions of the same content. The form should match the input the system will actually receive.

Recommendations

Normalize Both Sides

Lowercase and strip unnecessary punctuation on user queries and stored embedding texts. Apply the same rule to both.

Embed Variations, Not Just the Canonical Phrase

Add synonyms and alternate phrasings as separate embedding texts. Cover the natural ways users ask.

Fix Typos in the Query Pipeline

Use a spelling correction library such as SymSpell upstream. Do not bloat the index with misspellings.

Write Embedding Texts as Questions

Match the phrasing of real user inputs. Conversational text aligns with conversational queries.

What This Study Did Not Cover

Four areas are worth a separate study: embedding text length and granularity, multilingual queries, domain fine-tuning beyond general-purpose pretrained models, and user-context personalization. Each of these is a meaningful axis for further precision gains, especially in hospitality and other vertical applications.

Final Thoughts

A semantic search system is only as good as the text it embeds. The model does the heavy lifting on meaning, but it cannot fix sloppy inputs or missing coverage. Normalization removes noise. Synonyms expand coverage. Conversational phrasing aligns with how users actually ask. Typos belong in a preprocessing step, not in the vector store.

None of this is exotic. It is design discipline applied to the embedding text itself. The model is the engine. The embedding text is the road.

Pick the model carefully. Pick the embedding text more carefully.
