Alex Wang
Nov 16, 2024
This paper explores strategies to optimize embedding texts for semantic search, focusing on the impact of normalization, synonyms, alternate phrasing, and typo handling. Using text-embedding-3-large with ChromaDB, we analyze test results to provide recommendations for creating embeddings that enhance retrieval accuracy. Key findings demonstrate the importance of text standardization, incorporation of contextual synonyms, and query preprocessing while cautioning against embedding literal typos. These conclusions are supported by detailed test results, illustrating how semantic models handle variations in user input.
Semantic search systems use vector-based representations to match user queries to relevant information based on intent and meaning rather than exact keyword matches. Embedding models such as text-embedding-3-large generate multi-dimensional vectors where similar meanings are represented as closely related points in the vector space. For instance, "breakfast time" and "when is breakfast served" produce vectors that align closely despite differences in phrasing. However, the effectiveness of semantic search depends heavily on how embedding texts are constructed. Factors such as normalization, inclusion of synonyms, and handling of typos can influence retrieval accuracy. This study investigates these factors using a series of tests, providing evidence-based recommendations for embedding text creation.
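The "closeness" of two vectors is typically measured with cosine distance (1 minus cosine similarity), which is also how the distance scores in the tests below are reported. The following sketch illustrates the idea with toy 3-dimensional vectors; the real model produces 3072-dimensional embeddings, and the numbers here are illustrative stand-ins, not actual model output.

```python
from math import sqrt

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity; 0.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy 3-dimensional stand-ins for real 3072-dimensional embeddings:
breakfast_time    = [0.90, 0.10, 0.20]
when_is_breakfast = [0.85, 0.15, 0.25]  # similar meaning -> nearby vector
pool_hours        = [0.10, 0.90, 0.30]  # unrelated meaning -> distant vector

print(cosine_distance(breakfast_time, when_is_breakfast))  # small (close match)
print(cosine_distance(breakfast_time, pool_hours))         # large (poor match)
```

A distance of 0.0 indicates an exact directional match, which is why identical query/embedding pairs in the tests below score 0.0.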
To conduct this study, an embedding model and a vector database were used to enable semantic search and evaluate query retrieval precision. Embedding models convert textual data into high-dimensional vector representations, capturing semantic meaning in a format that allows for effective comparison and matching (Mikolov et al., 2013). A vector database stores these embeddings and performs similarity searches efficiently, aligning user queries with relevant embedding texts based on their vector representations (Chhabra, 2023).

For this study, the text-embedding-3-large model, which generates 3072-dimensional embeddings, was selected. This model is among the most capable at capturing nuanced semantic relationships, making it particularly suitable for tasks requiring high precision and contextual understanding (OpenAI Embeddings). To manage and query these embeddings, we used ChromaDB, a robust vector database designed for efficient storage and retrieval of high-dimensional data (ChromaDB). ChromaDB was chosen for its reliability and support for cosine similarity (Wikipedia), a standard metric for evaluating semantic similarity, ensuring consistent and comparable results.

The study was designed to evaluate how different text selection strategies influence semantic similarity and query retrieval performance. Each test case explores a distinct aspect of embedding text optimization, examining its impact on accuracy and consistency across a variety of user query scenarios. The four areas were chosen for their direct relevance to improving semantic search retrieval accuracy in practical settings. Each represents a distinct aspect of user query variability:
- Text Normalization ensures consistent formatting.
- Synonyms address natural variability in word choice.
- Typos account for errors in user input.
- Natural Conversational Language matches real-world phrasing.
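The end-to-end workflow, adding embedding texts to a collection and ranking them against a query by cosine distance, can be sketched with a toy in-memory stand-in. The `toy_embed` function below is a hypothetical bag-of-characters substitute for text-embedding-3-large (it captures only spelling, not semantics), and `ToyCollection` mimics the add/query shape of a vector database collection; the actual study used ChromaDB.

```python
from math import sqrt

def toy_embed(text):
    """Hypothetical stand-in for text-embedding-3-large: a 26-dimensional
    bag-of-characters vector. Real embeddings capture semantics; this
    only captures letter frequencies, which is enough to show the workflow."""
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord('a')] += 1.0
    return vec

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a)) or 1.0
    nb = sqrt(sum(x * x for x in b)) or 1.0
    return 1.0 - dot / (na * nb)

class ToyCollection:
    """Mimics the add/query shape of a vector database collection."""
    def __init__(self):
        self.docs = []

    def add(self, documents):
        self.docs.extend((d, toy_embed(d)) for d in documents)

    def query(self, query_text, n_results=3):
        qv = toy_embed(query_text)
        ranked = sorted(self.docs, key=lambda item: cosine_distance(qv, item[1]))
        return [(round(cosine_distance(qv, v), 5), d) for d, v in ranked[:n_results]]

col = ToyCollection()
col.add(["breakfast hours", "pool hours", "wifi password"])
print(col.query("breakfast times"))  # "breakfast hours" ranks first
```

The real pipeline follows the same shape: embed each candidate text once at index time, embed the query at search time, and return the candidates with the smallest cosine distance.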
Objective: To determine if text normalization (e.g., converting to lowercase, removing punctuation) improves retrieval accuracy by reducing variability caused by formatting differences.
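One possible normalization pass, not necessarily the exact preprocessing used in the study, is to lowercase, strip punctuation, and collapse whitespace on both the embedding texts and incoming queries so that formatting variants converge to a single form:

```python
import string

def normalize(text):
    """Lowercase, strip punctuation, collapse whitespace -- applied to both
    embedding texts and incoming queries so formatting variants converge."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

print(normalize("Wi-Fi"))             # -> "wifi"
print(normalize("Breakfast Hours!"))  # -> "breakfast hours"
```

Applied consistently, this collapses 'Wi-Fi', 'wi-fi', 'WiFi', and 'wifi' into one canonical string, eliminating the formatting-driven distance spread seen in the results below.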
Test Results:
1. Query: "Wi-Fi"
   - Distance Scores:
     - 0.0 for 'Wi-Fi'
     - 0.134 for 'wi-fi'
     - 0.240 for 'WiFi'
     - 0.312 for 'wifi'
2. Query: "breakfast hours"
   - Distance Scores:
     - 0.0 for 'breakfast hours'
     - 0.052 for 'Breakfast Hours'
3. Query: "breakfast time?"
   - Distance Scores:
     - 0.08813 for 'breakfast time'
     - 0.18912 for 'what time is breakfast'
     - 0.20969 for 'breakfast serving time'
     - 0.24073 for 'what are breakfast times'
4. Query: "breakfast time"
   - Distance Scores:
     - 0.0 for 'breakfast time'
     - 0.15923 for 'breakfast serving time'
     - 0.17118 for 'what time is breakfast'
     - 0.18837 for 'breakfast hours'
     - 0.20495 for 'what are breakfast times'

Objective: To evaluate the impact of including synonyms and alternate phrasings in embedding texts...
Objective: To assess whether literal typo embeddings are necessary...
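If literal typo embeddings are unnecessary, the alternative is to correct typos at query time before embedding. A minimal sketch using Python's standard-library difflib is shown below; the vocabulary list and similarity cutoff are illustrative assumptions, not values taken from the study.

```python
import difflib

# Hypothetical vocabulary drawn from the embedded corpus; correcting typos
# at query time avoids polluting the index with literal misspellings.
VOCAB = ["breakfast", "hours", "wifi", "password", "pool", "checkout"]

def correct_query(query):
    """Replace each token with its closest vocabulary match, if one exists."""
    corrected = []
    for token in query.lower().split():
        matches = difflib.get_close_matches(token, VOCAB, n=1, cutoff=0.75)
        corrected.append(matches[0] if matches else token)
    return " ".join(corrected)

print(correct_query("brekfast hors"))  # -> "breakfast hours"
```

This keeps the embedding index clean (one canonical text per concept) while still serving users whose queries contain misspellings.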
Objective: To evaluate the effectiveness of conversational language phrasing in embeddings...
This study demonstrates that optimizing embedding text selection through normalization, synonyms, typo preprocessing, and natural language phrasing significantly enhances semantic search performance...
Future work should explore advanced areas, such as multilingual query handling, domain-specific fine-tuning, and dynamic embedding strategies...