Lessons1 Learned in Development: Chunk Overlap Size in RAG
Key Finding
Chunk overlap size really matters in Retrieval-Augmented Generation (RAG) systems.
Effects of Chunk Overlap Size
Too Small Overlap
- ❌ Risks cutting off important context, reducing retrieval accuracy
- ❌ Leads to fragmented chunks, weaker embeddings, and lower generation quality
- ⚠️ Often causes missing or inconsistent answers in downstream tasks like QA
Too Large Overlap
- ❌ Causes high redundancy and increases storage and compute cost
- ❌ May reduce retrieval diversity due to overlapping content dominating top-k results
- ⚠️ Adds latency without proportional gains in performance
Conclusion
Finding the optimal chunk overlap size is crucial for RAG system performance. The overlap should be large enough to preserve important context across chunk boundaries, but not so large that it creates excessive redundancy and computational overhead.
Lessons2 Learned: Data Format Really Matters
Key Finding
Data format and metadata preservation are critical when chunking documents for RAG systems.
The Problem
After chunking, key information often gets lost because important metadata may lie at the beginning of paragraphs. When documents are split into chunks, crucial context like contact names, document sources, or other metadata can be separated from the relevant content.
The Solution
When applying splitters using LlamaIndex, always attach important metadata at the beginning of each chunk to ensure critical information is preserved.
Example Implementation
text_splitter = SentenceSplitter(chunk_size=192, chunk_overlap=64)
# Convert Documents to text strings and chunk them
all_texts = []
for doc in all_documents:
    # Split the document into chunks
    nodes = text_splitter.get_nodes_from_documents([doc])
    for node in nodes:
        # Always prepend metadata to preserve important context
        text = '[Contact] means the message is from: ' + doc.metadata["contact_name"] + '\n' + node.get_content()
        all_texts.append(text)
Key Takeaway
By keeping the contact name (or any important metadata) always appearing before the chat history in each chunk, we ensure that the important information is attached and preserved throughout the retrieval process.
Lessons3 Learned: OpenAI Embedding Model Performance Issues
Key Finding
OpenAI's embedding model shows surprisingly poor performance compared to older, open-source alternatives in specific retrieval scenarios.
The Problem
When comparing embedding models for retrieval tasks, OpenAI's embedding model consistently fails to identify the correct document even when the intent is clear, while older models like Facebook's Contriever perform correctly.
Evidence from LEANN Project
Reference: Embedding Model Comparison Code
Test Scenario
conference_text = "[Title]: COLING 2025 Conference\n[URL]: https://coling2025.org/"
browser_text = "[Title]: Browser Use Tool\n[URL]: https://github.com/browser-use"
# Two queries with same intent but different wording
query1 = "Tell me my browser history about some conference i often visit"
query2 = "browser history about conference I often visit"
Expected vs Actual Results
- Expected: Both queries should retrieve conference_text(the conference-related document)
- Facebook/Contriever (4-year-old model): ✅ Always correct
- OpenAI Embedding: ❌ Always wrong
Key Takeaway
This surprising result suggests that OpenAI's embedding model training techniques may not be as advanced as expected for certain retrieval tasks, especially when compared to specialized open-source alternatives.
Credit: This finding was originally discovered by @andylizf