Lessons Learned in Development: LEANN Project
Here are my thoughts and comments on various research papers and topics in systems and ML, along with practical lessons learned from developing the LEANN project.
Lesson 1: Chunk Overlap Size in RAG
Key Finding
Chunk overlap size really matters in Retrieval-Augmented Generation (RAG) systems.
Effects of Chunk Overlap Size
Too Small Overlap
- ❌ Risks cutting off important context, reducing retrieval accuracy
- ❌ Leads to fragmented chunks, weaker embeddings, and lower generation quality
- ⚠️ Often causes missing or inconsistent answers in downstream tasks like QA
Too Large Overlap
- ❌ Causes high redundancy and increases storage and compute cost
- ❌ May reduce retrieval diversity due to overlapping content dominating top-k results
- ⚠️ Adds latency without proportional gains in performance
Conclusion
Finding the optimal chunk overlap size is crucial for RAG system performance. The overlap should be large enough to preserve important context across chunk boundaries, but not so large that it creates excessive redundancy and computational overhead.
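To get a feel for this trade-off, here is a minimal sketch using LlamaIndex's SentenceSplitter (the same splitter used in Lesson 2 below). The overlap values and sample text are illustrative, not tuned recommendations:

```python
from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter

# Illustrative stand-in corpus; replace with your real documents.
doc = Document(text="Sentence one. Sentence two. Sentence three. " * 100)

# Sweep a few overlap values to see how the chunk count (a proxy for
# redundancy, storage, and embedding cost) grows with overlap.
for overlap in (0, 16, 64, 128):
    splitter = SentenceSplitter(chunk_size=192, chunk_overlap=overlap)
    nodes = splitter.get_nodes_from_documents([doc])
    print(f"chunk_overlap={overlap}: {len(nodes)} chunks")
```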
Lesson 2: Data Format Really Matters
Key Finding
Data format and metadata preservation are critical when chunking documents for RAG systems.
The Problem
After chunking, key information often gets lost because important metadata typically appears only at the beginning of a document or paragraph. When the document is split into chunks, crucial context such as contact names, document sources, or other metadata ends up separated from the content it describes.
The Solution
When splitting documents with LlamaIndex, always attach the important metadata to the beginning of each chunk so that critical information is preserved.
Example Implementation
```python
from llama_index.core.node_parser import SentenceSplitter

text_splitter = SentenceSplitter(chunk_size=192, chunk_overlap=64)

# Convert Documents to text strings and chunk them
all_texts = []
for doc in all_documents:
    # Split the document into chunks
    nodes = text_splitter.get_nodes_from_documents([doc])
    for node in nodes:
        # Always prepend metadata so each chunk carries its own context
        text = (
            f'[Contact] means the message is from: {doc.metadata["contact_name"]}\n'
            + node.get_content()
        )
        all_texts.append(text)
```
Key Takeaway
By making the contact name (or any other important metadata) always appear before the chat history in each chunk, we ensure that the critical information stays attached to the content throughout the retrieval process.
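A possibly cleaner variant (my own sketch, not what LEANN ships): LlamaIndex nodes inherit the parent Document's metadata, and `get_content(metadata_mode=MetadataMode.EMBED)` prepends that metadata to the text used for embedding, so the framework can do the prefixing for you:

```python
from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import MetadataMode

# Hypothetical chat document; the metadata key mirrors the example above.
doc = Document(
    text="Hey, are you going to COLING this year? ...",
    metadata={"contact_name": "Alice"},
)

splitter = SentenceSplitter(chunk_size=192, chunk_overlap=64)
for node in splitter.get_nodes_from_documents([doc]):
    # Prints "contact_name: Alice" ahead of each chunk's text.
    print(node.get_content(metadata_mode=MetadataMode.EMBED))
```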
Lesson 3: OpenAI Embedding Model Performance Issues
Key Finding
OpenAI's embedding model shows surprisingly poor performance compared to older, open-source alternatives in specific retrieval scenarios.
The Problem
When comparing embedding models for retrieval tasks, OpenAI's embedding model consistently fails to identify the correct document even when the intent is clear, while older models like Facebook's Contriever perform correctly.
Evidence from LEANN Project
Reference: Embedding Model Comparison Code
Test Scenario
```python
# Two candidate documents in the index
conference_text = "[Title]: COLING 2025 Conference\n[URL]: https://coling2025.org/"
browser_text = "[Title]: Browser Use Tool\n[URL]: https://github.com/browser-use"

# Two queries with the same intent but different wording
query1 = "Tell me my browser history about some conference i often visit"
query2 = "browser history about conference I often visit"
```
Expected vs Actual Results
- Expected: Both queries should retrieve `conference_text` (the conference-related document)
- Facebook/Contriever (4-year-old model): ✅ Always correct
- OpenAI Embedding: ❌ Always wrong
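For anyone who wants to sanity-check this, here is a rough reproduction sketch, not LEANN's actual benchmark code. It assumes the openai Python SDK (v1+), Hugging Face transformers, and the mean pooling described on the facebook/contriever model card; which OpenAI embedding model LEANN tested is not stated, so `text-embedding-3-small` below is an assumption:

```python
import torch
from openai import OpenAI
from transformers import AutoModel, AutoTokenizer

DOCS = [
    "[Title]: COLING 2025 Conference\n[URL]: https://coling2025.org/",
    "[Title]: Browser Use Tool\n[URL]: https://github.com/browser-use",
]
QUERY = "Tell me my browser history about some conference i often visit"

def contriever_embed(texts):
    # Mean pooling over non-padding tokens, per the facebook/contriever model card.
    tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")
    model = AutoModel.from_pretrained("facebook/contriever")
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    mask = inputs["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

def openai_embed(texts):
    # Model choice is an assumption; reads OPENAI_API_KEY from the environment.
    client = OpenAI()
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return torch.tensor([d.embedding for d in resp.data])

for name, embed in [("contriever", contriever_embed), ("openai", openai_embed)]:
    vecs = embed([QUERY] + DOCS)
    sims = torch.nn.functional.cosine_similarity(vecs[0:1], vecs[1:])
    best = DOCS[int(sims.argmax())].splitlines()[0]
    print(f"{name}: top hit -> {best}")
```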
Key Takeaway
This surprising result suggests that OpenAI's embedding model may not generalize as well as expected to certain retrieval tasks, especially when compared to specialized open-source alternatives.
Credit: This finding was originally discovered by @andylizf
My Experience
Development Journey
Working on the LEANN project has been an eye-opening experience that challenged many of my assumptions about modern AI systems. Here are some key insights from my development journey:
Key Learnings
- Don't assume newer = better: The OpenAI embedding model performance issue was particularly surprising and taught me to always benchmark and validate assumptions.
- Data preprocessing is crucial: The metadata preservation lesson came from hours of debugging why retrieval wasn't working as expected.
- Chunking strategy matters: The overlap size experiments revealed how subtle parameter choices can dramatically impact system performance.
Technical Challenges
- Embedding model selection: Finding the right balance between performance and efficiency
- Data format optimization: Ensuring metadata preservation across different chunking strategies
- System integration: Making different components work together seamlessly
Future Directions
Based on these lessons, I'm exploring:
- More sophisticated chunking strategies
- Alternative embedding models and training approaches
- Better evaluation metrics for retrieval systems
These lessons represent real-world insights from building production RAG systems. The findings may challenge common assumptions in the field.