Grid Search Optimization
ForzaEmbed implements a smart grid search optimization that reduces computation time by avoiding redundant calculations.
Problem Statement
In a naive grid search approach, all chunking strategies would be tested with all combinations of chunk_size and chunk_overlap parameters. However, some chunking strategies completely ignore these parameters because they use linguistic sentence boundaries instead of fixed sizes.
Testing these parameter-insensitive strategies with different chunk sizes produces identical results, wasting valuable computation time and storage space.
Chunking Strategy Classification
ForzaEmbed classifies chunking strategies into two categories:
Parameter-Sensitive Strategies
These strategies use chunk_size and chunk_overlap parameters:
- langchain
Uses RecursiveCharacterTextSplitter with precise size control
Respects exact chunk_size limits
Implements chunk_overlap for context preservation
Splits recursively on separators while maintaining size constraints
- semchunk
Semantic chunking with size limits
Uses chunk_size as maximum limit
Maintains semantic coherence within size constraints
Produces different results with different sizes
- raw
Character-based chunking with overlap
Pure character-based splitting at chunk_size
Implements sliding window with chunk_overlap
Produces exact-sized chunks
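To make the parameter dependence concrete, here is a sketch of the three strategies in code. The RecursiveCharacterTextSplitter import follows current langchain-text-splitters releases, the chunkerify call follows semchunk's documented API, and raw_chunks is an illustrative stand-in for the raw strategy, not ForzaEmbed's actual implementation:

import semchunk
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = "ForzaEmbed compares embedding pipelines on real documents. " * 40

# langchain: recursive splitting with a hard size limit and overlap
splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=25)
langchain_chunks = splitter.split_text(text)

# semchunk: semantic chunking capped at chunk_size tokens
# (a whitespace word counter is used here purely for illustration)
chunker = semchunk.chunkerify(lambda s: len(s.split()), chunk_size=100)
semchunk_chunks = chunker(text)

# raw: pure character-based sliding window producing exact-sized chunks
def raw_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    step = chunk_size - chunk_overlap  # valid only when chunk_size > chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

raw_result = raw_chunks(text, chunk_size=100, chunk_overlap=25)
# Changing chunk_size or chunk_overlap changes all three outputs,
# so every valid (size, overlap) pair must be tested for these strategies.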
Parameter-Insensitive Strategies
These strategies ignore chunk_size and chunk_overlap:
- nltk
Uses nltk.sent_tokenize() for sentence segmentation
Splits on sentence boundaries using linguistic rules
Ignores chunk_size and chunk_overlap completely
Results depend only on text content and language
- spacy
Uses spaCy’s sentence segmentation (doc.sents)
Splits using machine learning-based sentence detection
Ignores chunk_size and chunk_overlap completely
Results depend on linguistic model, not size parameters
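Both behaviors are easy to verify against the libraries directly (NLTK's punkt sentence model and spaCy's en_core_web_sm pipeline must be installed; the sample text is illustrative):

import nltk
import spacy
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # one-time download of the sentence model
text = "The museum opens at nine. It is closed on Sundays. Call us for details."

# nltk: rule-based sentence boundaries, no size parameters anywhere
nltk_chunks = sent_tokenize(text)

# spacy: model-based sentence boundaries, again no size parameters
nlp = spacy.load("en_core_web_sm")
spacy_chunks = [sent.text for sent in nlp(text).sents]

# Both outputs depend only on the text and the language model, so any
# chunk_size/chunk_overlap values in the grid would produce identical results.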
Optimization Strategy
The smart grid search optimization works as follows:
Strategy Classification
Separates chunking strategies into sensitive and insensitive groups
Combination Generation
For parameter-sensitive strategies: Generate all valid combinations of chunk_size × chunk_overlap
For parameter-insensitive strategies: Generate a single combination, using the first chunk_size and chunk_overlap values as placeholders
Validation
Ensures chunk_size > chunk_overlap for all sensitive strategy combinations
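This check is a one-line filter over the size/overlap grid; using the values from the example configuration below:

sizes = [10, 20, 50, 100, 250, 500, 1000]
overlaps = [0, 5, 10, 25, 50, 100, 200]

valid_pairs = [(s, o) for s in sizes for o in overlaps if s > o]
len(valid_pairs)  # 35 of the 49 raw pairs survive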
Performance Impact
Example Configuration
Consider a typical configuration:
grid_search_params:
  chunk_size: [10, 20, 50, 100, 250, 500, 1000]  # 7 values
  chunk_overlap: [0, 5, 10, 25, 50, 100, 200]    # 7 values
  chunking_strategy: ["langchain", "raw", "semchunk", "nltk", "spacy"]
  similarity_metrics: ["cosine", "euclidean", "manhattan", "dot_product", "chebyshev"]
  themes:
    horaires: ["opening hours"]
    schedule: ["schedule", "timetable"]
    closures: ["closed on"]
models_to_test:
  - type: "fastembed"
    name: "BAAI/bge-small-en-v1.5"
    dimensions: 384
  # ... 5 more models
Results
Naive Approach (without optimization):
Valid chunk pairs: 35 (of 7 × 7 = 49, after filtering for chunk_size > chunk_overlap)
Total combinations: 6 models × 35 pairs × 5 strategies × 5 metrics × 3 themes
Result: 15,750 combinations
Smart Approach (with optimization):
Parameter-sensitive (langchain, raw, semchunk):
6 models × 35 pairs × 3 strategies × 5 metrics × 3 themes = 9,450
Parameter-insensitive (nltk, spacy):
6 models × 1 pair × 2 strategies × 5 metrics × 3 themes = 180
Total: 9,630 combinations
Optimization Results:
Combinations eliminated: 6,120
Reduction: 38.9%
Speedup: 1.64x
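The figures above can be reproduced with a few lines of arithmetic (values taken from the example configuration):

models, metrics, themes = 6, 5, 3
valid_pairs = 35               # from the validation step
sensitive, insensitive = 3, 2  # strategy counts

naive = models * valid_pairs * (sensitive + insensitive) * metrics * themes
smart = (models * valid_pairs * sensitive * metrics * themes
         + models * 1 * insensitive * metrics * themes)

print(naive, smart)                # 15750 9630
print(f"{1 - smart / naive:.1%}")  # 38.9%
print(f"{naive / smart:.2f}x")     # 1.64x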
Per-Strategy Savings
For each parameter-insensitive strategy:
Without optimization: 35 chunk configurations tested
With optimization: 1 chunk configuration tested
Savings: 34 configurations per strategy
With 6 models, 5 metrics, and 3 themes:
Savings per insensitive strategy: (35 - 1) × 6 × 5 × 3 = 3,060 combinations
Implementation Details
The optimization is implemented in src/core/core.py. The snippet below reproduces its structure; the combination-building body is a sketch consistent with the behavior documented above:
from itertools import product

# Define parameter-insensitive strategies
PARAMETER_INSENSITIVE_STRATEGIES = {"nltk", "spacy"}


def _generate_smart_combinations(self, param_grid: dict) -> list:
    """
    Generates parameter combinations intelligently by avoiding
    redundant combinations for chunking strategies that don't
    use chunk_size/chunk_overlap.
    """
    strategies = param_grid["chunking_strategy"]
    sizes = param_grid["chunk_size"]
    overlaps = param_grid["chunk_overlap"]

    # Separate strategies
    sensitive_strategies = [
        s for s in strategies
        if s not in PARAMETER_INSENSITIVE_STRATEGIES
    ]
    insensitive_strategies = [
        s for s in strategies
        if s in PARAMETER_INSENSITIVE_STRATEGIES
    ]

    combinations = []

    # Generate all valid combinations for sensitive strategies
    for strategy, size, overlap in product(sensitive_strategies, sizes, overlaps):
        if size > overlap:  # validation: overlap must be smaller than size
            combinations.append((strategy, size, overlap))

    # Generate only one combination for insensitive strategies,
    # using the first size/overlap values as placeholders
    for strategy in insensitive_strategies:
        combinations.append((strategy, sizes[0], overlaps[0]))

    return combinations
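For illustration, calling the method on a grid shaped like the example configuration (searcher here is a hypothetical instance of the class that owns the method):

param_grid = {
    "chunking_strategy": ["langchain", "raw", "semchunk", "nltk", "spacy"],
    "chunk_size": [10, 20, 50, 100, 250, 500, 1000],
    "chunk_overlap": [0, 5, 10, 25, 50, 100, 200],
}
combos = searcher._generate_smart_combinations(param_grid)
len(combos)  # 35 pairs x 3 sensitive strategies + 1 pair x 2 insensitive = 107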
Automatic Detection
The optimization is completely automatic. No changes to your configuration files or API calls are required.
When running a grid search, you’ll see logging output like:
Smart combination generation:
3 parameter-sensitive strategies (langchain, raw, semchunk),
2 parameter-insensitive strategies (nltk, spacy)
Transparency and Verification
To verify the optimization and understand the savings for your specific configuration, run:
python demo_smart_optimization.py
This script displays:
Strategy classification
Combination counts (naive vs optimized)
Reduction percentage and speedup factor
Detailed per-strategy breakdown
Best Practices
Maximize Optimization Benefits
To maximize the benefits of this optimization:
- Include both types of strategies in your configuration: mix parameter-sensitive (langchain, semchunk, raw) with parameter-insensitive (nltk, spacy)
- Use multiple chunk_size and chunk_overlap values: the more values you test, the greater the savings for insensitive strategies
- Test many configurations: with multiple models, metrics, and themes, the multiplicative effect increases savings
When to Use Each Strategy Type
Use parameter-sensitive strategies when:
You need precise control over chunk size
Working with structured data or code
Chunk size significantly impacts your use case
You want to test different granularities
Use parameter-insensitive strategies when:
Natural sentence boundaries are important
Working with narrative or conversational text
Linguistic coherence is a priority
You want grammatically complete chunks