# Configuration Guide
ForzaEmbed uses YAML configuration files to define grid search parameters and settings.
## Configuration File Structure

A typical configuration file has four main sections, plus a few top-level general settings:

```yaml
grid_search_params:
  # Parameters to test in grid search

models_to_test:
  # Embedding models to evaluate

# General settings
similarity_threshold: 0.6
output_dir: "reports"

database:
  # Database optimization settings

multiprocessing:
  # Performance tuning
```
## Grid Search Parameters

### chunk_size

List of chunk sizes (in characters) to test.

```yaml
chunk_size: [100, 250, 500, 1000]
```

- Smaller values (50-200): Better for fine-grained analysis, more chunks
- Medium values (200-500): Balanced approach
- Larger values (500-2000): Capture more context, fewer chunks

### chunk_overlap

Overlap between consecutive chunks (prevents splitting related content).

```yaml
chunk_overlap: [0, 10, 25, 50]
```

- 0: No overlap
- 10-25: Recommended for most cases
- 50+: High overlap, useful for ensuring context continuity
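To build intuition for how `chunk_size` and `chunk_overlap` interact, here is a minimal fixed-size character chunker (a sketch for illustration, not ForzaEmbed's actual implementation):

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character chunks; each chunk repeats
    the last chunk_overlap characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij', 'j']
```

With overlap 1, each chunk starts on the last character of the previous one, so content spanning a boundary is never completely split.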
### chunking_strategy

Method used to split text into chunks.

```yaml
chunking_strategy: ["langchain", "semchunk", "nltk", "spacy", "raw"]
```

Available strategies:

- `langchain`: Recursive character-based splitting
- `semchunk`: Semantic chunking that respects sentence boundaries
- `nltk`: Sentence tokenization using NLTK
- `spacy`: Advanced NLP-based segmentation
- `raw`: Simple character-based splitting
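These strategies wrap external libraries (LangChain, semchunk, NLTK, spaCy). As a rough intuition for the recursive idea behind the `langchain` strategy, here is a simplified pure-Python sketch; real recursive splitters also merge small pieces back up toward `chunk_size`, which is omitted here:

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator first, recursing with finer
    separators into any piece still larger than chunk_size (sketch)."""
    if len(text) <= chunk_size:
        return [text]
    sep, rest = separators[0], separators[1:] or separators
    pieces = text.split(sep) if sep else list(text)
    chunks = []
    for piece in pieces:
        if len(piece) <= chunk_size:
            chunks.append(piece)
        else:
            chunks.extend(recursive_split(piece, chunk_size, rest))
    return chunks

print(recursive_split("first paragraph\n\nsecond one", 15))
# ['first paragraph', 'second one']
```

The appeal of the recursive approach is that it prefers natural boundaries (paragraphs, then lines, then words) and only falls back to raw character splitting as a last resort.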
### similarity_metrics

Distance/similarity metrics for comparing embeddings.

```yaml
similarity_metrics: ["cosine", "dot_product", "euclidean", "manhattan", "chebyshev"]
```

- `cosine`: Measures angle between vectors (default, normalized)
- `dot_product`: Combines angle and magnitude
- `euclidean`: Straight-line distance
- `manhattan`: Sum of absolute differences
- `chebyshev`: Maximum difference along any dimension
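For reference, the five metrics can be written in a few lines of plain Python (illustrative definitions only; ForzaEmbed computes these internally):

```python
import math

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Angle-only similarity: dot product of the normalized vectors.
    return dot_product(a, b) / (math.hypot(*a) * math.hypot(*b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

a, b = [1.0, 0.0], [0.0, 1.0]
print(cosine(a, b))     # 0.0 (orthogonal vectors)
print(manhattan(a, b))  # 2.0
```

Note that cosine ignores vector length entirely, while dot product rewards longer vectors; this is why cosine is the usual default for embeddings of varying norm.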
### themes

Define semantic themes for filtering relevant chunks.

```yaml
themes:
  schedule:
    - "opening hours"
    - "schedule"
    - "Monday to Friday"
    - "closed on weekends"
  location:
    - "address"
    - "located at"
    - "you can find us"
```

Each theme is a list of keywords/phrases that define what to look for.
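One plausible shape for theme-based filtering (a hypothetical sketch — ForzaEmbed's actual pipeline may differ): embed each phrase, and keep a chunk when its similarity to any phrase in the theme clears the configured threshold. The `embed` function below is a toy letter-frequency stand-in for a real embedding model:

```python
import math

def embed(text: str) -> list[float]:
    # Toy stand-in for a real embedding model: letter-frequency vector.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def matches_theme(chunk: str, phrases: list[str], threshold: float) -> bool:
    # Keep the chunk if any theme phrase is similar enough to it.
    chunk_vec = embed(chunk)
    return any(cosine(chunk_vec, embed(p)) >= threshold for p in phrases)

schedule = ["opening hours", "schedule"]
print(matches_theme("Our opening hours are 9 to 17", schedule, 0.6))  # True
```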
## Models Configuration

### FastEmbed Models

```yaml
- type: "fastembed"
  name: "BAAI/bge-small-en-v1.5"
  dimensions: 384
```

Popular FastEmbed models:

- `BAAI/bge-small-en-v1.5` (384D): Fast, good quality
- `BAAI/bge-base-en-v1.5` (768D): Better quality, slower
- `nomic-ai/nomic-embed-text-v1.5` (768D): Strong performance
### Sentence Transformers

```yaml
- type: "sentence_transformers"
  name: "all-MiniLM-L6-v2"
  dimensions: 384
```

Popular models:

- `all-MiniLM-L6-v2` (384D): Fast and efficient
- `all-mpnet-base-v2` (768D): High quality
- `paraphrase-multilingual-mpnet-base-v2` (768D): Multilingual
### Hugging Face Transformers

```yaml
- type: "transformers"
  name: "jinaai/jina-embeddings-v3"
  dimensions: 1024
```
### API-based Models

```yaml
- type: "api"
  name: "text-embedding-3-small"
  dimensions: 1536
  base_url: "https://api.openai.com/v1"
  timeout: 30
```

Requires setting the API key as an environment variable:

```shell
export OPENAI_API_KEY="your-api-key"
```
## General Settings

### similarity_threshold

Threshold for classifying chunks as "similar" or "different".

```yaml
similarity_threshold: 0.6
```

- Values range from 0.0 to 1.0
- Higher values = stricter filtering
- Affects t-SNE visualization coloring

### output_dir

Directory for saving reports and databases.

```yaml
output_dir: "reports"
```

### generate_filtered_markdowns

Generate filtered markdown files containing only relevant chunks.

```yaml
generate_filtered_markdowns: false
```
## Database Settings

### intelligent_quantization

Compress data to reduce database size.

```yaml
database:
  intelligent_quantization: true
```

- `true`: Compress floats to 16-bit, reduce storage by ~50%
- `false`: Full 64-bit precision
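The size effect of float quantization in general can be illustrated with the standard `struct` module (a sketch of the idea, not ForzaEmbed's actual storage code), packing the same vector as 64-bit doubles versus 16-bit half-precision floats:

```python
import struct

vector = [0.12, -0.98, 0.45, 0.07]

full = struct.pack(f"{len(vector)}d", *vector)  # 64-bit doubles
half = struct.pack(f"{len(vector)}e", *vector)  # 16-bit half-precision floats

print(len(full), len(half))  # 32 8

# Quantization trades a small rounding error for the size reduction:
restored = struct.unpack(f"{len(vector)}e", half)
print(restored[0])  # close to 0.12, but not exact
```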
## Performance Settings

```yaml
multiprocessing:
  max_workers_api: 16
  max_workers_local: null  # Auto-detect CPU cores
  maxtasksperchild: 10
  embedding_batch_size_api: 100
  embedding_batch_size_local: 500
  file_batch_size: 50
  api_batch_sizes:
    mistral: 50
    openai: 100
    voyage: 100
    default: 100
```

- `max_workers_api`: Number of parallel API calls
- `max_workers_local`: Number of parallel local computations
- `maxtasksperchild`: Restart each worker after N tasks (memory management)
- `embedding_batch_size_*`: Batch size for embedding computation
- `file_batch_size`: Number of files processed in parallel
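The batch-size settings control how many items are grouped per task; the grouping itself can be sketched with a small stdlib helper (illustrative only). A setting like `maxtasksperchild` corresponds to the parameter of the same name on Python's `multiprocessing.Pool`, which recycles a worker process after it has handled that many tasks:

```python
from itertools import islice

def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    it = iter(items)
    while batch := list(islice(it, batch_size)):
        yield batch

print(list(batched(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```

Larger batches amortize per-call overhead (especially for remote APIs); smaller batches keep memory per worker bounded.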
## Complete Example

```yaml
grid_search_params:
  chunk_size: [100, 250, 500]
  chunk_overlap: [0, 10, 25]
  chunking_strategy: ["langchain", "semchunk"]
  similarity_metrics: ["cosine", "dot_product"]
  themes:
    hours:
      - "opening hours"
      - "schedule"
      - "Monday"
      - "closed"

models_to_test:
  - type: "fastembed"
    name: "BAAI/bge-small-en-v1.5"
    dimensions: 384

similarity_threshold: 0.6
output_dir: "reports"
generate_filtered_markdowns: false

database:
  intelligent_quantization: true

multiprocessing:
  max_workers_api: 8
  max_workers_local: null
  maxtasksperchild: 10
  embedding_batch_size_api: 100
  embedding_batch_size_local: 500
  file_batch_size: 50
  api_batch_sizes:
    default: 100
```
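Assuming the grid search evaluates the full Cartesian product of the parameter lists (an assumption for illustration), the example above yields 3 × 3 × 2 × 2 = 36 parameter combinations per model, which can be enumerated with `itertools.product`:

```python
from itertools import product

grid = {
    "chunk_size": [100, 250, 500],
    "chunk_overlap": [0, 10, 25],
    "chunking_strategy": ["langchain", "semchunk"],
    "similarity_metrics": ["cosine", "dot_product"],
}

# One dict per run: every combination of one value from each list.
combos = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(combos))  # 36
print(combos[0])    # first combination: smallest size, no overlap, ...
```

This is worth keeping in mind when widening the grid: run count grows multiplicatively with every value added to any list.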