Welcome to ForzaEmbed’s documentation!

ForzaEmbed is a Python framework for systematically benchmarking text embedding models and processing strategies. It performs an exhaustive grid search across a configurable parameter space to help you find the optimal configuration for your document corpus.
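
As a rough illustration of what an exhaustive grid search involves, the sketch below enumerates every combination in a small parameter space using only the standard library. The parameter names and values are assumptions made up for this example, not ForzaEmbed's actual configuration schema, and `iter_configurations` is a hypothetical helper rather than part of the framework.

```python
from itertools import product

# Hypothetical parameter space: keys and values are illustrative only and
# do not reflect ForzaEmbed's real configuration format.
param_space = {
    "chunk_size": [256, 512],
    "chunk_overlap": [0, 64],
    "chunking_strategy": ["langchain", "nltk"],
    "similarity_metric": ["cosine", "euclidean"],
    "model": ["model-a", "model-b"],
}


def iter_configurations(space):
    """Yield one dict per combination of parameter values."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


configurations = list(iter_configurations(param_space))
print(len(configurations))  # 2 * 2 * 2 * 2 * 2 = 32 combinations
```

The number of runs is the product of the option counts, so each added value multiplies the total. This is one reason caching intermediate results matters for a search of this kind.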

License: MIT

Key Features

  • Automated Grid Search: Test all combinations of chunk sizes, overlap, chunking strategies, similarity metrics, and embedding models

  • Standalone Interactive Visualization: Single-file HTML report with embedded data for visualizing embedding similarities directly on text (no server required)

  • Multiple Embedding Models: Support for FastEmbed, Sentence Transformers, Hugging Face, and API-based models

  • Flexible Chunking: Compare different text segmentation strategies (LangChain, SemChunk, NLTK, spaCy, raw)

  • Comprehensive Metrics: Silhouette analysis with intra/inter-cluster distance decomposition

  • Caching: SQLite-based caching to avoid redundant computations

  • Performance Tracking: Measure and compare embedding computation time across configurations
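
To make the silhouette metric mentioned in the list above concrete, here is a minimal pure-Python sketch of the standard silhouette coefficient, reported alongside its intra-cluster and nearest inter-cluster distance components. It assumes cosine distance and hand-made toy vectors. The function names are hypothetical, and this is not necessarily how ForzaEmbed computes its metrics.

```python
import math


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def silhouette_decomposition(vectors, labels, distance=cosine_distance):
    """Return (mean intra distance, mean nearest inter distance, mean silhouette)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    intra, inter, scores = [], [], []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # By convention, a point alone in its cluster scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(distance(vectors[i], vectors[j]) for j in own) / len(own)
        # b: mean distance to the nearest other cluster
        b = min(
            sum(distance(vectors[i], vectors[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        intra.append(a)
        inter.append(b)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)

    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return mean(intra), mean(inter), mean(scores)


# Toy data: two well-separated groups of 2-D vectors.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
intra_mean, inter_mean, score = silhouette_decomposition(vectors, labels)
print(intra_mean, inter_mean, score)
```

A score near 1 means points sit much closer to their own cluster than to the nearest other one. Reporting the two distance components separately shows whether a change in score comes from tighter clusters or from better separation.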

Quick Start

Installation:

git clone https://github.com/berangerthomas/ForzaEmbed.git
cd ForzaEmbed
uv sync

Basic Usage:

from src.core.core import ForzaEmbed

# Initialize ForzaEmbed
app = ForzaEmbed(
    db_path="reports/my_analysis.db",
    config_path="configs/config.yml"
)

# Run grid search
app.run_grid_search(data_source="markdowns/")

# Generate reports
app.generate_reports(top_n=25)
