Core Module

The core module contains the main orchestration logic for ForzaEmbed.

ForzaEmbed Class

class src.core.core.ForzaEmbed(db_path='reports/config_ForzaEmbed.db', config_path='configs/config.yml')

Bases: object

Main orchestrator for the embedding analysis and reporting pipeline.

This class manages the complete workflow for embedding analysis, including loading configurations, running grid searches across multiple parameter combinations, and generating comprehensive reports.

db_path

Path to the SQLite database file.

config_path

Path to the YAML configuration file.

config

The loaded application configuration.

config_name

Name derived from the configuration file.

db

The embedding database instance.

output_dir

Directory for output files.

processor

The data processor instance.

report_generator

The report generator instance.

__init__(db_path='reports/config_ForzaEmbed.db', config_path='configs/config.yml')

Initialize the ForzaEmbed instance.

Parameters:
  • db_path (str) – Path to the SQLite database file for storing results.

  • config_path (str) – Path to the YAML configuration file.

Run the complete grid search pipeline.

Executes the embedding analysis across all parameter combinations defined in the configuration. Supports resumption from the last completed combination.

Parameters:
  • data_source (str | Path | list[str]) – The source of markdown data. Can be a directory path (str or Path) or a list of markdown content strings.

  • resume (bool) – If True, resumes from the last completed combination.
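
The grid search and resumption behaviour described above can be illustrated with a small standalone sketch. It does not use ForzaEmbed itself: the grid values, the model names, and the helpers `iter_combinations`, `combination_key`, and `pending_combinations` are all hypothetical, and treating "resume" as "skip every combination already recorded as completed" is an assumption about how the real pipeline tracks progress.

```python
from itertools import product

# Hypothetical grid values; in ForzaEmbed the real ones come from the YAML config.
grid = {
    "chunk_size": [256, 512],
    "chunk_overlap": [0, 32],
    "chunking_strategy": ["recursive"],
    "similarity_metrics": ["cosine", "dot"],
}
models = ["model-a", "model-b"]  # placeholder model names


def iter_combinations(grid, models):
    """Yield one dict per parameter combination (Cartesian product)."""
    keys = list(grid)
    for model in models:
        for values in product(*(grid[k] for k in keys)):
            yield {"model": model, **dict(zip(keys, values))}


def combination_key(combo):
    """Build a stable string key so finished runs can be recognized later."""
    return "|".join(f"{k}={combo[k]}" for k in sorted(combo))


def pending_combinations(grid, models, completed, resume=True):
    """Yield combinations still to run; skip recorded ones when resume is True."""
    for combo in iter_combinations(grid, models):
        if resume and combination_key(combo) in completed:
            continue
        yield combo


all_combos = list(iter_combinations(grid, models))
# Pretend the first three combinations finished in an earlier run.
completed = {combination_key(c) for c in all_combos[:3]}
remaining = list(pending_combinations(grid, models, completed, resume=True))
print(len(all_combos), len(remaining))
```

With `resume=False` the same helper yields every combination again, which mirrors starting the grid search from scratch.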

generate_reports(top_n=25, single_file=False)

Generate all reports and visualizations.

Creates comprehensive reports from the data stored in the database, including metric comparisons, charts, and interactive visualizations.

Parameters:
  • top_n (int) – Number of top combinations to include in reports. Use -1 to include all combinations.

  • single_file (bool) – If True, generates a single HTML file containing all results. If False, generates separate files per input.
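
As a rough illustration of the top_n convention, the sketch below ranks a list of results and treats -1 as "keep everything". The function `select_top` and the `score` field are hypothetical; how ForzaEmbed actually ranks combinations is not stated here.

```python
def select_top(results, top_n=25, score_key="score"):
    """Return the best-scoring results; top_n == -1 keeps all of them."""
    if top_n < -1:
        raise ValueError("top_n must be -1 or a non-negative integer")
    # Highest score first; "score" is an assumed field name for this sketch.
    ranked = sorted(results, key=lambda r: r[score_key], reverse=True)
    if top_n == -1:
        return ranked
    return ranked[:top_n]


# Made-up results purely for demonstration.
results = [
    {"name": "combo-a", "score": 0.1},
    {"name": "combo-b", "score": 0.9},
    {"name": "combo-c", "score": 0.5},
]
print([r["name"] for r in select_top(results, top_n=2)])
print(len(select_top(results, top_n=-1)))
```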

Processor Class

class src.core.processing.Processor(db, config)

Bases: object

Handle core data processing logic for embedding analysis.

This class orchestrates the processing pipeline for a single test run, delegating specific tasks to specialized services for embedding generation, similarity calculation, and visualization.

db

The embedding database instance.

config

The application configuration.

embedding_service

Service for embedding generation and caching.

similarity_service

Service for similarity calculations.

visualization_service

Service for t-SNE visualization.

__init__(db, config)

Initialize the Processor.

Parameters:
  • db (EmbeddingDatabase) – The embedding database instance for storing results.

  • config (AppConfig) – The application configuration.

run_test(rows, model_config, chunk_size, chunk_overlap, themes, theme_name, chunking_strategy, similarity_metric, processed_files, pbar)

Process a test run for a specific parameter combination.

Handles the complete workflow for processing files including embedding generation, similarity calculation, and metric evaluation.

Parameters:
  • rows (list[tuple[str, str]]) – List of (name, content) tuples for files to process.

  • model_config (ModelConfig) – The model configuration to use.

  • chunk_size (int) – Size of text chunks in characters.

  • chunk_overlap (int) – Overlap between chunks in characters.

  • themes (list[str]) – List of theme keywords to compare against.

  • theme_name (str) – Name identifier for the theme set.

  • chunking_strategy (str) – The chunking strategy to use.

  • similarity_metric (str) – The similarity metric to use.

  • processed_files (list[str]) – List of file names already processed.

  • pbar (tqdm) – Progress bar object for status updates.

Returns:

Dictionary containing processing results with file data and metrics.

Return type:

dict[str, Any]
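
To make the chunk_size and chunk_overlap parameters concrete, here is a minimal sketch of fixed-size character chunking with overlap. The function `chunk_text` is hypothetical and only one possible strategy; the chunking strategies ForzaEmbed actually supports are selected through chunking_strategy and may behave differently.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into chunks of chunk_size characters.

    Consecutive chunks share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text,
        # so no trailing chunk is fully contained in the previous one.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
```

In this sketch each chunk starts `chunk_size - chunk_overlap` characters after the previous one, so a larger overlap produces more chunks for the same text.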

Configuration

Configuration management for ForzaEmbed.

This module defines Pydantic models for application configuration and provides functions to load and validate YAML configuration files. It handles all configuration aspects including grid search parameters, model settings, database options, and multiprocessing settings.

Example

Load a configuration file:

from src.core.config import load_config

config = load_config("configs/config.yml")
print(config.models_to_test)

class src.core.config.GridSearchParams(*, chunk_size, chunk_overlap, chunking_strategy, similarity_metrics, themes)

Bases: BaseModel

Configuration for grid search parameters.

chunk_size

List of chunk sizes to test (in characters).

Type:

List[int]

chunk_overlap

List of chunk overlaps to test (in characters).

Type:

List[int]

chunking_strategy

List of chunking strategies to evaluate.

Type:

List[str]

similarity_metrics

List of similarity metrics to use.

Type:

List[str]

themes

Mapping of theme names to lists of theme keywords.

Type:

Dict[str, List[str]]

chunk_size: List[int]
chunk_overlap: List[int]
chunking_strategy: List[str]
similarity_metrics: List[str]
themes: Dict[str, List[str]]
model_config = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class src.core.config.ModelConfig(*, type, name, dimensions, base_url=None, timeout=None)

Bases: BaseModel

Configuration for an embedding model.

type

The type of model (e.g., 'api', 'fastembed', 'sentence_transformers').

Type:

str

name

The model name or identifier.

Type:

str

dimensions

The embedding dimension of the model.

Type:

int

base_url

Optional base URL for API-based models.

Type:

str | None

timeout

Optional request timeout in seconds for API models.

Type:

int | None

type: str
name: str
dimensions: int
base_url: str | None
timeout: int | None
model_config = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class src.core.config.DatabaseSettings(*, intelligent_quantization)

Bases: BaseModel

Configuration for database settings.

intelligent_quantization

Whether to enable intelligent quantization for reducing storage size.

Type:

bool

intelligent_quantization: bool
model_config = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class src.core.config.MultiprocessingSettings(*, max_workers_api, max_workers_local=None, maxtasksperchild, embedding_batch_size_api, embedding_batch_size_local, file_batch_size, api_batch_sizes)

Bases: BaseModel

Configuration for multiprocessing settings.

max_workers_api

Maximum number of workers for API-based embedding calls.

Type:

int

max_workers_local

Optional maximum workers for local model inference.

Type:

int | None

maxtasksperchild

Maximum tasks per worker child before respawning.

Type:

int

embedding_batch_size_api

Batch size for API embedding requests.

Type:

int

embedding_batch_size_local

Batch size for local model embedding.

Type:

int

file_batch_size

Number of files to process per batch.

Type:

int

api_batch_sizes

Mapping of provider names to their specific batch sizes.

Type:

Dict[str, int]

max_workers_api: int
max_workers_local: int | None
maxtasksperchild: int
embedding_batch_size_api: int
embedding_batch_size_local: int
file_batch_size: int
api_batch_sizes: Dict[str, int]
model_config = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
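
One way the batch-size settings could interact is sketched below: a provider-specific entry in api_batch_sizes takes precedence, and embedding_batch_size_api acts as the fallback. This precedence is an assumption, not documented behaviour, and the helpers `resolve_batch_size` and `batched` as well as the provider name are hypothetical.

```python
def resolve_batch_size(provider, api_batch_sizes, default_batch_size):
    """Pick the provider-specific batch size, else the general API default.

    The fallback order is an assumption made for this sketch.
    """
    return api_batch_sizes.get(provider, default_batch_size)


def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


api_batch_sizes = {"example_provider": 2}  # placeholder provider name
embedding_batch_size_api = 3

texts = ["t0", "t1", "t2", "t3", "t4"]
size = resolve_batch_size("example_provider", api_batch_sizes, embedding_batch_size_api)
print(list(batched(texts, size)))
```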

class src.core.config.AppConfig(*, grid_search_params, models_to_test, similarity_threshold, output_dir, generate_filtered_markdowns=False, database, multiprocessing)

Bases: BaseModel

Main application configuration.

grid_search_params

Configuration for grid search parameters.

Type:

src.core.config.GridSearchParams

models_to_test

List of model configurations to evaluate.

Type:

List[src.core.config.ModelConfig]

similarity_threshold

Threshold for similarity-based filtering.

Type:

float

output_dir

Directory path for output files.

Type:

str

generate_filtered_markdowns

Whether to generate filtered markdown files.

Type:

bool

database

Database-related settings.

Type:

src.core.config.DatabaseSettings

multiprocessing

Multiprocessing-related settings.

Type:

src.core.config.MultiprocessingSettings

grid_search_params: GridSearchParams
models_to_test: List[ModelConfig]
similarity_threshold: float
output_dir: str
generate_filtered_markdowns: bool
database: DatabaseSettings
multiprocessing: MultiprocessingSettings
model_config = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
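
For orientation, the nested structure implied by the models above can be written out as a plain Python dictionary, roughly what a parsed YAML file would need to look like. Every value here is a made-up placeholder (strategy, metric, model, and provider names included), and the `missing_keys` check is a simplified stand-in for the Pydantic validation, not a replacement for it.

```python
# Shape of a configuration matching the documented fields.
# All values are illustrative placeholders, not recommended settings.
raw_config = {
    "grid_search_params": {
        "chunk_size": [256, 512],
        "chunk_overlap": [0, 32],
        "chunking_strategy": ["recursive"],  # placeholder strategy name
        "similarity_metrics": ["cosine"],    # placeholder metric name
        "themes": {"example_theme": ["keyword one", "keyword two"]},
    },
    "models_to_test": [
        # base_url and timeout are optional and omitted here.
        {"type": "fastembed", "name": "example-model", "dimensions": 384},
    ],
    "similarity_threshold": 0.5,
    "output_dir": "reports",
    # generate_filtered_markdowns is optional (defaults to False).
    "database": {"intelligent_quantization": True},
    "multiprocessing": {
        "max_workers_api": 4,
        # max_workers_local is optional and omitted here.
        "maxtasksperchild": 10,
        "embedding_batch_size_api": 32,
        "embedding_batch_size_local": 64,
        "file_batch_size": 8,
        "api_batch_sizes": {"example_provider": 16},
    },
}

# Top-level fields without a default in the AppConfig signature.
REQUIRED_TOP_LEVEL = {
    "grid_search_params",
    "models_to_test",
    "similarity_threshold",
    "output_dir",
    "database",
    "multiprocessing",
}


def missing_keys(config):
    """Return the required top-level keys absent from config, sorted."""
    return sorted(REQUIRED_TOP_LEVEL - set(config))


print(missing_keys(raw_config))
```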

src.core.config.load_config(config_path)

Load and validate a YAML configuration file.

Parameters:

config_path (str) – Path to the YAML configuration file.

Returns:

A validated AppConfig instance.

Raises:
  • FileNotFoundError – If the configuration file does not exist.

  • yaml.YAMLError – If the YAML file is malformed.

  • pydantic.ValidationError – If the configuration fails validation.

Return type:

AppConfig