Core Module
The core module contains the main orchestration logic for ForzaEmbed.
ForzaEmbed Class
- class src.core.core.ForzaEmbed(db_path='reports/config_ForzaEmbed.db', config_path='configs/config.yml')
Bases: object
Main orchestrator for the embedding analysis and reporting pipeline.
This class manages the complete workflow for embedding analysis, including loading configurations, running grid searches across multiple parameter combinations, and generating comprehensive reports.
- db_path
Path to the SQLite database file.
- config_path
Path to the YAML configuration file.
- config
The loaded application configuration.
- config_name
Name derived from the configuration file.
- db
The embedding database instance.
- output_dir
Directory for output files.
- processor
The data processor instance.
- report_generator
The report generator instance.
- __init__(db_path='reports/config_ForzaEmbed.db', config_path='configs/config.yml')
Initialize the ForzaEmbed instance.
- run_grid_search(data_source, resume=True)
Run the complete grid search pipeline.
Executes the embedding analysis across all parameter combinations defined in the configuration. Supports resumption from the last completed combination.
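The resumption behaviour described above can be pictured with a small, self-contained sketch. It is not the ForzaEmbed implementation: the helper name iter_pending_combinations, the grid values, and the way completed combinations are stored are all assumptions made for illustration.

```python
from itertools import product


def iter_pending_combinations(grid, completed):
    """Yield parameter combinations that are not yet in `completed`.

    `grid` maps a parameter name to the list of values to try.
    `completed` is a set of combinations already processed, each stored
    as a tuple of (name, value) pairs in sorted-name order (an assumed
    storage format, chosen here only because tuples are hashable).
    """
    keys = sorted(grid)  # fixed key order keeps combinations reproducible
    for values in product(*(grid[key] for key in keys)):
        combo = tuple(zip(keys, values))
        if combo not in completed:
            yield dict(combo)


# Hypothetical grid: the parameter names echo the documented ones,
# the values are made up.
grid = {
    "chunk_size": [256, 512],
    "chunk_overlap": [0, 32],
    "similarity_metric": ["cosine"],
}

# Pretend one combination finished before the run was interrupted.
completed = {
    (("chunk_overlap", 0), ("chunk_size", 256), ("similarity_metric", "cosine")),
}

pending = list(iter_pending_combinations(grid, completed))
print(len(pending))  # 3 of the 4 combinations remain
```

Skipping finished combinations by membership test, rather than by position, means the sketch still behaves sensibly if the grid is extended between runs.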
Processor Class
- class src.core.processing.Processor(db, config)
Bases: object
Handle core data processing logic for embedding analysis.
This class orchestrates the processing pipeline for a single test run, delegating specific tasks to specialized services for embedding generation, similarity calculation, and visualization.
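As a rough picture of the similarity step mentioned above, the sketch below scores chunk vectors against a theme vector and keeps those at or above a threshold. Cosine similarity is assumed here as one possible metric, and the function names, toy vectors, and threshold handling are hypothetical; the actual similarity service may work differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def matching_chunks(chunk_vectors, theme_vector, threshold):
    """Return indices of chunks whose similarity to the theme meets the threshold."""
    return [
        index
        for index, vector in enumerate(chunk_vectors)
        if cosine_similarity(vector, theme_vector) >= threshold
    ]


# Toy 2-D "embeddings"; real embeddings have many more dimensions.
chunk_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
theme_vector = [1.0, 0.0]

matches = matching_chunks(chunk_vectors, theme_vector, threshold=0.7)
print(matches)  # [0, 2]
```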
- db
The embedding database instance.
- config
The application configuration.
- embedding_service
Service for embedding generation and caching.
- similarity_service
Service for similarity calculations.
- visualization_service
Service for t-SNE visualization.
- __init__(db, config)
Initialize the Processor.
- Parameters:
db (EmbeddingDatabase) – The embedding database instance for storing results.
config (AppConfig) – The application configuration.
- run_test(rows, model_config, chunk_size, chunk_overlap, themes, theme_name, chunking_strategy, similarity_metric, processed_files, pbar)
Process a test run for a specific parameter combination.
Handles the complete workflow for processing files including embedding generation, similarity calculation, and metric evaluation.
- Parameters:
rows (list[tuple[str, str]]) – List of (name, content) tuples for files to process.
model_config (ModelConfig) – The model configuration to use.
chunk_size (int) – Size of text chunks in characters.
chunk_overlap (int) – Overlap between chunks in characters.
themes (list[str]) – List of theme keywords to compare against.
theme_name (str) – Name identifier for the theme set.
chunking_strategy (str) – The chunking strategy to use.
similarity_metric (str) – The similarity metric to use.
processed_files (list[str]) – List of file names already processed.
pbar (tqdm) – Progress bar object for status updates.
- Returns:
Dictionary containing processing results with file data and metrics.
- Return type:
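To make the chunk_size and chunk_overlap parameters above concrete, here is a minimal sketch of fixed-size character chunking with overlap. It assumes the simplest possible strategy; the strategies selectable through chunking_strategy are not described in this reference, and chunk_text is a hypothetical helper, not part of ForzaEmbed.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split `text` into chunks of at most `chunk_size` characters.

    Consecutive chunks share `chunk_overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current chunk already reaches the end of the text
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Requiring chunk_overlap to be smaller than chunk_size guarantees the window always advances, so the loop terminates.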
Configuration
Configuration management for ForzaEmbed.
This module defines Pydantic models for application configuration and provides functions to load and validate YAML configuration files. It handles all configuration aspects including grid search parameters, model settings, database options, and multiprocessing settings.
Example
Load a configuration file:
from src.core.config import load_config
config = load_config("configs/config.yml")
print(config.models_to_test)
- class src.core.config.GridSearchParams(*, chunk_size, chunk_overlap, chunking_strategy, similarity_metrics, themes)
Bases: BaseModel
Configuration for grid search parameters.
- model_config = {}
Configuration for the model; should be a dictionary conforming to ConfigDict (pydantic.config.ConfigDict).
- class src.core.config.ModelConfig(*, type, name, dimensions, base_url=None, timeout=None)
Bases: BaseModel
Configuration for an embedding model.
- model_config = {}
Configuration for the model; should be a dictionary conforming to ConfigDict (pydantic.config.ConfigDict).
- class src.core.config.DatabaseSettings(*, intelligent_quantization)
Bases: BaseModel
Configuration for database settings.
- intelligent_quantization
Whether to enable intelligent quantization for reducing storage size.
- Type:
- model_config = {}
Configuration for the model; should be a dictionary conforming to ConfigDict (pydantic.config.ConfigDict).
- class src.core.config.MultiprocessingSettings(*, max_workers_api, max_workers_local=None, maxtasksperchild, embedding_batch_size_api, embedding_batch_size_local, file_batch_size, api_batch_sizes)
Bases: BaseModel
Configuration for multiprocessing settings.
- model_config = {}
Configuration for the model; should be a dictionary conforming to ConfigDict (pydantic.config.ConfigDict).
- class src.core.config.AppConfig(*, grid_search_params, models_to_test, similarity_threshold, output_dir, generate_filtered_markdowns=False, database, multiprocessing)
Bases: BaseModel
Main application configuration.
- grid_search_params
Configuration for grid search parameters.
- models_to_test
List of model configurations to evaluate.
- Type:
- database
Database-related settings.
- multiprocessing
Multiprocessing-related settings.
- grid_search_params: GridSearchParams
- models_to_test: List[ModelConfig]
- database: DatabaseSettings
- multiprocessing: MultiprocessingSettings
- model_config = {}
Configuration for the model; should be a dictionary conforming to ConfigDict (pydantic.config.ConfigDict).
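The AppConfig signature above implies which top-level keys a configuration must supply: every keyword without a default is required, while generate_filtered_markdowns falls back to False. The standard-library sketch below mirrors only that top-level check. In the real module Pydantic performs the validation, including field types and nested models; check_top_level and the sample values are hypothetical.

```python
# Required and optional top-level keys, read off the AppConfig signature.
REQUIRED_KEYS = {
    "grid_search_params",
    "models_to_test",
    "similarity_threshold",
    "output_dir",
    "database",
    "multiprocessing",
}
OPTIONAL_DEFAULTS = {"generate_filtered_markdowns": False}


def check_top_level(raw):
    """Return (missing_keys, config_with_defaults) for a parsed config mapping."""
    missing = sorted(REQUIRED_KEYS - set(raw))
    merged = {**OPTIONAL_DEFAULTS, **raw}  # values in `raw` override defaults
    return missing, merged


# Hypothetical parsed YAML with the `database` section left out.
raw = {
    "grid_search_params": {},
    "models_to_test": [],
    "similarity_threshold": 0.6,
    "output_dir": "reports",
    "multiprocessing": {},
}

missing, merged = check_top_level(raw)
print(missing)  # ['database']
print(merged["generate_filtered_markdowns"])  # False
```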
- src.core.config.load_config(config_path)
Load and validate a YAML configuration file.
- Parameters:
config_path (str) – Path to the YAML configuration file.
- Returns:
A validated AppConfig instance.
- Raises:
FileNotFoundError – If the configuration file does not exist.
yaml.YAMLError – If the YAML file is malformed.
pydantic.ValidationError – If the configuration fails validation.
- Return type: