Core Module
The core module contains the main orchestration logic for ForzaEmbed.
ForzaEmbed Class
Processor Class
- class src.core.processing.Processor(db, config)[source]
Bases: object
Handle core data processing logic for embedding analysis.
This class orchestrates the processing pipeline for a single test run, delegating specific tasks to specialized services for embedding generation, similarity calculation, and visualization.
- db
The embedding database instance.
- config
The application configuration.
- embedding_service
Service for embedding generation and caching.
- similarity_service
Service for similarity calculations.
- visualization_service
Service for t-SNE visualization.
- __init__(db, config)[source]
Initialize the Processor.
- Parameters:
db (EmbeddingDatabase) – The embedding database instance for storing results.
config (AppConfig) – The application configuration.
- run_test(rows, model_config, chunk_size, chunk_overlap, themes, theme_name, chunking_strategy, similarity_metric, processed_files, pbar)[source]
Process a test run for a specific parameter combination.
Handles the complete workflow for processing files including embedding generation, similarity calculation, and metric evaluation.
- Parameters:
rows (list[tuple[str, str]]) – List of (name, content) tuples for files to process.
model_config (ModelConfig) – The model configuration to use.
chunk_size (int) – Size of text chunks in characters.
chunk_overlap (int) – Overlap between chunks in characters.
themes (list[str]) – List of theme keywords to compare against.
theme_name (str) – Name identifier for the theme set.
chunking_strategy (str) – The chunking strategy to use.
similarity_metric (str) – The similarity metric to use.
processed_files (list[str]) – List of file names already processed.
pbar (tqdm) – Progress bar object for status updates.
- Returns:
Dictionary containing processing results with file data and metrics.
- Return type:
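The chunk_size and chunk_overlap parameters above are both measured in characters. As a minimal sketch of how such a pair can interact, assuming a plain fixed-size sliding window (the chunking strategies ForzaEmbed actually supports are not listed in this section, and chunk_text is a hypothetical helper, not part of the documented API):

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters. This is an
    illustrative fixed-window strategy, not ForzaEmbed's implementation.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


if __name__ == "__main__":
    for piece in chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2):
        print(piece)
```

With a window of 4 and an overlap of 2, each chunk starts 2 characters after the previous one, so text near a chunk boundary appears in two chunks.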
Configuration
Configuration management for ForzaEmbed.
This module defines Pydantic models for application configuration and provides functions to load and validate YAML configuration files. It handles all configuration aspects including grid search parameters, model settings, database options, and multiprocessing settings.
Example
Load a configuration file:
from src.core.config import load_config
config = load_config("configs/config.yml")
print(config.models_to_test)
- class src.core.config.GridSearchParams(*, chunk_size, chunk_overlap, chunking_strategy, similarity_metrics, themes)[source]
Bases: BaseModel
Configuration for grid search parameters.
- model_config = {}
Pydantic model configuration (unrelated to the embedding model); should be a dictionary conforming to pydantic.config.ConfigDict.
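A grid search evaluates every combination of the parameters above, with each combination corresponding to one test run. The sketch below shows one way to enumerate such combinations. It assumes each parameter holds a list of candidate values; the grid values and the iter_combinations helper are hypothetical and not taken from ForzaEmbed.

```python
from itertools import product

# Hypothetical candidate values; real ones come from the YAML configuration.
grid = {
    "chunk_size": [256, 512],
    "chunk_overlap": [0, 64],
    "chunking_strategy": ["fixed"],
    "similarity_metric": ["cosine", "euclidean"],
}


def iter_combinations(grid: dict[str, list]) -> list[dict]:
    """Return one dict per point in the Cartesian product of the grid."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*grid.values())]


combos = iter_combinations(grid)

if __name__ == "__main__":
    print(len(combos))
    print(combos[0])
```

The number of runs is the product of the list lengths, so adding one more value to any parameter multiplies the total work.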
- class src.core.config.ModelConfig(*, type, name, dimensions, base_url=None, timeout=None, max_tokens=None, pooling_strategy='max')[source]
Bases: BaseModel
Configuration for an embedding model.
- max_tokens
Optional maximum number of tokens per text. When a text exceeds this limit, it will be split into smaller chunks and recombined. If None, uses model default (typically 512).
- Type:
int | None
- pooling_strategy
Optional strategy for combining chunk embeddings when text exceeds max_tokens. Options: “max” (default), “average”, “weighted”, “last”.
- “max”: Max pooling; captures the most salient features.
- “average”: Mean of all chunks; preserves overall semantics.
- “weighted”: First chunks weighted more; useful for structured documents.
- “last”: Uses only the last chunk; useful for summaries and conclusions.
- Type:
- model_config = {}
Pydantic model configuration (unrelated to the embedding model); should be a dictionary conforming to pydantic.config.ConfigDict.
- class src.core.config.DatabaseSettings(*, intelligent_quantization, quantize_metrics=True)[source]
Bases: BaseModel
Configuration for database settings.
- intelligent_quantization
Whether to enable intelligent quantization for reducing storage size.
- Type:
- quantize_metrics
Whether to quantize metrics (similarities, scores). If True, metrics are stored with reduced precision (uint16) to save space. If False, metrics are stored in full float32 precision. Set to False if you need exact metric values without any quantization loss.
- Type:
- model_config = {}
Pydantic model configuration (unrelated to the embedding model); should be a dictionary conforming to pydantic.config.ConfigDict.
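The quantize_metrics option trades precision for space by storing metrics as uint16 instead of float32. The following sketch shows what such a round trip can look like with linear scaling. The value range of [-1, 1] and the quantize and dequantize helpers are assumptions made for illustration; the scheme ForzaEmbed uses is not described in this section.

```python
UINT16_MAX = 65535  # largest value representable in an unsigned 16-bit integer


def quantize(value: float, lo: float = -1.0, hi: float = 1.0) -> int:
    """Map a float in [lo, hi] to an integer code in [0, 65535]."""
    clipped = min(max(value, lo), hi)  # out-of-range values saturate
    return round((clipped - lo) / (hi - lo) * UINT16_MAX)


def dequantize(code: int, lo: float = -1.0, hi: float = 1.0) -> float:
    """Map an integer code in [0, 65535] back to a float in [lo, hi]."""
    return lo + code / UINT16_MAX * (hi - lo)


if __name__ == "__main__":
    original = 0.1234
    restored = dequantize(quantize(original))
    print(original, restored, abs(original - restored))
```

Under these assumptions the round-trip error is at most half a quantization step, about 1.5e-5 for a range of width 2, which is the loss the option lets you avoid by setting it to False.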
- class src.core.config.EmbeddingPoolingStrategy[source]
Bases: str
Strategy for combining embeddings when text exceeds model token limit.
When a text is too long for the embedding model, it’s split into smaller chunks and their embeddings are combined using one of these strategies:
- “max”: Max pooling - takes the maximum value across all chunks for each
dimension. Best for capturing the most salient features.
- “average”: Average pooling - computes the mean of all chunk embeddings.
Preserves overall semantic content but may dilute important features.
- “weighted”: Weighted pooling - gives more importance to the first chunks.
Useful when the beginning of text is more informative.
- “last”: Uses only the last chunk embedding. Useful when the end of text
contains summaries or conclusions.
- MAX = 'max'
- AVERAGE = 'average'
- WEIGHTED = 'weighted'
- LAST = 'last'
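The four strategies above can be sketched in plain Python as follows. PoolingStrategy is a hypothetical stand-in for EmbeddingPoolingStrategy, pool is an illustrative helper, and the linearly decreasing weights used for "weighted" are an assumption: the documentation only says that the first chunks receive more importance.

```python
from enum import Enum


class PoolingStrategy(str, Enum):
    """Hypothetical stand-in mirroring the documented member values."""
    MAX = "max"
    AVERAGE = "average"
    WEIGHTED = "weighted"
    LAST = "last"


def pool(chunks: list[list[float]], strategy: PoolingStrategy) -> list[float]:
    """Combine per-chunk embeddings of equal length into one vector."""
    if not chunks:
        raise ValueError("need at least one chunk embedding")
    columns = list(zip(*chunks))  # one tuple of values per embedding dimension

    if strategy is PoolingStrategy.MAX:
        return [max(col) for col in columns]
    if strategy is PoolingStrategy.AVERAGE:
        return [sum(col) / len(col) for col in columns]
    if strategy is PoolingStrategy.WEIGHTED:
        n = len(chunks)
        weights = [n - i for i in range(n)]  # assumption: n, n-1, ..., 1
        total = sum(weights)
        return [sum(w * v for w, v in zip(weights, col)) / total for col in columns]
    if strategy is PoolingStrategy.LAST:
        return list(chunks[-1])
    raise ValueError(f"unknown strategy: {strategy!r}")


if __name__ == "__main__":
    embeddings = [[1.0, 4.0], [3.0, 2.0]]
    for strategy in PoolingStrategy:
        print(strategy.value, pool(embeddings, strategy))
```

Because the enum derives from str, a value read from configuration such as "max" can be converted with PoolingStrategy("max") and still compares equal to the plain string.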
- class src.core.config.MultiprocessingSettings(*, max_workers_api=16, max_workers_local=None, maxtasksperchild=10, embedding_batch_size_api=100, embedding_batch_size_local=500, file_batch_size=50, api_batch_sizes=<factory>)[source]
Bases: BaseModel
Configuration for multiprocessing settings.
- model_config = {}
Pydantic model configuration (unrelated to the embedding model); should be a dictionary conforming to pydantic.config.ConfigDict.
- class src.core.config.AppConfig(*, grid_search_params, models_to_test, output_dir='reports', generate_filtered_markdowns=False, database, multiprocessing)[source]
Bases: BaseModel
Main application configuration.
- grid_search_params
Configuration for grid search parameters.
- models_to_test
List of model configurations to evaluate.
- Type:
- database
Database-related settings.
- multiprocessing
Multiprocessing-related settings.
- grid_search_params: GridSearchParams
- models_to_test: List[ModelConfig]
- database: DatabaseSettings
- multiprocessing: MultiprocessingSettings
- model_config = {}
Pydantic model configuration (unrelated to the embedding model); should be a dictionary conforming to pydantic.config.ConfigDict.
- src.core.config.load_config(config_path)[source]
Load and validate a YAML configuration file.
- Parameters:
config_path (str) – Path to the YAML configuration file.
- Returns:
A validated AppConfig instance.
- Raises:
FileNotFoundError – If the configuration file does not exist.
yaml.YAMLError – If the YAML file is malformed.
pydantic.ValidationError – If the configuration fails validation.
- Return type:
AppConfig