Core Module

The core module contains the main orchestration logic for ForzaEmbed.

ForzaEmbed Class

Processor Class

class src.core.processing.Processor(db, config)

Bases: object

Handle core data processing logic for embedding analysis.

This class orchestrates the processing pipeline for a single test run, delegating specific tasks to specialized services for embedding generation, similarity calculation, and visualization.

db: The embedding database instance.

config: The application configuration.

embedding_service: Service for embedding generation and caching.

similarity_service: Service for similarity calculations.

visualization_service: Service for t-SNE visualization.

__init__(db, config)

Initialize the Processor.

Parameters:

db (EmbeddingDatabase) – The embedding database instance for storing results.
config (AppConfig) – The application configuration.

run_test(rows, model_config, chunk_size, chunk_overlap, themes, theme_name, chunking_strategy, similarity_metric, processed_files, pbar)

Process a test run for a specific parameter combination.

Handles the complete workflow for processing files including embedding generation, similarity calculation, and metric evaluation.

Parameters:

rows (list[tuple[str, str]]) – List of (name, content) tuples for files to process.
model_config (ModelConfig) – The model configuration to use.
chunk_size (int) – Size of text chunks in characters.
chunk_overlap (int) – Overlap between chunks in characters.
themes (list[str]) – List of theme keywords to compare against.
theme_name (str) – Name identifier for the theme set.
chunking_strategy (str) – The chunking strategy to use.
similarity_metric (str) – The similarity metric to use.
processed_files (list[str]) – List of file names already processed.
pbar (tqdm) – Progress bar object for status updates.

Returns:

Dictionary containing processing results with file data and metrics.

Return type:

dict[str, Any]
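
The chunk_size and chunk_overlap parameters are both measured in characters. How ForzaEmbed splits text internally is not shown in this reference, so the following is only a minimal sketch of fixed-size character windows with overlap. The function name chunk_text is hypothetical and is not part of the documented API.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration; not ForzaEmbed's implementation.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)
```

Each window shares its first chunk_overlap characters with the end of the previous one, so a larger overlap yields more chunks for the same text.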

Configuration

Configuration management for ForzaEmbed.

This module defines Pydantic models for application configuration and provides functions to load and validate YAML configuration files. It handles all configuration aspects including grid search parameters, model settings, database options, and multiprocessing settings.

Example

Load a configuration file:

from src.core.config import load_config

config = load_config("configs/config.yml")
print(config.models_to_test)

class src.core.config.GridSearchParams(*, chunk_size, chunk_overlap, chunking_strategy, similarity_metrics, themes)

Bases: BaseModel

Configuration for grid search parameters.

chunk_size

List of chunk sizes to test (in characters).

Type: List[int]

chunk_overlap

List of chunk overlaps to test (in characters).

Type: List[int]

chunking_strategy

List of chunking strategies to evaluate.

Type: List[str]

similarity_metrics

List of similarity metrics to use.

Type: List[str]

themes

Mapping of theme names to lists of theme keywords.

Type: Dict[str, List[str]]

chunk_size: List[int]

chunk_overlap: List[int]

chunking_strategy: List[str]

similarity_metrics: List[str]

themes: Dict[str, List[str]]

model_config = {}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
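
Because every field of GridSearchParams is a list (or a mapping, for themes), a grid search amounts to iterating over the Cartesian product of those values. The sketch below illustrates that expansion with plain dictionaries rather than the Pydantic model. The function expand_grid, the strategy and metric names, and the rule that skips overlaps not smaller than the chunk size are all assumptions made for illustration, not documented ForzaEmbed behavior.

```python
from itertools import product

# Plain-dict stand-in for GridSearchParams; all values are hypothetical.
grid = {
    "chunk_size": [256, 512],
    "chunk_overlap": [0, 64],
    "chunking_strategy": ["recursive"],
    "similarity_metrics": ["cosine", "dot"],
    "themes": {"energy": ["solar", "wind"]},
}


def expand_grid(grid: dict) -> list[dict]:
    """Return one dict per parameter combination in the grid."""
    combos = []
    for size, overlap, strategy, metric, (theme_name, themes) in product(
        grid["chunk_size"],
        grid["chunk_overlap"],
        grid["chunking_strategy"],
        grid["similarity_metrics"],
        grid["themes"].items(),
    ):
        if overlap >= size:
            continue  # assumption: an overlap this large is not a usable run
        combos.append(
            {
                "chunk_size": size,
                "chunk_overlap": overlap,
                "chunking_strategy": strategy,
                "similarity_metric": metric,
                "theme_name": theme_name,
                "themes": themes,
            }
        )
    return combos


combos = expand_grid(grid)
print(len(combos))
```

The keys of each combination mirror the per-run arguments of Processor.run_test, which receives a single value for each of these parameters.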

class src.core.config.ModelConfig(*, type, name, dimensions, base_url=None, timeout=None, max_tokens=None, pooling_strategy='max')

Bases: BaseModel

Configuration for an embedding model.

type

The type of model (e.g., ‘api’, ‘fastembed’, ‘sentence_transformers’).

Type: str

name

The model name or identifier.

Type: str

dimensions

The embedding dimension of the model.

Type: int

base_url

Optional base URL for API-based models.

Type: str | None

timeout

Optional request timeout in seconds for API models.

Type: int | None

max_tokens

Optional maximum number of tokens per text. When a text exceeds this limit, it is split into smaller chunks and the chunk embeddings are recombined according to pooling_strategy. If None, the model default is used (typically 512).

Type: int | None

pooling_strategy

Optional strategy for combining chunk embeddings when text exceeds max_tokens. Options:

“max” (default): Max pooling - captures most salient features.
“average”: Mean of all chunks - preserves overall semantics.
“weighted”: First chunks weighted more - useful for structured documents.
“last”: Uses only the last chunk - useful for summaries/conclusions.

Type: str

type: str

name: str

dimensions: int

base_url: str | None

timeout: int | None

max_tokens: int | None

pooling_strategy: str

model_config = {}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class src.core.config.DatabaseSettings(*, intelligent_quantization, quantize_metrics=True)

Bases: BaseModel

Configuration for database settings.

intelligent_quantization

Whether to enable intelligent quantization for reducing storage size.

Type: bool

quantize_metrics

Whether to quantize metrics (similarities, scores). If True, metrics are stored with reduced precision (uint16) to save space. If False, metrics are stored in full float32 precision. Set to False if you need exact metric values without any quantization loss.

Type: bool

intelligent_quantization: bool

quantize_metrics: bool

model_config = {}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
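
To give a sense of what storing metrics as uint16 costs in precision, the sketch below maps floats onto the 65,536 available integer codes and back. The encoding ForzaEmbed actually uses is not described in this reference; the linear scaling over a fixed [-1, 1] range and the function names quantize and dequantize are assumptions made for illustration.

```python
def quantize(values, lo=-1.0, hi=1.0):
    """Map floats in [lo, hi] to integer codes in 0..65535 (the uint16 range).

    Assumption: linear scaling over a fixed range; values outside are clamped.
    """
    scale = 65535 / (hi - lo)
    return [round((min(max(v, lo), hi) - lo) * scale) for v in values]


def dequantize(codes, lo=-1.0, hi=1.0):
    """Recover approximate floats from the integer codes."""
    step = (hi - lo) / 65535
    return [lo + c * step for c in codes]


similarities = [0.8731, -0.25, 0.0]
codes = quantize(similarities)
restored = dequantize(codes)

# Worst-case rounding error for each value after a round trip.
max_error = max(abs(a - b) for a, b in zip(similarities, restored))
print(codes, max_error)
```

Under this assumed scheme the round-trip error is bounded by half a quantization step, roughly 1.5e-5 for a [-1, 1] range, which is the kind of loss that setting quantize_metrics to False avoids.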

class src.core.config.EmbeddingPoolingStrategy

Bases: str

Strategy for combining embeddings when text exceeds model token limit.

When a text is too long for the embedding model, it’s split into smaller chunks and their embeddings are combined using one of these strategies:

“max”: Max pooling - takes the maximum value across all chunks for each dimension. Best for capturing the most salient features.
“average”: Average pooling - computes the mean of all chunk embeddings. Preserves overall semantic content but may dilute important features.
“weighted”: Weighted pooling - gives more importance to the first chunks. Useful when the beginning of text is more informative.
“last”: Uses only the last chunk embedding. Useful when the end of text contains summaries or conclusions.

MAX = 'max'

AVERAGE = 'average'

WEIGHTED = 'weighted'

LAST = 'last'
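
The four strategies can be sketched in a few lines of plain Python operating on lists of floats. This is an illustration only, not ForzaEmbed's implementation: the function name pool_embeddings is hypothetical, and because the documentation does not state the weights used by “weighted”, the sketch assumes linearly decreasing weights (n, n-1, ..., 1).

```python
def pool_embeddings(chunks: list[list[float]], strategy: str = "max") -> list[float]:
    """Combine per-chunk embeddings into a single vector.

    Hypothetical helper; `chunks` holds one embedding per text chunk, in order.
    """
    if not chunks:
        raise ValueError("at least one chunk embedding is required")

    columns = list(zip(*chunks))  # one tuple of values per embedding dimension

    if strategy == "max":
        return [max(col) for col in columns]
    if strategy == "average":
        return [sum(col) / len(col) for col in columns]
    if strategy == "weighted":
        # Assumption: linearly decreasing weights so earlier chunks count more.
        n = len(chunks)
        weights = [n - i for i in range(n)]
        total = sum(weights)
        return [sum(w * v for w, v in zip(weights, col)) / total for col in columns]
    if strategy == "last":
        return list(chunks[-1])
    raise ValueError(f"unknown pooling strategy: {strategy!r}")


chunk_embeddings = [[1.0, 0.0], [0.0, 4.0], [2.0, 2.0]]
for name in ("max", "average", "weighted", "last"):
    print(name, pool_embeddings(chunk_embeddings, name))
```

All four strategies return a vector with the same dimension as each chunk embedding, so the pooled result can be stored and compared like any other embedding.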

class src.core.config.MultiprocessingSettings(*, max_workers_api=16, max_workers_local=None, maxtasksperchild=10, embedding_batch_size_api=100, embedding_batch_size_local=500, file_batch_size=50, api_batch_sizes=<factory>)

Bases: BaseModel

Configuration for multiprocessing settings.

max_workers_api

Maximum number of workers for API-based embedding calls.

Type: int

max_workers_local

Optional maximum workers for local model inference.

Type: int | None

maxtasksperchild

Maximum tasks per worker child before respawning.

Type: int

embedding_batch_size_api

Batch size for API embedding requests.

Type: int

embedding_batch_size_local

Batch size for local model embedding.

Type: int

file_batch_size

Number of files to process per batch.

Type: int

api_batch_sizes

Mapping of provider names to their specific batch sizes.

Type: Dict[str, int]

max_workers_api: int

max_workers_local: int | None

maxtasksperchild: int

embedding_batch_size_api: int

embedding_batch_size_local: int

file_batch_size: int

api_batch_sizes: Dict[str, int]

model_config = {}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
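
One plausible reading of how embedding_batch_size_api and api_batch_sizes interact is that a provider-specific entry overrides the general default. The reference does not confirm this lookup order, so the sketch below is an assumption; the helper names batch_size_for and batched and the provider names are hypothetical.

```python
from typing import Iterator


def batch_size_for(provider: str, api_batch_sizes: dict[str, int], default: int) -> int:
    """Pick the provider-specific batch size, falling back to the default.

    Assumption: a provider entry in api_batch_sizes takes precedence.
    """
    return api_batch_sizes.get(provider, default)


def batched(items: list, batch_size: int) -> Iterator[list]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical settings mirroring the documented defaults.
embedding_batch_size_api = 100
api_batch_sizes = {"provider_a": 64}

texts = [f"chunk {i}" for i in range(250)]

default_sizes = [
    len(b)
    for b in batched(texts, batch_size_for("provider_b", api_batch_sizes, embedding_batch_size_api))
]
override_sizes = [
    len(b)
    for b in batched(texts, batch_size_for("provider_a", api_batch_sizes, embedding_batch_size_api))
]
print(default_sizes, override_sizes)
```

With 250 texts, the default of 100 produces three requests while the smaller provider-specific size produces four, which shows how a per-provider limit changes the number of API calls.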

class src.core.config.AppConfig(*, grid_search_params, models_to_test, output_dir='reports', generate_filtered_markdowns=False, database, multiprocessing)