Core Module

The core module contains the main orchestration logic for ForzaEmbed.

ForzaEmbed Class

Processor Class

class src.core.processing.Processor(db, config)

Bases: object

Handle core data processing logic for embedding analysis.

This class orchestrates the processing pipeline for a single test run, delegating specific tasks to specialized services for embedding generation, similarity calculation, and visualization.

db

The embedding database instance.

config

The application configuration.

embedding_service

Service for embedding generation and caching.

similarity_service

Service for similarity calculations.

visualization_service

Service for t-SNE visualization.

__init__(db, config)

Initialize the Processor.

Parameters:
  • db (EmbeddingDatabase) – The embedding database instance for storing results.

  • config (AppConfig) – The application configuration.

run_test(rows, model_config, chunk_size, chunk_overlap, themes, theme_name, chunking_strategy, similarity_metric, processed_files, pbar)

Process a test run for a specific parameter combination.

Handles the complete workflow for processing files including embedding generation, similarity calculation, and metric evaluation.

Parameters:
  • rows (list[tuple[str, str]]) – List of (name, content) tuples for files to process.

  • model_config (ModelConfig) – The model configuration to use.

  • chunk_size (int) – Size of text chunks in characters.

  • chunk_overlap (int) – Overlap between chunks in characters.

  • themes (list[str]) – List of theme keywords to compare against.

  • theme_name (str) – Name identifier for the theme set.

  • chunking_strategy (str) – The chunking strategy to use.

  • similarity_metric (str) – The similarity metric to use.

  • processed_files (list[str]) – List of file names already processed.

  • pbar (tqdm) – Progress bar object for status updates.

Returns:

Dictionary containing processing results with file data and metrics.

Return type:

dict[str, Any]
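The chunk_size and chunk_overlap parameters are both measured in characters. As a rough illustration of how the two interact, here is a minimal sliding-window sketch. The helper name chunk_text is hypothetical and is not part of ForzaEmbed; the project's actual chunking strategies may split text differently.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into character windows of chunk_size that overlap by chunk_overlap.

    Hypothetical helper for illustration only; not ForzaEmbed's implementation.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


example_chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
```

With a chunk size of 4 and an overlap of 2, each window advances by 2 characters, so neighbouring chunks share their boundary characters.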

Configuration

Configuration management for ForzaEmbed.

This module defines Pydantic models for application configuration and provides functions to load and validate YAML configuration files. It handles all configuration aspects including grid search parameters, model settings, database options, and multiprocessing settings.

Example

Load a configuration file:

from src.core.config import load_config

config = load_config("configs/config.yml")
print(config.models_to_test)

class src.core.config.GridSearchParams(*, chunk_size, chunk_overlap, chunking_strategy, similarity_metrics, themes)

Bases: BaseModel

Configuration for grid search parameters.

chunk_size

List of chunk sizes to test (in characters).

Type:

List[int]

chunk_overlap

List of chunk overlaps to test (in characters).

Type:

List[int]

chunking_strategy

List of chunking strategies to evaluate.

Type:

List[str]

similarity_metrics

List of similarity metrics to use.

Type:

List[str]

themes

Mapping of theme names to lists of theme keywords.

Type:

Dict[str, List[str]]

chunk_size: List[int]
chunk_overlap: List[int]
chunking_strategy: List[str]
similarity_metrics: List[str]
themes: Dict[str, List[str]]
model_config = {}

Pydantic model configuration. Should be a dictionary conforming to pydantic.config.ConfigDict.
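Because every field of GridSearchParams is a list (or a mapping of lists), a grid search amounts to iterating over their Cartesian product. The sketch below uses a plain dictionary with the same field names. The strategy name "fixed", the metric name "cosine", and the rule that skips combinations whose overlap is not smaller than the chunk size are all assumptions made for illustration, not documented ForzaEmbed behaviour.

```python
from itertools import product

# Plain-dict stand-in for GridSearchParams; values are illustrative only.
grid = {
    "chunk_size": [256, 512],
    "chunk_overlap": [0, 64],
    "chunking_strategy": ["fixed"],      # hypothetical strategy name
    "similarity_metrics": ["cosine"],    # hypothetical metric name
    "themes": {"energy": ["solar", "wind"]},
}


def iter_combinations(params):
    """Yield one dict per parameter combination in the grid."""
    for size, overlap, strategy, metric, (theme_name, themes) in product(
        params["chunk_size"],
        params["chunk_overlap"],
        params["chunking_strategy"],
        params["similarity_metrics"],
        params["themes"].items(),
    ):
        if overlap >= size:
            continue  # assumption: such a combination would be meaningless
        yield {
            "chunk_size": size,
            "chunk_overlap": overlap,
            "chunking_strategy": strategy,
            "similarity_metric": metric,
            "theme_name": theme_name,
            "themes": themes,
        }


combinations = list(iter_combinations(grid))
```

Each yielded dictionary carries the same names as the per-run arguments of Processor.run_test, which takes a single value for each parameter.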

class src.core.config.ModelConfig(*, type, name, dimensions, base_url=None, timeout=None, max_tokens=None, pooling_strategy='max')

Bases: BaseModel

Configuration for an embedding model.

type

The type of model (e.g., ‘api’, ‘fastembed’, ‘sentence_transformers’).

Type:

str

name

The model name or identifier.

Type:

str

dimensions

The embedding dimension of the model.

Type:

int

base_url

Optional base URL for API-based models.

Type:

str | None

timeout

Optional request timeout in seconds for API models.

Type:

int | None

max_tokens

Optional maximum number of tokens per text. When a text exceeds this limit, it is split into smaller chunks whose embeddings are then recombined. If None, the model default is used (typically 512).

Type:

int | None

pooling_strategy

Optional strategy for combining chunk embeddings when text exceeds max_tokens. Options:

  • “max” (default): Max pooling. Captures the most salient features.

  • “average”: Mean of all chunks. Preserves overall semantics.

  • “weighted”: First chunks weighted more. Useful for structured documents.

  • “last”: Uses only the last chunk. Useful for summaries and conclusions.

Type:

str

type: str
name: str
dimensions: int
base_url: str | None
timeout: int | None
max_tokens: int | None
pooling_strategy: str
model_config = {}

Pydantic model configuration. Should be a dictionary conforming to pydantic.config.ConfigDict.

class src.core.config.DatabaseSettings(*, intelligent_quantization, quantize_metrics=True)

Bases: BaseModel

Configuration for database settings.

intelligent_quantization

Whether to enable intelligent quantization for reducing storage size.

Type:

bool

quantize_metrics

Whether to quantize metrics (similarities, scores). If True, metrics are stored with reduced precision (uint16) to save space. If False, metrics are stored in full float32 precision. Set to False if you need exact metric values without any quantization loss.

Type:

bool

intelligent_quantization: bool
quantize_metrics: bool
model_config = {}

Pydantic model configuration. Should be a dictionary conforming to pydantic.config.ConfigDict.
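To give a sense of the precision trade-off behind quantize_metrics, the following sketch maps floats onto 16-bit unsigned integer codes and back. It assumes a linear mapping over the range [-1, 1], which suits cosine-style similarities. The range, the mapping, and the function names are assumptions for illustration; the scheme ForzaEmbed actually uses is not described here.

```python
UINT16_MAX = 65535


def quantize(values, lo=-1.0, hi=1.0):
    """Map floats in [lo, hi] to integer codes in [0, 65535] (values are clamped)."""
    scale = UINT16_MAX / (hi - lo)
    return [round((min(max(v, lo), hi) - lo) * scale) for v in values]


def dequantize(codes, lo=-1.0, hi=1.0):
    """Map integer codes back to approximate float values."""
    step = (hi - lo) / UINT16_MAX
    return [lo + code * step for code in codes]


similarities = [0.123456, -0.75, 0.999]
codes = quantize(similarities)
restored = dequantize(codes)
# Worst-case round-trip error is half a quantization step.
max_error = max(abs(a - b) for a, b in zip(similarities, restored))
```

Under these assumptions one quantization step is 2 / 65535, so the round-trip error stays below about 1.6e-5. That is negligible for ranking but is not an exact value, which is the case the quantize_metrics=False setting addresses.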

class src.core.config.EmbeddingPoolingStrategy

Bases: str

Strategy for combining embeddings when text exceeds model token limit.

When a text is too long for the embedding model, it’s split into smaller chunks and their embeddings are combined using one of these strategies:

  • “max”: Max pooling - takes the maximum value across all chunks for each dimension. Best for capturing the most salient features.

  • “average”: Average pooling - computes the mean of all chunk embeddings. Preserves overall semantic content but may dilute important features.

  • “weighted”: Weighted pooling - gives more importance to the first chunks. Useful when the beginning of text is more informative.

  • “last”: Uses only the last chunk embedding. Useful when the end of text contains summaries or conclusions.
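The four strategies can be sketched in a few lines of plain Python. The function name pool_embeddings is hypothetical. The linearly decreasing weights used for "weighted" are an assumption, since the text above only says that the first chunks count more, not how the weights are chosen.

```python
def pool_embeddings(chunks: list[list[float]], strategy: str = "max") -> list[float]:
    """Combine per-chunk embeddings into one vector (illustrative sketch only)."""
    if not chunks:
        raise ValueError("at least one chunk embedding is required")

    columns = list(zip(*chunks))  # one tuple of values per embedding dimension

    if strategy == "max":
        return [max(col) for col in columns]
    if strategy == "average":
        return [sum(col) / len(col) for col in columns]
    if strategy == "weighted":
        # Assumption: weights decrease linearly, n for the first chunk down to 1.
        n = len(chunks)
        weights = [n - i for i in range(n)]
        total = sum(weights)
        return [sum(w * v for w, v in zip(weights, col)) / total for col in columns]
    if strategy == "last":
        return list(chunks[-1])
    raise ValueError(f"unknown pooling strategy: {strategy!r}")


chunk_embeddings = [[1.0, 4.0], [3.0, 2.0]]
pooled_max = pool_embeddings(chunk_embeddings, "max")
pooled_avg = pool_embeddings(chunk_embeddings, "average")
pooled_last = pool_embeddings(chunk_embeddings, "last")
```

For the two chunk vectors above, max pooling keeps the larger value in each dimension, while average pooling returns the per-dimension midpoint.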

MAX = 'max'
AVERAGE = 'average'
WEIGHTED = 'weighted'
LAST = 'last'

class src.core.config.MultiprocessingSettings(*, max_workers_api=16, max_workers_local=None, maxtasksperchild=10, embedding_batch_size_api=100, embedding_batch_size_local=500, file_batch_size=50, api_batch_sizes=<factory>)

Bases: BaseModel

Configuration for multiprocessing settings.

max_workers_api

Maximum number of workers for API-based embedding calls.

Type:

int

max_workers_local

Optional maximum workers for local model inference.

Type:

int | None

maxtasksperchild

Maximum tasks per worker child before respawning.

Type:

int

embedding_batch_size_api

Batch size for API embedding requests.

Type:

int

embedding_batch_size_local

Batch size for local model embedding.

Type:

int

file_batch_size

Number of files to process per batch.

Type:

int

api_batch_sizes

Mapping of provider names to their specific batch sizes.

Type:

Dict[str, int]

max_workers_api: int
max_workers_local: int | None
maxtasksperchild: int
embedding_batch_size_api: int
embedding_batch_size_local: int
file_batch_size: int
api_batch_sizes: Dict[str, int]
model_config = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
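The batch-size and worker settings describe a familiar pattern: split the inputs into batches, then hand the batches to a bounded pool of workers. The sketch below shows that pattern with the defaults embedding_batch_size_api=100 and max_workers_api=16. The fake_embed function is a hypothetical stand-in for a real API call, and using ThreadPoolExecutor is an assumption; the executor ForzaEmbed actually uses is not stated here.

```python
from concurrent.futures import ThreadPoolExecutor


def batched(items, batch_size):
    """Yield consecutive slices of items, each at most batch_size long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def fake_embed(batch):
    """Hypothetical stand-in for an embedding API call: one 1-D vector per text."""
    return [[float(len(text))] for text in batch]


def embed_all(texts, batch_size=100, max_workers=16):
    """Embed all texts batch by batch using a bounded thread pool."""
    batches = list(batched(texts, batch_size))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the same order as the input batches.
        results = pool.map(fake_embed, batches)
        return [vector for batch_result in results for vector in batch_result]


texts = ["a" * i for i in range(250)]
vectors = embed_all(texts)
```

The name maxtasksperchild matches the parameter of the same name on multiprocessing.Pool, which recycles a worker process after it has completed that many tasks. Whether ForzaEmbed passes the setting through unchanged is not stated here.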

class src.core.config.AppConfig(*, grid_search_params, models_to_test, output_dir='reports', generate_filtered_markdowns=False, database, multiprocessing)

Bases: BaseModel

Main application configuration.

grid_search_params

Configuration for grid search parameters.

Type:

src.core.config.GridSearchParams

models_to_test

List of model configurations to evaluate.

Type:

List[src.core.config.ModelConfig]

output_dir

Directory path for output files.

Type:

str

generate_filtered_markdowns

Whether to generate filtered markdown files.

Type:

bool

database

Database-related settings.

Type:

src.core.config.DatabaseSettings

multiprocessing

Multiprocessing-related settings.

Type:

src.core.config.MultiprocessingSettings

grid_search_params: GridSearchParams
models_to_test: List[ModelConfig]
output_dir: str
generate_filtered_markdowns: bool
database: DatabaseSettings
multiprocessing: MultiprocessingSettings
model_config = {}

Pydantic model configuration. Should be a dictionary conforming to pydantic.config.ConfigDict.

src.core.config.load_config(config_path)

Load and validate a YAML configuration file.

Parameters:

config_path (str) – Path to the YAML configuration file.

Returns:

A validated AppConfig instance.

Raises:
  • FileNotFoundError – If the configuration file does not exist.

  • yaml.YAMLError – If the YAML file is malformed.

  • pydantic.ValidationError – If the configuration fails validation.

Return type:

AppConfig
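The class signatures above show which fields have no default and therefore must be present for validation to succeed. The sketch below writes out a minimal configuration as a Python dictionary, in the shape the parsed YAML would presumably take, and checks it for those required keys. It assumes the YAML keys mirror the model field names. The model name, strategy name, and metric name are placeholders, and missing_required is a hypothetical helper, not part of ForzaEmbed.

```python
# Minimal configuration mirroring the fields that have no default value.
raw_config = {
    "grid_search_params": {
        "chunk_size": [256, 512],
        "chunk_overlap": [0, 64],
        "chunking_strategy": ["fixed"],      # placeholder strategy name
        "similarity_metrics": ["cosine"],    # placeholder metric name
        "themes": {"energy": ["solar", "wind"]},
    },
    "models_to_test": [
        {"type": "fastembed", "name": "example-model", "dimensions": 384},
    ],
    "database": {"intelligent_quantization": True},
    "multiprocessing": {},  # every MultiprocessingSettings field has a default
}

REQUIRED_TOP = ("grid_search_params", "models_to_test", "database", "multiprocessing")
REQUIRED_GRID = ("chunk_size", "chunk_overlap", "chunking_strategy",
                 "similarity_metrics", "themes")
REQUIRED_MODEL = ("type", "name", "dimensions")


def missing_required(raw):
    """Return dotted paths of required keys that are absent from raw."""
    missing = [key for key in REQUIRED_TOP if key not in raw]
    if "grid_search_params" in raw:
        grid = raw["grid_search_params"]
        missing += [f"grid_search_params.{k}" for k in REQUIRED_GRID if k not in grid]
    for index, model in enumerate(raw.get("models_to_test", [])):
        missing += [f"models_to_test[{index}].{k}" for k in REQUIRED_MODEL if k not in model]
    if "database" in raw and "intelligent_quantization" not in raw["database"]:
        missing.append("database.intelligent_quantization")
    return missing


problems = missing_required(raw_config)
```

A check like this only covers key presence. The Pydantic models additionally validate types, which is why load_config can raise pydantic.ValidationError even when every key is present.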