Utilities
Utility functions and classes for data processing and storage.
Database
Database management module for ForzaEmbed.
This module provides the EmbeddingDatabase class for managing all database operations including storing embeddings, results, and metadata. It implements intelligent quantization for efficient storage and caching mechanisms for improved performance.
Example
Basic database usage:
    from src.utils.database import EmbeddingDatabase

    # config may be a dict or an AppConfig instance
    db = EmbeddingDatabase("results.db", config)
    db.save_embeddings_batch("model_name", embeddings_dict)
    cached = db.get_embeddings_by_hashes("model_name", ["hash1", "hash2"])
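The cache methods take text hashes rather than raw text. The following is a minimal sketch of a hash-then-lookup pattern built around those two method names. The SHA-256 hashing, the `text_hash` helper, the `DictCache` in-memory stand-in, `embed_with_cache`, and the dict-shaped return value are all assumptions for illustration; the actual hashing scheme and return types of EmbeddingDatabase are not documented here.

```python
import hashlib


def text_hash(text: str) -> str:
    """Hypothetical helper: hash a chunk so it can serve as a cache key."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class DictCache:
    """In-memory stand-in exposing the two cache method names (assumed shapes)."""

    def __init__(self) -> None:
        self._store: dict[tuple[str, str], list[float]] = {}

    def save_embeddings_batch(self, base_model_name, embeddings):
        # embeddings is assumed to map text hash -> vector
        for h, vector in embeddings.items():
            self._store[(base_model_name, h)] = vector

    def get_embeddings_by_hashes(self, base_model_name, text_hashes):
        # Return only the hashes that are already cached for this model
        return {
            h: self._store[(base_model_name, h)]
            for h in text_hashes
            if (base_model_name, h) in self._store
        }


def embed_with_cache(cache, model_name, texts, embed_fn):
    """Embed texts, calling embed_fn only for texts missing from the cache."""
    hashes = [text_hash(t) for t in texts]
    cached = cache.get_embeddings_by_hashes(model_name, hashes)
    missing = {h: t for h, t in zip(hashes, texts) if h not in cached}
    if missing:
        fresh = {h: embed_fn(t) for h, t in missing.items()}
        cache.save_embeddings_batch(model_name, fresh)
        cached.update(fresh)
    return [cached[h] for h in hashes]
```

Keying the store on both the model name and the hash keeps vectors from different models apart even when the same text is embedded twice.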
- class src.utils.database.EmbeddingDatabase(db_path, config)
Bases: object
Manage SQLite database for embeddings, results, and metadata.
This class handles all database operations for ForzaEmbed, including storing and retrieving embeddings, processing results, and various metadata. Implements intelligent quantization to reduce storage size.
- db_path
Path to the SQLite database file.
- config
Application configuration (dict or AppConfig).
- quantization_enabled
Whether intelligent quantization is enabled.
- engine
SQLAlchemy database engine.
- Session
SQLAlchemy session factory.
- add_model(name, base_model_name, model_type, chunk_size, chunk_overlap, theme_name, chunking_strategy, similarity_metric)
Add a model run to the database.
- Parameters:
name (str) – Unique run name identifier.
base_model_name (str) – The underlying model name.
model_type (str) – Type of model (api, fastembed, etc.).
chunk_size (int) – Chunk size used.
chunk_overlap (int) – Chunk overlap used.
theme_name (str) – Theme set name.
chunking_strategy (str) – Chunking strategy used.
similarity_metric (str) – Similarity metric used.
- add_generated_file(model_name, file_type, file_path)
Add a generated file record to the database.
- save_processing_result(model_name, file_id, results)
Save detailed processing result for a file and model.
- save_processing_results_batch(results_batch)
Save a batch of processing results in a single transaction.
- get_all_processing_results()
Retrieve all processing results organized by model run name.
Fetches raw file-level results without model-level aggregation.
- get_processed_files_with_similarities(run_name)
Retrieve files that have been processed with similarity scores.
- get_embeddings_by_hashes(base_model_name, text_hashes)
Retrieve embeddings from cache by model and text hashes.
- save_embeddings_batch(base_model_name, embeddings)
Save a batch of embeddings to the cache.
Applies intelligent quantization to reduce storage size when enabled.
- save_tsne_coordinates(tsne_key, file_id, coordinates)
Save t-SNE coordinates for a given configuration.
- get_tsne_coordinates(tsne_key, file_id)
Retrieve t-SNE coordinates for a given configuration.
- get_all_processing_results_for_run(model_name)
Get all processing results for a specific run with dequantization.
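The class quantizes embeddings on save and dequantizes results on retrieval, but the scheme itself is not described here. As an illustration only, the sketch below shows symmetric 8-bit quantization using the standard library; `quantize_int8` and `dequantize_int8` are hypothetical names, and ForzaEmbed's "intelligent quantization" may work differently.

```python
from array import array


def quantize_int8(vector: list[float]) -> tuple[bytes, float]:
    """Map floats to signed 8-bit codes plus one scale factor (assumed scheme)."""
    peak = max((abs(v) for v in vector), default=0.0)
    scale = peak / 127 if peak > 0 else 1.0
    # Clamp after rounding so floating-point noise cannot overflow the int8 range
    codes = array("b", (max(-127, min(127, round(v / scale))) for v in vector))
    return codes.tobytes(), scale


def dequantize_int8(blob: bytes, scale: float) -> list[float]:
    """Recover approximate floats from the stored codes and scale."""
    codes = array("b")
    codes.frombytes(blob)
    return [c * scale for c in codes]
```

Each value costs one byte instead of four or eight, and the reconstruction error per component is bounded by half the scale factor.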
Data Loader
Data loading utilities for ForzaEmbed.
This module provides functions for loading markdown content from various sources including directories and lists of strings. It handles file I/O and content extraction.
Example
Load markdown files from a directory:
    from src.utils.data_loader import load_markdown_files

    files = load_markdown_files("markdowns/")
    for name, content in files:
        print(f"Loaded: {name}")
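The usage above iterates over (name, content) pairs. A minimal sketch of a directory loader that yields pairs of that shape follows; `load_markdown_dir` is a hypothetical stand-in, and the `*.md` filter, UTF-8 decoding, and sorted order are assumptions rather than documented behavior of load_markdown_files.

```python
from pathlib import Path


def load_markdown_dir(directory: str) -> list[tuple[str, str]]:
    """Hypothetical stand-in: return (file name, content) pairs for *.md files."""
    root = Path(directory)
    if not root.is_dir():
        raise FileNotFoundError(f"Not a directory: {directory}")
    # Sort for a deterministic order across platforms
    return [
        (path.name, path.read_text(encoding="utf-8"))
        for path in sorted(root.glob("*.md"))
    ]
```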
Text Processing
Text processing utilities for ForzaEmbed.
This module provides utility functions for text chunking, pattern matching, and context extraction. It supports multiple chunking strategies including langchain, semchunk, nltk, spacy, and raw character-based chunking.
Example
Chunk text using different strategies:
    from src.utils.utils import chunk_text

    chunks = chunk_text(text, chunk_size=500, chunk_overlap=50, strategy="langchain")
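To make the roles of chunk_size and chunk_overlap concrete, here is a minimal sketch of character-based chunking with a sliding window. `raw_chunks` is a hypothetical function; how the 'raw' strategy of chunk_text handles boundaries and trailing text is an assumption.

```python
def raw_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap (assumed behavior)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks
```

With a 10-character text, a chunk size of 4, and an overlap of 1, the window advances 3 characters at a time and each chunk repeats the last character of the previous one.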
- src.utils.utils.get_spacy_model(language)
Load and cache a spaCy model for a given language.
Downloads the model if it’s not available locally.
- Parameters:
language (str) – Language code (‘fr’ for French, ‘en’ for English).
- Returns:
Loaded spaCy Language model.
- Raises:
ValueError – If the language is not supported.
- Return type:
Language
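The load, cache, and download-on-miss behavior described above could be sketched as follows. `load_spacy_model` and the `SPACY_MODELS` mapping are hypothetical; the pipeline names are published spaCy packages, but which ones ForzaEmbed actually loads is not stated. The sketch needs spaCy installed to load a model; unsupported codes are rejected before spaCy is imported.

```python
from functools import lru_cache

# Assumed language-to-pipeline mapping, for illustration only
SPACY_MODELS = {"fr": "fr_core_news_sm", "en": "en_core_web_sm"}


@lru_cache(maxsize=None)
def load_spacy_model(language: str):
    """Load a spaCy pipeline once per language, downloading it on first miss."""
    if language not in SPACY_MODELS:
        raise ValueError(f"Unsupported language: {language!r}")
    import spacy  # imported lazily so unsupported codes fail fast

    name = SPACY_MODELS[language]
    try:
        return spacy.load(name)
    except OSError:
        # spacy.load raises OSError when the package is not installed
        from spacy.cli import download

        download(name)
        return spacy.load(name)
```

`lru_cache` keeps one loaded pipeline per language code, so repeated calls do not reload the model from disk.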
- src.utils.utils.chunk_text(text, chunk_size, chunk_overlap, strategy='langchain', language='fr')
Split text into segments using a specified strategy.
Supports multiple chunking strategies with different characteristics. Some strategies (nltk, spacy) ignore chunk_size and chunk_overlap.
- Parameters:
text (str) – Text to split.
chunk_size (int) – Size of chunks in characters (ignored by nltk, spacy).
chunk_overlap (int) – Overlap between chunks (ignored by nltk, spacy).
strategy (str) – Chunking strategy to use. One of: ‘langchain’, ‘semchunk’, ‘nltk’, ‘spacy’, ‘raw’.
language (str) – Language of the text (‘fr’ or ‘en’).
- Returns:
List of extracted text segments.
- Raises:
ValueError – If an unknown chunking strategy is specified.
- Return type:
list[str]
- src.utils.utils.contains_horaire_pattern(text, keywords)
Check if text contains patterns related to opening hours.
Uses regex patterns to detect time-related content including days, times, and action keywords.
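The exact patterns used by contains_horaire_pattern are not listed here. As a rough sketch of the idea, the function below combines a time regex, a French day-name regex, and a keyword check; `looks_like_opening_hours`, both regexes, and the rule for combining them are assumptions for illustration.

```python
import re

# Matches times such as "9h", "9h30", or "17:30" (assumed pattern)
TIME_RE = re.compile(r"\b\d{1,2}\s?[h:]\s?\d{0,2}\b", re.IGNORECASE)
# Matches French day names (assumed pattern)
DAY_RE = re.compile(
    r"\b(?:lundi|mardi|mercredi|jeudi|vendredi|samedi|dimanche)\b",
    re.IGNORECASE,
)


def looks_like_opening_hours(text: str, keywords: list[str]) -> bool:
    """Return True when a time appears alongside a day name or a keyword."""
    has_time = TIME_RE.search(text) is not None
    has_day = DAY_RE.search(text) is not None
    lowered = text.lower()
    has_keyword = any(keyword.lower() in lowered for keyword in keywords)
    return has_time and (has_day or has_keyword)
```

Requiring a time plus at least one supporting signal avoids flagging text that merely contains a number followed by a colon.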
Database Models
SQLAlchemy ORM models for the ForzaEmbed database.
This module defines all database models used for storing embedding results, metrics, and metadata. Uses SQLAlchemy 2.0 declarative mapping with type annotations.
- Models:
Model: Stores model run configurations.
EvaluationMetric: Stores evaluation metrics for each model run.
GeneratedFile: Tracks generated output files.
GlobalChart: Stores paths to global chart images.
ProcessingResult: Stores detailed processing results per file.
EmbeddingCache: Caches computed embeddings for reuse.
TSNECoordinate: Caches t-SNE coordinate calculations.
- class src.utils.models.Base(**kwargs)
Bases: DeclarativeBase
Base class for all SQLAlchemy ORM models.
- __init__(**kwargs)
A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in kwargs.
Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
- class src.utils.models.Model(**kwargs)
Bases: Base
Stores model run configuration and metadata.
- type
Model type (api, fastembed, sentence_transformers, etc.).
- Type:
sqlalchemy.orm.base.Mapped[str]
- created_at
Timestamp of creation.
- Type:
sqlalchemy.orm.base.Mapped[datetime.datetime]
- metrics
Related evaluation metrics.
- Type:
sqlalchemy.orm.base.Mapped[src.utils.models.EvaluationMetric]
- generated_files
Related generated files.
- Type:
sqlalchemy.orm.base.Mapped[list[src.utils.models.GeneratedFile]]
- class src.utils.models.EvaluationMetric(**kwargs)
Bases: Base
Stores evaluation metrics for a model run.
- intra_cluster_distance_normalized
Normalized intra-cluster distance.
- Type:
sqlalchemy.orm.base.Mapped[float | None]
- inter_cluster_distance_normalized
Normalized inter-cluster distance.
- Type:
sqlalchemy.orm.base.Mapped[float | None]
- embedding_computation_time
Time taken to compute embeddings.
- Type:
sqlalchemy.orm.base.Mapped[float | None]
- created_at
Timestamp of creation.
- Type:
sqlalchemy.orm.base.Mapped[datetime.datetime]
- model
Related model instance.
- Type:
sqlalchemy.orm.base.Mapped[src.utils.models.Model]
- class src.utils.models.GeneratedFile(**kwargs)
Bases: Base
Tracks generated output files for a model run.
- created_at
Timestamp of creation.
- Type:
sqlalchemy.orm.base.Mapped[datetime.datetime]
- model
Related model instance.
- Type:
sqlalchemy.orm.base.Mapped[src.utils.models.Model]
- class src.utils.models.GlobalChart(**kwargs)
Bases: Base
Stores paths to global chart images.
- created_at
Timestamp of creation.
- Type:
sqlalchemy.orm.base.Mapped[datetime.datetime]
- class src.utils.models.ProcessingResult(**kwargs)
Bases: Base
Stores detailed processing results for each file.
- created_at
Timestamp of creation.
- Type:
sqlalchemy.orm.base.Mapped[datetime.datetime]
- class src.utils.models.EmbeddingCache(**kwargs)
Bases: Base
Caches computed embeddings for reuse.
- text_hash
Hash of the embedded text (part of composite primary key).
- Type:
sqlalchemy.orm.base.Mapped[str]
- created_at
Timestamp of creation.
- Type:
sqlalchemy.orm.base.Mapped[datetime.datetime]
- class src.utils.models.TSNECoordinate(**kwargs)
Bases: Base
Caches t-SNE coordinate calculations.
- created_at
Timestamp of creation.
- Type:
sqlalchemy.orm.base.Mapped[datetime.datetime]