Utilities

Utility functions and classes for data processing and storage.

Database

Database management module for ForzaEmbed.

This module provides the EmbeddingDatabase class for managing all database operations including storing embeddings, results, and metadata. It implements intelligent quantization for efficient storage and caching mechanisms for improved performance.

Example

Basic database usage:

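from src.utils.database import EmbeddingDatabase

# config: an AppConfig instance or a plain dict of application settings
db = EmbeddingDatabase("results.db", config)
# embeddings_dict: maps text hashes to NumPy embedding arrays
db.save_embeddings_batch("model_name", embeddings_dict)
cached = db.get_embeddings_by_hashes("model_name", ["hash1", "hash2"])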

class src.utils.database.EmbeddingDatabase(db_path, config)

Bases: object

Manage SQLite database for embeddings, results, and metadata.

This class handles all database operations for ForzaEmbed, including storing and retrieving embeddings, processing results, and various metadata. It implements intelligent quantization to reduce storage size.

db_path: Path to the SQLite database file.

config: Application configuration (dict or AppConfig).

quantization_enabled: Whether intelligent quantization is enabled.

engine: SQLAlchemy database engine.

Session: SQLAlchemy session factory.

__init__(db_path, config)

Initialize the EmbeddingDatabase.

Parameters:

db_path (str) – Path to the SQLite database file.
config (AppConfig | Dict[str, Any]) – Application configuration, either as AppConfig or dict.

add_model(name, base_model_name, model_type, chunk_size, chunk_overlap, theme_name, chunking_strategy, similarity_metric)

Add a model run to the database.

Parameters:

name (str) – Unique run name identifier.
base_model_name (str) – The underlying model name.
model_type (str) – Type of model (api, fastembed, etc.).
chunk_size (int) – Chunk size used.
chunk_overlap (int) – Chunk overlap used.
theme_name (str) – Theme set name.
chunking_strategy (str) – Chunking strategy used.
similarity_metric (str) – Similarity metric used.

add_generated_file(model_name, file_type, file_path)

Add a generated file record to the database.

Parameters:

model_name (str) – The model run name.
file_type (str) – Type of the generated file.
file_path (str) – Path to the generated file.

add_global_chart(chart_type, file_path)

Add or update a global chart record.

Parameters:

chart_type (str) – Type identifier for the chart.
file_path (str) – Path to the chart image file.

model_exists(name)

Check if a model with the specified run name exists.

Parameters: name (str) – The run name to check.
Returns: True if the model exists, False otherwise.
Return type: bool
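
A typical use of model_exists is to guard add_model so that a run is registered only once. The sketch below illustrates that pattern against a hypothetical in-memory stand-in (FakeDatabase), since the real class needs a database file and configuration. Only the method names and the add_model parameter names come from this reference; the stand-in accepts the run settings as keyword arguments for brevity.

```python
class FakeDatabase:
    """Hypothetical stand-in exposing model_exists/add_model, for illustration only."""

    def __init__(self):
        self._models = {}

    def model_exists(self, name):
        return name in self._models

    def add_model(self, name, **run_config):
        self._models[name] = run_config


def register_run_once(db, name, **run_config):
    """Register the run only if its name is unknown; return True if it was added."""
    if db.model_exists(name):
        return False
    db.add_model(name, **run_config)
    return True


run_config = dict(
    base_model_name="some-model",  # placeholder values, not real defaults
    model_type="api",
    chunk_size=500,
    chunk_overlap=50,
    theme_name="default",
    chunking_strategy="raw",
    similarity_metric="cosine",
)

db = FakeDatabase()
first = register_run_once(db, "run_a", **run_config)
second = register_run_once(db, "run_a", **run_config)
```

Here the second call is a no-op because the run name is already present.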

save_processing_result(model_name, file_id, results)

Save detailed processing result for a file and model.

Parameters:

model_name (str) – The model run name.
file_id (str) – Identifier for the processed file.
results (Dict[str, Any]) – Dictionary of processing results.

save_processing_results_batch(results_batch)

Save a batch of processing results in a single transaction.

Parameters: results_batch (List[Tuple[str, str, Dict[str, Any]]]) – List of (model_name, file_id, results) tuples.
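
To make the expected shape of results_batch concrete, here is a small sketch that assembles the list of (model_name, file_id, results) tuples from per-file results. The helper name build_results_batch and the "score" key are hypothetical; the contents of each results dictionary are not specified in this reference.

```python
from typing import Any, Dict, List, Tuple

ResultsBatch = List[Tuple[str, str, Dict[str, Any]]]


def build_results_batch(
    model_name: str, per_file_results: Dict[str, Dict[str, Any]]
) -> ResultsBatch:
    """Turn {file_id: results} into the list of tuples the batch method expects."""
    return [
        (model_name, file_id, results)
        for file_id, results in per_file_results.items()
    ]


batch = build_results_batch(
    "run_a",
    {"doc1.md": {"score": 0.8}, "doc2.md": {"score": 0.5}},
)
# The list could then be passed on as: db.save_processing_results_batch(batch)
```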

get_processed_files(model_name)

Retrieve file IDs that have been processed for a model.

Parameters: model_name (str) – The model run name.
Returns: List of file IDs that have been processed.
Return type: List[str]
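
One plausible use of this list is resuming an interrupted run by skipping files that are already stored. The sketch below shows that filtering step on plain lists; the helper name files_still_to_process is hypothetical, and the processed IDs would come from get_processed_files in practice.

```python
def files_still_to_process(all_file_ids, processed_file_ids):
    """Return the file IDs not yet processed, preserving the input order."""
    processed = set(processed_file_ids)  # set gives O(1) membership checks
    return [file_id for file_id in all_file_ids if file_id not in processed]


pending = files_still_to_process(["a.md", "b.md", "c.md"], ["b.md"])
```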

get_model_info(run_name)

Retrieve information about a model by its run name.

Parameters: run_name (str) – The unique run name identifier.
Returns: Dictionary with model information, or None if not found.
Return type: Dict[str, Any] | None

get_all_processing_results()

Retrieve all processing results organized by model run name.

Fetches raw file-level results without model-level aggregation.

Returns: Dictionary mapping model names to their file results.
Return type: Dict[str, Any]

get_all_models()

get_model_files(model_name)

Retrieve all generated files for a model.

Parameters: model_name (str) – The model run name.
Returns: List of dictionaries with file type and path.
Return type: List[Dict[str, str]]

get_global_charts()

Retrieve all global charts.

Returns: List of dictionaries with chart type and path.
Return type: List[Dict[str, str]]

vacuum_database()

Vacuum the database to reclaim space.
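
For readers unfamiliar with the operation, the sketch below shows what vacuuming an SQLite file does, using only the standard sqlite3 module. It is not the class's implementation (which goes through SQLAlchemy); the helper name vacuum_sqlite and the demo table are hypothetical.

```python
import os
import sqlite3
import tempfile


def vacuum_sqlite(path):
    """Rebuild the database file so that pages freed by deletes are released."""
    # isolation_level=None selects autocommit mode: VACUUM cannot run inside
    # an open transaction.
    conn = sqlite3.connect(path, isolation_level=None)
    try:
        conn.execute("VACUUM")
    finally:
        conn.close()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.db")
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE t (payload BLOB)")
    conn.executemany(
        "INSERT INTO t VALUES (?)", [(b"x" * 4096,) for _ in range(200)]
    )
    conn.commit()
    conn.execute("DELETE FROM t")  # frees pages, but the file keeps its size
    conn.commit()
    conn.close()

    size_before = os.path.getsize(path)
    vacuum_sqlite(path)
    size_after = os.path.getsize(path)
```

Deleting rows alone leaves the file at its previous size; the space is returned to the filesystem only after the vacuum.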

get_all_run_names()

Retrieve all existing run names.

Returns: List of unique run name identifiers.
Return type: list[str]

get_processed_files_with_similarities(run_name)

Retrieve files that have been processed with similarity scores.

Parameters: run_name (str) – The model run name.
Returns: List of file IDs that have similarity data.
Return type: list[str]

get_embeddings_by_hashes(base_model_name, text_hashes)

Retrieve embeddings from cache by model and text hashes.

Parameters:

base_model_name (str) – The base model name used for embeddings.
text_hashes (List[str]) – List of text hash values to retrieve.

Returns:

Dictionary mapping text hashes to embedding arrays.

Return type:

Dict[str, ndarray]
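
A cache lookup like this is usually paired with a step that separates cache hits from texts that still need to be embedded. The sketch below shows that step on plain dictionaries. The hashing scheme (SHA-256 of the UTF-8 text) and both helper names are assumptions; this reference does not say how text hashes are computed.

```python
import hashlib


def text_hash(text):
    """Assumed hashing scheme: SHA-256 hex digest of the UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def split_cached_and_missing(texts, cached):
    """Partition texts into cache hits and texts that still need embedding.

    `cached` maps text hash -> embedding, shaped like the return value of
    get_embeddings_by_hashes.
    """
    hits, missing = {}, []
    for text in texts:
        key = text_hash(text)
        if key in cached:
            hits[text] = cached[key]
        else:
            missing.append(text)
    return hits, missing


# Plain lists stand in for embedding arrays here.
cached = {text_hash("hello"): [0.1, 0.2]}
hits, missing = split_cached_and_missing(["hello", "world"], cached)
```

Only the texts in missing would then be sent to the embedding model and saved back with save_embeddings_batch.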

save_embeddings_batch(base_model_name, embeddings)

Save a batch of embeddings to the cache.

Applies intelligent quantization to reduce storage size when enabled.

Parameters:

base_model_name (str) – The base model name for the embeddings.
embeddings (Dict[str, ndarray]) – Dictionary mapping text hashes to embedding arrays.
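
This reference does not describe the quantization scheme, so the following is only a generic illustration of the idea: symmetric 8-bit linear quantization, which stores one byte per value plus a scale factor. The function names are hypothetical, and the real implementation may differ (for example by operating on NumPy arrays or choosing a precision per vector).

```python
from array import array


def quantize_int8(vector):
    """Map floats to signed 8-bit codes plus a scale factor."""
    max_abs = max((abs(v) for v in vector), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = array("b", (round(v / scale) for v in vector))
    return codes.tobytes(), scale


def dequantize_int8(blob, scale):
    """Recover approximate floats from the stored bytes and scale."""
    codes = array("b")
    codes.frombytes(blob)
    return [code * scale for code in codes]


vector = [0.5, -1.0, 0.25, 0.0]
blob, scale = quantize_int8(vector)
restored = dequantize_int8(blob, scale)
```

Each value costs one byte instead of four or eight, at the price of a rounding error of roughly half a quantization step.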

save_projection_coordinates(projection_key, file_id, coordinates)

Save projection coordinates (for example, t-SNE) for a given configuration.

Parameters:

projection_key (str) – Unique key for the projection configuration.
file_id (str) – Identifier for the file.
coordinates (Dict[str, List[float]]) – Dictionary with ‘x’ and ‘y’ coordinate lists.

get_projection_coordinates(projection_key, file_id)

Retrieve cached projection coordinates (for example, t-SNE) for a given configuration.

Parameters:

projection_key (str) – Unique key for the projection configuration.
file_id (str) – Identifier for the file.
file_id (str) – Identifier for the file.

Returns:

Dictionary with ‘x’ and ‘y’ coordinate lists, or None if not found.

Return type:

Dict[str, List[float]] | None
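
The coordinate dictionary holds two parallel lists. The sketch below converts a list of 2D points into that shape and builds a cache key from a run name and projection parameters. Both helper names and the key format are hypothetical; this reference only requires that the key be unique per configuration.

```python
def to_coordinate_dict(points):
    """Convert [(x, y), ...] into {'x': [...], 'y': [...]}."""
    return {
        "x": [float(x) for x, _ in points],
        "y": [float(y) for _, y in points],
    }


def make_projection_key(run_name, method, **params):
    """Hypothetical key format: run name, method, then sorted parameters."""
    suffix = ",".join(f"{name}={params[name]}" for name in sorted(params))
    return f"{run_name}|{method}|{suffix}"


coords = to_coordinate_dict([(0.0, 1.0), (2.5, -3.0)])
key = make_projection_key("run_a", "tsne", perplexity=30, random_state=42)
```

Sorting the parameter names keeps the key stable regardless of the order in which keyword arguments are passed.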

clear_tsne_cache()

Clear all cached t-SNE coordinates.

get_run_details(run_name)

Retrieve detailed information for a specific run.

Parameters: run_name (str) – The unique run name identifier.
Returns: Dictionary with full run details, or None if not found.
Return type: Dict[str, Any] | None

get_all_processing_results_for_run(model_name)

Get all processing results for a specific run with dequantization.

Parameters: model_name (str) – The model run name.
Returns: Dictionary mapping file IDs to their processing results.
Return type: Dict[str, Dict[str, Any]]

update_metrics_for_file(model_name, file_id, metrics)

Update metrics for a specific file in a model run.

Parameters:

model_name (str) – The model run name.
file_id (str) – Identifier for the file.
metrics (Dict[str, Any]) – Dictionary of metric values to update.

get_db_modification_time()

Get the last modification time of the database file.

Returns: Unix timestamp of the last modification.
Return type: float
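
A modification timestamp is handy for invalidating derived data when the database file changes. The sketch below shows that pattern with os.path.getmtime, which also returns a float Unix timestamp; whether the method uses that call internally is an assumption. The MtimeCache class and the loader are hypothetical.

```python
import os
import tempfile


class MtimeCache:
    """Recompute a value only when the watched file's modification time changes."""

    def __init__(self, path, loader):
        self.path = path
        self.loader = loader
        self._mtime = None
        self._value = None

    def get(self):
        mtime = os.path.getmtime(self.path)
        if mtime != self._mtime:
            self._value = self.loader(self.path)
            self._mtime = mtime
        return self._value


calls = []


def loader(path):
    """Stand-in for an expensive reload; records each invocation."""
    calls.append(path)
    return os.path.getsize(path)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "results.db")
    with open(path, "wb") as handle:
        handle.write(b"data")

    cache = MtimeCache(path, loader)
    cache.get()
    cache.get()  # served from the cache, the loader is not called again
    calls_after_two_reads = len(calls)

    stamp = os.path.getmtime(path) + 10
    os.utime(path, (stamp, stamp))  # simulate a later write to the file
    cache.get()
    calls_after_touch = len(calls)
```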

Data Loader

Data loading utilities for ForzaEmbed.

This module provides functions for loading markdown content from various sources including directories and lists of strings. It handles file I/O and content extraction.

Example

Load markdown files from a directory:

from src.utils.data_loader import load_markdown_files

files = load_markdown_files("markdowns/")
for name, content in files:
    print(f"Loaded: {name}")

src.utils.data_loader.load_markdown_files(data_source)

Load markdown content from various sources.

Supports loading from:

1. A directory path (str or Path) to load all .md files.
2. A list of strings where each string is markdown content.

Parameters: data_source (str | Path | List[str]) – The source of markdown data. Can be a directory path or a list of markdown content strings.
Returns: List of tuples containing (name, content) pairs.
Raises: TypeError – If data_source is not a supported type.
Return type: List[Tuple[str, str]]
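
The dispatch on the input type can be sketched as follows. This is an illustrative re-implementation under stated assumptions, not the module's code: the function name load_markdown_sketch, the doc_N naming of in-memory documents, the sorted file order, and the use of the file name (with extension) as the name are all hypothetical.

```python
import tempfile
from pathlib import Path


def load_markdown_sketch(data_source):
    """Return (name, content) pairs from a directory or a list of strings."""
    if isinstance(data_source, (str, Path)):
        directory = Path(data_source)
        # Sorted for a deterministic order; the real ordering is not documented.
        return [
            (path.name, path.read_text(encoding="utf-8"))
            for path in sorted(directory.glob("*.md"))
        ]
    if isinstance(data_source, list):
        # Hypothetical naming scheme for in-memory documents.
        return [(f"doc_{i}", content) for i, content in enumerate(data_source)]
    raise TypeError(f"Unsupported data source type: {type(data_source).__name__}")


from_list = load_markdown_sketch(["# A", "# B"])

with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.md").write_text("# Title", encoding="utf-8")
    (Path(tmp) / "notes.txt").write_text("ignored", encoding="utf-8")
    from_dir = load_markdown_sketch(tmp)
```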

Text Processing

Text processing utilities for ForzaEmbed.

This module provides utility functions for text chunking, pattern matching, and context extraction. It supports multiple chunking strategies including langchain, semchunk, nltk, spacy, and raw character-based chunking.

Example

Chunk text using different strategies:

from src.utils.utils import chunk_text

text = "Premier paragraphe. Deuxième paragraphe."  # any input string
chunks = chunk_text(text, chunk_size=500, chunk_overlap=50, strategy="langchain")

src.utils.utils.get_spacy_model(language)

Load and cache a spaCy model for a given language.

Downloads the model if it’s not available locally.

Parameters: language (str) – Language code (‘fr’ for French, ‘en’ for English).
Returns: Loaded spaCy Language model.
Raises: ValueError – If the language is not supported.
Return type: Language

src.utils.utils.chunk_text(text, chunk_size, chunk_overlap, strategy='langchain', language='fr')

Split text into segments using a specified strategy.

Supports multiple chunking strategies with different characteristics. Some strategies (nltk, spacy) ignore chunk_size and chunk_overlap.

Parameters:

text (str) – Text to split.
chunk_size (int) – Size of chunks in characters (ignored by nltk, spacy).
chunk_overlap (int) – Overlap between chunks (ignored by nltk, spacy).
strategy (str) – Chunking strategy to use. One of: ‘langchain’, ‘semchunk’, ‘nltk’, ‘spacy’, ‘raw’.
language (str) – Language of the text (‘fr’ or ‘en’).

Returns:

List of extracted text segments.

Raises:

ValueError – If an unknown chunking strategy is specified.

Return type:

List[str]
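
To show how chunk_size and chunk_overlap interact, here is a minimal sketch of character-based chunking with overlap, in the spirit of the ‘raw’ strategy. The function name chunk_raw and its edge-case handling are assumptions; the actual strategy may treat boundaries differently.

```python
def chunk_raw(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


chunks = chunk_raw("abcdefghij", chunk_size=4, chunk_overlap=1)
```

With a size of 4 and an overlap of 1, each chunk starts 3 characters after the previous one, so the last character of one chunk is repeated as the first character of the next.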

src.utils.utils.contains_horaire_pattern(text, keywords)

Check if text contains patterns related to opening hours.

Uses regex patterns to detect time-related content including days, times, and action keywords.

Parameters:

text (str) – Text to analyze.
keywords (dict[str, list[str]]) – Dictionary with ‘jours’ (days) and ‘actions’ keyword lists.

Returns:

True if an opening hours pattern is found, False otherwise.

Return type:

bool
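
The exact regular expressions are not given in this reference. As a rough sketch of the idea, the function below reports a match when the text contains both a time expression (such as 9h, 12h30 or 14:00) and one of the day or action keywords. The function name, the time pattern, and the rule combining the two checks are all assumptions.

```python
import re

# Matches times such as "9h", "12h30" or "14:00" (hours 0-23).
TIME_RE = re.compile(
    r"\b(?:[01]?\d|2[0-3])\s?(?:h(?:\s?[0-5]\d)?|:[0-5]\d)\b",
    re.IGNORECASE,
)


def contains_opening_hours(text, keywords):
    """Return True if text has a time expression and a day or action keyword."""
    words = keywords.get("jours", []) + keywords.get("actions", [])
    if not words:
        return False
    keyword_re = re.compile(
        r"\b(?:" + "|".join(re.escape(word) for word in words) + r")\b",
        re.IGNORECASE,
    )
    return bool(TIME_RE.search(text)) and bool(keyword_re.search(text))


keywords = {"jours": ["lundi", "mardi"], "actions": ["ouvert", "fermé"]}
with_hours = contains_opening_hours("Ouvert le lundi de 9h à 12h30", keywords)
no_time = contains_opening_hours("Le musée est ouvert.", keywords)
no_keyword = contains_opening_hours("Rendez-vous à 14:00", keywords)
```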

src.utils.utils.extract_context_around_phrase(phrases, phrase_index)

Extract and highlight context around a target sentence.

Parameters:

phrases (list[str]) – List of sentences.
phrase_index (int) – Index of the target sentence.

Returns:

The target sentence wrapped in markdown bold formatting, or empty string if index is out of bounds.

Return type:

str
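
The documented return value can be reproduced with a few lines. This is a sketch of the described behavior only: the function name highlight_phrase is hypothetical, and treating negative indices as out of bounds is an assumption.

```python
def highlight_phrase(phrases, phrase_index):
    """Return the target sentence in markdown bold, or '' if the index is invalid."""
    if not 0 <= phrase_index < len(phrases):
        return ""
    return f"**{phrases[phrase_index]}**"


highlighted = highlight_phrase(["Bonjour.", "Ouvert de 9h à 12h."], 1)
out_of_bounds = highlight_phrase(["Bonjour."], 5)
```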

Database Models

SQLAlchemy ORM models for the ForzaEmbed database.

This module defines all database models used for storing embedding results, metrics, and metadata. Uses SQLAlchemy 2.0 declarative mapping with type annotations.

Models:

Model: Stores model run configurations.
ProcessingResult: Stores detailed processing results per file.
EmbeddingCache: Caches computed embeddings for reuse.
ProjectionCoordinate: Caches dimensionality reduction coordinates.

class src.utils.models.Base(**kwargs)

Bases: DeclarativeBase

Base class for all SQLAlchemy ORM models.

__init__(**kwargs)

A simple constructor that allows initialization from kwargs.

Sets attributes on the constructed instance using the names and values in kwargs.

Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.

metadata: ClassVar[MetaData] = MetaData(): Refers to the MetaData collection that will be used for new Table objects.

registry: ClassVar[registry] = <sqlalchemy.orm.decl_api.registry object>: Refers to the registry in use where new Mapper objects will be associated.

class src.utils.models.Model(**kwargs)

Bases: Base

Stores model run configuration and metadata.

id

Primary key.

Type: sqlalchemy.orm.base.Mapped[int]

name

Unique run name identifier.

Type: sqlalchemy.orm.base.Mapped[str]

base_model_name

The underlying model name.

Type: sqlalchemy.orm.base.Mapped[str]

type

Model type (api, fastembed, sentence_transformers, etc.).

Type: sqlalchemy.orm.base.Mapped[str]

chunk_size

Chunk size used in this run.

Type: sqlalchemy.orm.base.Mapped[int]

chunk_overlap

Chunk overlap used in this run.

Type: sqlalchemy.orm.base.Mapped[int]

theme_name

Theme set name used.

Type: sqlalchemy.orm.base.Mapped[str]

chunking_strategy

Chunking strategy used.

Type: sqlalchemy.orm.base.Mapped[str]

similarity_metric

Similarity metric used.

Type: sqlalchemy.orm.base.Mapped[str | None]

created_at

Timestamp of creation.

Type: sqlalchemy.orm.base.Mapped[datetime.datetime]

id: Mapped[int]

name: Mapped[str]

base_model_name: Mapped[str]

type: Mapped[str]

chunk_size: Mapped[int]

chunk_overlap: Mapped[int]

theme_name: Mapped[str]

chunking_strategy: Mapped[str]

similarity_metric: Mapped[str | None]

created_at: Mapped[datetime]

class src.utils.models.ProcessingResult(**kwargs)

Bases: Base

Stores detailed processing results for each file.

id

Primary key.

Type: sqlalchemy.orm.base.Mapped[int]

model_name

The model run name.

Type: sqlalchemy.orm.base.Mapped[str]

file_id

Identifier for the processed file.

Type: sqlalchemy.orm.base.Mapped[str]

results_blob

Serialized results data.

Type: sqlalchemy.orm.base.Mapped[bytes]

created_at

Timestamp of creation.

Type: sqlalchemy.orm.base.Mapped[datetime.datetime]

id: Mapped[int]

model_name: Mapped[str]

file_id: Mapped[str]

results_blob: Mapped[bytes]

created_at: Mapped[datetime]

class src.utils.models.EmbeddingCache(**kwargs)

Bases: Base

Caches computed embeddings for reuse.

model_name

The model name (part of composite primary key).

Type: sqlalchemy.orm.base.Mapped[str]

text_hash

Hash of the embedded text (part of composite primary key).

Type: sqlalchemy.orm.base.Mapped[str]

vector

Serialized embedding vector.

Type: sqlalchemy.orm.base.Mapped[bytes]

dimension

Dimension of the embedding vector.

Type: sqlalchemy.orm.base.Mapped[int]

created_at

Timestamp of creation.

Type: sqlalchemy.orm.base.Mapped[datetime.datetime]

model_name: Mapped[str]

text_hash: Mapped[str]

vector: Mapped[bytes]

dimension: Mapped[int]

created_at: Mapped[datetime]
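
The effect of the composite primary key (model_name, text_hash) can be seen in a small standalone sketch using the standard sqlite3 module rather than the ORM. The table name embedding_cache and the INSERT OR REPLACE write strategy are assumptions, and created_at is omitted for brevity; only the column names come from the attributes above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE embedding_cache (
        model_name TEXT NOT NULL,
        text_hash  TEXT NOT NULL,
        vector     BLOB NOT NULL,
        dimension  INTEGER NOT NULL,
        PRIMARY KEY (model_name, text_hash)
    )
    """
)

rows = [
    ("model-a", "h1", b"\x00\x01", 2),
    ("model-b", "h1", b"\x02\x03", 2),  # same hash, different model: new row
    ("model-a", "h1", b"\x04\x05", 2),  # same key as the first row: replaces it
]
conn.executemany("INSERT OR REPLACE INTO embedding_cache VALUES (?, ?, ?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM embedding_cache").fetchone()[0]
latest = conn.execute(
    "SELECT vector FROM embedding_cache WHERE model_name = ? AND text_hash = ?",
    ("model-a", "h1"),
).fetchone()[0]
conn.close()
```

The same text hash can be cached once per model, and writing an existing (model_name, text_hash) pair overwrites the earlier vector instead of adding a duplicate.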

class src.utils.models.ProjectionCoordinate(**kwargs)

Bases: Base

Caches dimensionality reduction coordinates (t-SNE, UMAP, PCA).

id

Primary key.

Type: sqlalchemy.orm.base.Mapped[int]

projection_key

Unique key for the projection configuration.

Type: sqlalchemy.orm.base.Mapped[str]

file_id

Identifier for the file.

Type: sqlalchemy.orm.base.Mapped[str]

coordinates

Serialized coordinate data.

Type: sqlalchemy.orm.base.Mapped[bytes]

created_at

Timestamp of creation.

Type: sqlalchemy.orm.base.Mapped[datetime.datetime]

id: Mapped[int]

projection_key: Mapped[str]

file_id: Mapped[str]

coordinates: Mapped[bytes]

created_at: Mapped[datetime]
