Utilities

Utility functions and classes for data processing and storage.

Database

Database management module for ForzaEmbed.

This module provides the EmbeddingDatabase class for managing all database operations, including storing embeddings, results, and metadata. It implements intelligent quantization to reduce storage size, and it caches embeddings and t-SNE coordinates so they can be reused rather than recomputed.

Example

Basic database usage:

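from src.utils.database import EmbeddingDatabase

# `config` is the application configuration (AppConfig or dict).
db = EmbeddingDatabase("results.db", config)
# `embeddings_dict` maps text hashes to embedding arrays.
db.save_embeddings_batch("model_name", embeddings_dict)
# Returns a dict mapping the requested text hashes to cached embedding arrays.
cached = db.get_embeddings_by_hashes("model_name", ["hash1", "hash2"])

class src.utils.database.EmbeddingDatabase(db_path, config)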

Bases: object

Manage SQLite database for embeddings, results, and metadata.

This class handles all database operations for ForzaEmbed, including storing and retrieving embeddings, processing results, and metadata. It implements intelligent quantization to reduce storage size.

db_path

Path to the SQLite database file.

config

Application configuration (dict or AppConfig).

quantization_enabled

Whether intelligent quantization is enabled.

engine

SQLAlchemy database engine.

Session

SQLAlchemy session factory.

__init__(db_path, config)

Initialize the EmbeddingDatabase.

Parameters:
  • db_path (str) – Path to the SQLite database file.

  • config (AppConfig | Dict[str, Any]) – Application configuration, either as AppConfig or dict.

add_model(name, base_model_name, model_type, chunk_size, chunk_overlap, theme_name, chunking_strategy, similarity_metric)

Add a model run to the database.

Parameters:
  • name (str) – Unique run name identifier.

  • base_model_name (str) – The underlying model name.

  • model_type (str) – Type of model (api, fastembed, etc.).

  • chunk_size (int) – Chunk size used.

  • chunk_overlap (int) – Chunk overlap used.

  • theme_name (str) – Theme set name.

  • chunking_strategy (str) – Chunking strategy used.

  • similarity_metric (str) – Similarity metric used.

add_evaluation_metrics(model_name, metrics)

Add or update evaluation metrics for a model.

Parameters:
  • model_name (str) – The model run name.

  • metrics (Dict[str, float]) – Dictionary of metric names to values.
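The "add or update" behaviour can be pictured as an upsert. The following is a minimal sketch that uses the standard-library sqlite3 module rather than SQLAlchemy. The table layout (one row per run, keyed by model_name) and the helper name upsert_metrics_sketch are assumptions made for illustration, not ForzaEmbed's actual schema or code.

```python
import sqlite3
from typing import Dict

# Column names mirror the EvaluationMetric attributes documented in this section.
METRIC_COLUMNS = (
    "silhouette_score",
    "intra_cluster_distance_normalized",
    "inter_cluster_distance_normalized",
    "embedding_computation_time",
)


def upsert_metrics_sketch(
    conn: sqlite3.Connection, model_name: str, metrics: Dict[str, float]
) -> None:
    """Create the row for model_name if needed, then update the given metrics."""
    unknown = set(metrics) - set(METRIC_COLUMNS)
    if unknown:
        raise ValueError(f"Unknown metric names: {sorted(unknown)}")
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute(
            "INSERT OR IGNORE INTO evaluation_metrics (model_name) VALUES (?)",
            (model_name,),
        )
        for column, value in metrics.items():
            # Safe to interpolate: column was checked against METRIC_COLUMNS.
            conn.execute(
                f"UPDATE evaluation_metrics SET {column} = ? WHERE model_name = ?",
                (value, model_name),
            )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE evaluation_metrics (model_name TEXT PRIMARY KEY, "
    + ", ".join(f"{column} REAL" for column in METRIC_COLUMNS)
    + ")"
)
upsert_metrics_sketch(conn, "run_a", {"silhouette_score": 0.4})
upsert_metrics_sketch(
    conn, "run_a", {"silhouette_score": 0.6, "embedding_computation_time": 1.5}
)
rows = conn.execute(
    "SELECT model_name, silhouette_score, embedding_computation_time "
    "FROM evaluation_metrics"
).fetchall()
```

Calling the helper twice for the same run leaves a single row whose values reflect the most recent call, while metrics not mentioned in a call are left untouched.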

add_generated_file(model_name, file_type, file_path)

Add a generated file record to the database.

Parameters:
  • model_name (str) – The model run name.

  • file_type (str) – Type of the generated file.

  • file_path (str) – Path to the generated file.

add_global_chart(chart_type, file_path)

Add or update a global chart record.

Parameters:
  • chart_type (str) – Type identifier for the chart.

  • file_path (str) – Path to the chart image file.

model_exists(name)

Check if a model with the specified run name exists.

Parameters:

name (str) – The run name to check.

Returns:

True if the model exists, False otherwise.

Return type:

bool

save_processing_result(model_name, file_id, results)

Save detailed processing result for a file and model.

Parameters:
  • model_name (str) – The model run name.

  • file_id (str) – Identifier for the processed file.

  • results (Dict[str, Any]) – Dictionary of processing results.

save_processing_results_batch(results_batch)

Save a batch of processing results in a single transaction.

Parameters:

results_batch (List[Tuple[str, str, Dict[str, Any]]]) – List of (model_name, file_id, results) tuples.
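To illustrate what saving a batch "in a single transaction" means, here is a minimal sketch built on the standard-library sqlite3 module instead of SQLAlchemy. The table layout, the JSON serialization of results, and the helper name save_results_batch_sketch are assumptions for the example only.

```python
import json
import sqlite3
from typing import Any, Dict, List, Tuple


def save_results_batch_sketch(
    conn: sqlite3.Connection,
    results_batch: List[Tuple[str, str, Dict[str, Any]]],
) -> None:
    """Insert every (model_name, file_id, results) tuple in one transaction."""
    rows = [
        (model_name, file_id, json.dumps(results).encode("utf-8"))
        for model_name, file_id, results in results_batch
    ]
    # Using the connection as a context manager commits on success and
    # rolls back if any insert raises, so the batch is all-or-nothing.
    with conn:
        conn.executemany(
            "INSERT INTO processing_results (model_name, file_id, results_blob) "
            "VALUES (?, ?, ?)",
            rows,
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE processing_results ("
    "id INTEGER PRIMARY KEY, model_name TEXT, file_id TEXT, results_blob BLOB)"
)
save_results_batch_sketch(
    conn,
    [
        ("run_a", "file_1", {"similarities": [0.9, 0.1]}),
        ("run_a", "file_2", {"similarities": [0.4, 0.6]}),
    ],
)
stored = conn.execute("SELECT COUNT(*) FROM processing_results").fetchone()[0]
```

Grouping the inserts this way avoids one commit per row and guarantees that a failure partway through leaves no partial batch behind.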

get_processed_files(model_name)

Retrieve file IDs that have been processed for a model.

Parameters:

model_name (str) – The model run name.

Returns:

List of file IDs that have been processed.

Return type:

List[str]

get_model_info(run_name)

Retrieve information about a model by its run name.

Parameters:

run_name (str) – The unique run name identifier.

Returns:

Dictionary with model information, or None if not found.

Return type:

Dict[str, Any] | None

get_all_processing_results()

Retrieve all processing results organized by model run name.

Fetches raw file-level results without model-level aggregation.

Returns:

Dictionary mapping model names to their file results.

Return type:

Dict[str, Any]

get_all_models()

Retrieve all models with their metrics.

Returns:

List of dictionaries containing model information and metrics.

Return type:

List[Dict[str, Any]]

get_model_files(model_name)

Retrieve all generated files for a model.

Parameters:

model_name (str) – The model run name.

Returns:

List of dictionaries with file type and path.

Return type:

List[Dict[str, str]]

get_global_charts()

Retrieve all global charts.

Returns:

List of dictionaries with chart type and path.

Return type:

List[Dict[str, str]]

vacuum_database()

Vacuum the database to reclaim space.
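For readers unfamiliar with SQLite's VACUUM statement, the sketch below shows its effect on file size using the standard-library sqlite3 module. The helper name vacuum_sketch and the demo table are hypothetical; how EmbeddingDatabase issues the statement internally is not shown in this reference.

```python
import os
import sqlite3
import tempfile


def vacuum_sketch(db_path: str) -> None:
    """Rebuild the database file so that free pages are released."""
    conn = sqlite3.connect(db_path)
    try:
        # VACUUM cannot run inside an open transaction; a fresh connection
        # with no pending writes satisfies that requirement.
        conn.execute("VACUUM")
    finally:
        conn.close()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.db")
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE blobs (id INTEGER PRIMARY KEY, data BLOB)")
    with conn:
        conn.executemany(
            "INSERT INTO blobs (data) VALUES (?)",
            [(b"x" * 4096,) for _ in range(200)],
        )
    with conn:
        # Deleting rows frees pages inside the file but does not shrink it.
        conn.execute("DELETE FROM blobs")
    conn.close()

    size_before = os.path.getsize(path)
    vacuum_sketch(path)
    size_after = os.path.getsize(path)
```

Vacuuming is most useful after large deletions, such as clearing cached embeddings, because SQLite otherwise keeps the freed pages for future reuse.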

get_all_run_names()

Retrieve all existing run names.

Returns:

List of unique run name identifiers.

Return type:

list[str]

get_processed_files_with_similarities(run_name)

Retrieve files that have been processed with similarity scores.

Parameters:

run_name (str) – The model run name.

Returns:

List of file IDs that have similarity data.

Return type:

list[str]

get_embeddings_by_hashes(base_model_name, text_hashes)

Retrieve embeddings from cache by model and text hashes.

Parameters:
  • base_model_name (str) – The base model name used for embeddings.

  • text_hashes (List[str]) – List of text hash values to retrieve.

Returns:

Dictionary mapping text hashes to embedding arrays.

Return type:

Dict[str, ndarray]
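A hash-keyed cache lookup of this kind can be sketched as follows. The example uses the standard-library sqlite3 and struct modules with plain lists of floats in place of NumPy arrays. The table layout follows the EmbeddingCache attributes documented in this section, while the SHA-256 hashing, the float32 byte encoding, and the helper names are assumptions.

```python
import hashlib
import sqlite3
import struct
from typing import Dict, List


def text_hash(text: str) -> str:
    """Hash a text chunk; SHA-256 is an assumption for this sketch."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def get_by_hashes_sketch(
    conn: sqlite3.Connection, base_model_name: str, text_hashes: List[str]
) -> Dict[str, List[float]]:
    """Return only the cached vectors; missing hashes are simply absent."""
    if not text_hashes:
        return {}
    placeholders = ",".join("?" for _ in text_hashes)
    rows = conn.execute(
        "SELECT text_hash, vector, dimension FROM embedding_cache "
        f"WHERE model_name = ? AND text_hash IN ({placeholders})",
        [base_model_name, *text_hashes],
    ).fetchall()
    # Each vector is stored as little-endian float32 bytes in this sketch.
    return {h: list(struct.unpack(f"<{dim}f", blob)) for h, blob, dim in rows}


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE embedding_cache ("
    "model_name TEXT, text_hash TEXT, vector BLOB, dimension INTEGER, "
    "PRIMARY KEY (model_name, text_hash))"
)
known = text_hash("opening hours")
with conn:
    conn.execute(
        "INSERT INTO embedding_cache VALUES (?, ?, ?, ?)",
        ("model_name", known, struct.pack("<2f", 0.5, 0.25), 2),
    )

wanted = [known, text_hash("not cached yet")]
cached = get_by_hashes_sketch(conn, "model_name", wanted)
missing = [h for h in wanted if h not in cached]  # these still need embedding
```

A caller would typically embed only the texts whose hashes are in missing and then store the new vectors, so repeated runs over the same chunks avoid recomputation.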

save_embeddings_batch(base_model_name, embeddings)

Save a batch of embeddings to the cache.

Applies intelligent quantization to reduce storage size when enabled.

Parameters:
  • base_model_name (str) – The base model name for the embeddings.

  • embeddings (Dict[str, ndarray]) – Dictionary mapping text hashes to embedding arrays.
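This reference does not describe the quantization scheme itself. As one possible illustration of how quantization reduces storage, the sketch below packs a vector as half-precision (float16) values with the standard-library struct module. The choice of float16 and both helper names are assumptions and may differ from what ForzaEmbed does.

```python
import struct
from typing import List


def quantize_float16(vector: List[float]) -> bytes:
    """Pack a vector as little-endian float16, two bytes per component."""
    return struct.pack(f"<{len(vector)}e", *vector)


def dequantize_float16(blob: bytes) -> List[float]:
    """Unpack float16 bytes back into Python floats."""
    count = len(blob) // 2
    return list(struct.unpack(f"<{count}e", blob))


vector = [0.125, -0.5, 0.333, 0.9]
blob = quantize_float16(vector)
restored = dequantize_float16(blob)

float32_size = len(struct.pack(f"<{len(vector)}f", *vector))  # 4 bytes each
max_error = max(abs(a - b) for a, b in zip(vector, restored))
```

In this sketch the stored blob is half the size of a float32 encoding, at the cost of a small rounding error per component.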

save_tsne_coordinates(tsne_key, file_id, coordinates)

Save t-SNE coordinates for a given configuration.

Parameters:
  • tsne_key (str) – Unique key for the t-SNE configuration.

  • file_id (str) – Identifier for the file.

  • coordinates (Dict[str, List[float]]) – Dictionary with ‘x’ and ‘y’ coordinate lists.

get_tsne_coordinates(tsne_key, file_id)

Retrieve t-SNE coordinates for a given configuration.

Parameters:
  • tsne_key (str) – Unique key for the t-SNE configuration.

  • file_id (str) – Identifier for the file.

Returns:

Dictionary with ‘x’ and ‘y’ coordinate lists, or None if not found.

Return type:

Dict[str, List[float]] | None

clear_tsne_cache()

Clear all cached t-SNE coordinates.

get_run_details(run_name)

Retrieve detailed information for a specific run.

Parameters:

run_name (str) – The unique run name identifier.

Returns:

Dictionary with full run details, or None if not found.

Return type:

Dict[str, Any] | None

get_all_processing_results_for_run(model_name)

Retrieve all processing results for a specific run, dequantizing the stored values.

Parameters:

model_name (str) – The model run name.

Returns:

Dictionary mapping file IDs to their processing results.

Return type:

Dict[str, Dict[str, Any]]

update_metrics_for_file(model_name, file_id, metrics)

Update metrics for a specific file in a model run.

Parameters:
  • model_name (str) – The model run name.

  • file_id (str) – Identifier for the file.

  • metrics (Dict[str, Any]) – Dictionary of metric values to update.

get_db_modification_time()

Get the last modification time of the database file.

Returns:

Unix timestamp of the last modification.

Return type:

float
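A modification timestamp like this is commonly used to decide whether data derived from the database is stale. The sketch below reads the timestamp with os.path.getmtime. The helper name and the staleness check are hypothetical, and the real method may obtain the value differently.

```python
import os
import tempfile


def db_modification_time_sketch(db_path: str) -> float:
    """Return the file's last-modification time as a Unix timestamp."""
    return os.path.getmtime(db_path)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "results.db")
    with open(path, "wb") as handle:
        handle.write(b"placeholder")
    # Pin the timestamp so the example is deterministic.
    os.utime(path, (1_700_000_000, 1_700_000_000))
    mtime = db_modification_time_sketch(path)

# A caller can compare against the time its own cache was built.
cache_built_at = 1_600_000_000.0
cache_is_stale = mtime > cache_built_at
```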

Data Loader

Data loading utilities for ForzaEmbed.

This module provides functions for loading markdown content from various sources including directories and lists of strings. It handles file I/O and content extraction.

Example

Load markdown files from a directory:

from src.utils.data_loader import load_markdown_files

files = load_markdown_files("markdowns/")
for name, content in files:
    print(f"Loaded: {name}")

src.utils.data_loader.load_markdown_files(data_source)

Load markdown content from various sources.

Supports loading from:
  1. A directory path (str or Path), from which all .md files are loaded.
  2. A list of strings, where each string is markdown content.

Parameters:

data_source (str | Path | List[str]) – The source of markdown data. Can be a directory path or a list of markdown content strings.

Returns:

List of tuples containing (name, content) pairs.

Raises:

TypeError – If data_source is not a supported type.

Return type:

List[Tuple[str, str]]
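The dispatch on the type of data_source can be sketched as below. This is a hypothetical re-implementation for illustration only. In particular, the names generated for in-memory strings (document_0, document_1, ...) and the sorted file order are assumptions, since this reference does not specify them.

```python
from pathlib import Path
import tempfile
from typing import List, Tuple, Union


def load_markdown_sketch(
    data_source: Union[str, Path, List[str]],
) -> List[Tuple[str, str]]:
    """Return (name, content) pairs from a directory or a list of strings."""
    if isinstance(data_source, (str, Path)):
        directory = Path(data_source)
        # Sorting keeps the order deterministic across platforms.
        return [
            (path.name, path.read_text(encoding="utf-8"))
            for path in sorted(directory.glob("*.md"))
        ]
    if isinstance(data_source, list):
        # In-memory strings have no file name, so a placeholder is generated.
        return [(f"document_{i}", text) for i, text in enumerate(data_source)]
    raise TypeError(f"Unsupported data source type: {type(data_source).__name__}")


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "b.md").write_text("# B", encoding="utf-8")
    Path(tmp, "a.md").write_text("# A", encoding="utf-8")
    Path(tmp, "notes.txt").write_text("ignored", encoding="utf-8")
    from_directory = load_markdown_sketch(tmp)

from_strings = load_markdown_sketch(["# First", "# Second"])
```

Both branches return the same (name, content) shape, so downstream code does not need to know where the markdown came from.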

Text Processing

Text processing utilities for ForzaEmbed.

This module provides utility functions for text chunking, pattern matching, and context extraction. It supports multiple chunking strategies including langchain, semchunk, nltk, spacy, and raw character-based chunking.

Example

Chunk text using different strategies:

from src.utils.utils import chunk_text

# `text` is the string to split.
chunks = chunk_text(text, chunk_size=500, chunk_overlap=50, strategy="langchain")

src.utils.utils.get_spacy_model(language)

Load and cache a spaCy model for a given language.

Downloads the model if it’s not available locally.

Parameters:

language (str) – Language code (‘fr’ for French, ‘en’ for English).

Returns:

Loaded spaCy Language model.

Raises:

ValueError – If the language is not supported.

Return type:

Language

src.utils.utils.chunk_text(text, chunk_size, chunk_overlap, strategy='langchain', language='fr')

Split text into segments using a specified strategy.

Supports multiple chunking strategies with different characteristics. Some strategies (nltk, spacy) ignore chunk_size and chunk_overlap.

Parameters:
  • text (str) – Text to split.

  • chunk_size (int) – Size of chunks in characters (ignored by nltk, spacy).

  • chunk_overlap (int) – Overlap between chunks (ignored by nltk, spacy).

  • strategy (str) – Chunking strategy to use. One of: ‘langchain’, ‘semchunk’, ‘nltk’, ‘spacy’, ‘raw’.

  • language (str) – Language of the text (‘fr’ or ‘en’).

Returns:

List of extracted text segments.

Raises:

ValueError – If an unknown chunking strategy is specified.

Return type:

List[str]
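To make the interaction between chunk_size and chunk_overlap concrete, here is a minimal sketch of character-based chunking with a sliding window, in the spirit of the 'raw' strategy. The function name and the exact boundary handling are assumptions. The actual implementation may treat the final chunk or invalid arguments differently.

```python
from typing import List


def raw_chunks_sketch(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Slide a fixed-size character window over the text."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    if not text:
        return []
    # Each window starts `step` characters after the previous one, so
    # consecutive chunks share `chunk_overlap` characters.
    step = chunk_size - chunk_overlap
    # Stop once the previous window already reached the end of the text.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


chunks = raw_chunks_sketch("abcdefghij", chunk_size=4, chunk_overlap=1)
# chunks == ["abcd", "defg", "ghij"]
```

Each chunk starts with the last character of the previous one, which is the effect of an overlap of 1.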

src.utils.utils.contains_horaire_pattern(text, keywords)

Check if text contains patterns related to opening hours.

Uses regex patterns to detect time-related content including days, times, and action keywords.

Parameters:
  • text (str) – Text to analyze.

  • keywords (dict[str, list[str]]) – Dictionary with ‘jours’ (days) and ‘actions’ keyword lists.

Returns:

True if an opening hours pattern is found, False otherwise.

Return type:

bool
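The regex patterns themselves are not listed in this reference. The sketch below shows one plausible way to combine a time pattern with the 'jours' and 'actions' keyword lists. The time formats, the rule that a time must co-occur with a keyword, and the function name looks_like_opening_hours are all assumptions.

```python
import re
from typing import Dict, List

# Matches times such as "9h", "14h30", "18 h 15" or "9:00" (assumed formats).
TIME_PATTERN = re.compile(
    r"\b(?:[01]?\d|2[0-3])\s?(?:h\s?(?:[0-5]\d)?|:[0-5]\d)\b",
    re.IGNORECASE,
)


def looks_like_opening_hours(text: str, keywords: Dict[str, List[str]]) -> bool:
    """Return True when a time appears together with a day or action keyword."""
    if not TIME_PATTERN.search(text):
        return False
    words = keywords.get("jours", []) + keywords.get("actions", [])
    if not words:
        return False
    # Escape each keyword and match it as a whole word, ignoring case.
    keyword_pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(word) for word in words) + r")\b",
        re.IGNORECASE,
    )
    return keyword_pattern.search(text) is not None


keywords = {"jours": ["lundi", "mardi"], "actions": ["ouvert", "fermé"]}
with_hours = looks_like_opening_hours(
    "Ouvert du lundi au vendredi de 9h à 18h30", keywords
)
without_time = looks_like_opening_hours(
    "La mairie est ouverte toute l'année", keywords
)
time_only = looks_like_opening_hours("Rendez-vous à 14h30 devant la gare", keywords)
```

Requiring both a time and a keyword keeps sentences that merely mention a time, such as an appointment, from being flagged.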

src.utils.utils.extract_context_around_phrase(phrases, phrase_index)

Extract and highlight context around a target sentence.

Parameters:
  • phrases (list[str]) – List of sentences.

  • phrase_index (int) – Index of the target sentence.

Returns:

The target sentence wrapped in markdown bold formatting, or an empty string if the index is out of bounds.

Return type:

str
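The documented return value can be reproduced with a few lines. This is a sketch under the assumption that negative indices count as out of bounds, and the function name bold_target_sentence is hypothetical.

```python
from typing import List


def bold_target_sentence(phrases: List[str], phrase_index: int) -> str:
    """Return the sentence at phrase_index in bold, or "" when out of bounds."""
    # Negative indices are treated as out of bounds here (an assumption).
    if not 0 <= phrase_index < len(phrases):
        return ""
    return f"**{phrases[phrase_index]}**"


sentences = ["Bienvenue.", "Ouvert de 9h à 17h.", "Merci."]
highlighted = bold_target_sentence(sentences, 1)
out_of_range = bold_target_sentence(sentences, 5)
```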

Database Models

SQLAlchemy ORM models for the ForzaEmbed database.

This module defines all database models used for storing embedding results, metrics, and metadata. Uses SQLAlchemy 2.0 declarative mapping with type annotations.

Models:

  • Model: Stores model run configurations.

  • EvaluationMetric: Stores evaluation metrics for each model run.

  • GeneratedFile: Tracks generated output files.

  • GlobalChart: Stores paths to global chart images.

  • ProcessingResult: Stores detailed processing results per file.

  • EmbeddingCache: Caches computed embeddings for reuse.

  • TSNECoordinate: Caches t-SNE coordinate calculations.

class src.utils.models.Base(**kwargs)

Bases: DeclarativeBase

Base class for all SQLAlchemy ORM models.

__init__(**kwargs)

A simple constructor that allows initialization from kwargs.

Sets attributes on the constructed instance using the names and values in kwargs.

Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.

metadata: ClassVar[MetaData] = MetaData()

Refers to the MetaData collection that will be used for new Table objects.

registry: ClassVar[registry] = <sqlalchemy.orm.decl_api.registry object>

Refers to the registry in use where new Mapper objects will be associated.

class src.utils.models.Model(**kwargs)

Bases: Base

Stores model run configuration and metadata.

id

Primary key.

Type:

sqlalchemy.orm.base.Mapped[int]

name

Unique run name identifier.

Type:

sqlalchemy.orm.base.Mapped[str]

base_model_name

The underlying model name.

Type:

sqlalchemy.orm.base.Mapped[str]

type

Model type (api, fastembed, sentence_transformers, etc.).

Type:

sqlalchemy.orm.base.Mapped[str]

chunk_size

Chunk size used in this run.

Type:

sqlalchemy.orm.base.Mapped[int]

chunk_overlap

Chunk overlap used in this run.

Type:

sqlalchemy.orm.base.Mapped[int]

theme_name

Theme set name used.

Type:

sqlalchemy.orm.base.Mapped[str]

chunking_strategy

Chunking strategy used.

Type:

sqlalchemy.orm.base.Mapped[str]

similarity_metric

Similarity metric used.

Type:

sqlalchemy.orm.base.Mapped[str | None]

created_at

Timestamp of creation.

Type:

sqlalchemy.orm.base.Mapped[datetime.datetime]

metrics

Related evaluation metrics.

Type:

sqlalchemy.orm.base.Mapped[src.utils.models.EvaluationMetric]

generated_files

Related generated files.

Type:

sqlalchemy.orm.base.Mapped[list[src.utils.models.GeneratedFile]]

id: Mapped[int]
name: Mapped[str]
base_model_name: Mapped[str]
type: Mapped[str]
chunk_size: Mapped[int]
chunk_overlap: Mapped[int]
theme_name: Mapped[str]
chunking_strategy: Mapped[str]
similarity_metric: Mapped[str | None]
created_at: Mapped[datetime]
metrics: Mapped[EvaluationMetric]
generated_files: Mapped[list[GeneratedFile]]

class src.utils.models.EvaluationMetric(**kwargs)

Bases: Base

Stores evaluation metrics for a model run.

id

Primary key.

Type:

sqlalchemy.orm.base.Mapped[int]

model_name

Foreign key to the model.

Type:

sqlalchemy.orm.base.Mapped[str]

silhouette_score

Silhouette clustering score.

Type:

sqlalchemy.orm.base.Mapped[float | None]

intra_cluster_distance_normalized

Normalized intra-cluster distance.

Type:

sqlalchemy.orm.base.Mapped[float | None]

inter_cluster_distance_normalized

Normalized inter-cluster distance.

Type:

sqlalchemy.orm.base.Mapped[float | None]

embedding_computation_time

Time taken to compute embeddings.

Type:

sqlalchemy.orm.base.Mapped[float | None]

created_at

Timestamp of creation.

Type:

sqlalchemy.orm.base.Mapped[datetime.datetime]

model

Related model instance.

Type:

sqlalchemy.orm.base.Mapped[src.utils.models.Model]

id: Mapped[int]
model_name: Mapped[str]
silhouette_score: Mapped[float | None]
intra_cluster_distance_normalized: Mapped[float | None]
inter_cluster_distance_normalized: Mapped[float | None]
embedding_computation_time: Mapped[float | None]
created_at: Mapped[datetime]
model: Mapped[Model]

class src.utils.models.GeneratedFile(**kwargs)

Bases: Base

Tracks generated output files for a model run.

id

Primary key.

Type:

sqlalchemy.orm.base.Mapped[int]

model_name

Foreign key to the model.

Type:

sqlalchemy.orm.base.Mapped[str]

file_type

Type of the generated file.

Type:

sqlalchemy.orm.base.Mapped[str]

file_path

Path to the generated file.

Type:

sqlalchemy.orm.base.Mapped[str]

created_at

Timestamp of creation.

Type:

sqlalchemy.orm.base.Mapped[datetime.datetime]

model

Related model instance.

Type:

sqlalchemy.orm.base.Mapped[src.utils.models.Model]

id: Mapped[int]
model_name: Mapped[str]
file_type: Mapped[str]
file_path: Mapped[str]
created_at: Mapped[datetime]
model: Mapped[Model]

class src.utils.models.GlobalChart(**kwargs)

Bases: Base

Stores paths to global chart images.

id

Primary key.

Type:

sqlalchemy.orm.base.Mapped[int]

chart_type

Type identifier for the chart.

Type:

sqlalchemy.orm.base.Mapped[str]

file_path

Path to the chart image file.

Type:

sqlalchemy.orm.base.Mapped[str]

created_at

Timestamp of creation.

Type:

sqlalchemy.orm.base.Mapped[datetime.datetime]

id: Mapped[int]
chart_type: Mapped[str]
file_path: Mapped[str]
created_at: Mapped[datetime]

class src.utils.models.ProcessingResult(**kwargs)

Bases: Base

Stores detailed processing results for each file.

id

Primary key.

Type:

sqlalchemy.orm.base.Mapped[int]

model_name

The model run name.

Type:

sqlalchemy.orm.base.Mapped[str]

file_id

Identifier for the processed file.

Type:

sqlalchemy.orm.base.Mapped[str]

results_blob

Serialized results data.

Type:

sqlalchemy.orm.base.Mapped[bytes]

created_at

Timestamp of creation.

Type:

sqlalchemy.orm.base.Mapped[datetime.datetime]

id: Mapped[int]
model_name: Mapped[str]
file_id: Mapped[str]
results_blob: Mapped[bytes]
created_at: Mapped[datetime]

class src.utils.models.EmbeddingCache(**kwargs)

Bases: Base

Caches computed embeddings for reuse.

model_name

The model name (part of composite primary key).

Type:

sqlalchemy.orm.base.Mapped[str]

text_hash

Hash of the embedded text (part of composite primary key).

Type:

sqlalchemy.orm.base.Mapped[str]

vector

Serialized embedding vector.

Type:

sqlalchemy.orm.base.Mapped[bytes]

dimension

Dimension of the embedding vector.

Type:

sqlalchemy.orm.base.Mapped[int]

created_at

Timestamp of creation.

Type:

sqlalchemy.orm.base.Mapped[datetime.datetime]

model_name: Mapped[str]
text_hash: Mapped[str]
vector: Mapped[bytes]
dimension: Mapped[int]
created_at: Mapped[datetime]

class src.utils.models.TSNECoordinate(**kwargs)

Bases: Base

Caches t-SNE coordinate calculations.

id

Primary key.

Type:

sqlalchemy.orm.base.Mapped[int]

tsne_key

Unique key for the t-SNE configuration.

Type:

sqlalchemy.orm.base.Mapped[str]

file_id

Identifier for the file.

Type:

sqlalchemy.orm.base.Mapped[str]

coordinates

Serialized coordinate data.

Type:

sqlalchemy.orm.base.Mapped[bytes]

created_at

Timestamp of creation.

Type:

sqlalchemy.orm.base.Mapped[datetime.datetime]

id: Mapped[int]
tsne_key: Mapped[str]
file_id: Mapped[str]
coordinates: Mapped[bytes]
created_at: Mapped[datetime]