Services Module

The services module contains business logic services that coordinate between different components.

Embedding Service

Embedding service for ForzaEmbed.

This module provides the EmbeddingService class that handles embedding generation and caching. It abstracts the different embedding clients and provides a unified interface for the processing pipeline.

Example

Generate embeddings using the service:

from src.services.embedding_service import EmbeddingService

service = EmbeddingService(db, config)
embed_func = service.get_embedding_function(model_config)
embeddings, elapsed = service.get_or_create_embeddings(embed_func, "model", texts)

class src.services.embedding_service.EmbeddingService(db, config)

Bases: object

Handle embedding generation and caching.

Provides a unified interface for generating embeddings using different backends (API, FastEmbed, Sentence Transformers, etc.) with automatic caching.

db: The embedding database for caching.

config: The application configuration.

multiprocessing_config: Multiprocessing settings from config.

__init__(db, config)

Initialize the EmbeddingService.

Parameters:

db (EmbeddingDatabase) – The embedding database for caching.
config (AppConfig) – The application configuration.

get_embedding_function(model_config)

Create the appropriate embedding function based on model type.

Parameters: model_config (ModelConfig) – Configuration for the embedding model.
Returns: A callable that takes a list of texts and returns embeddings.
Raises: ValueError – If the model type is unsupported or an API model lacks base_url.
Return type: Callable[[List[str]], List[List[float]]]
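
The dispatch described above can be pictured as a small factory. This is a minimal sketch, not the actual ForzaEmbed implementation: DemoModelConfig, its type and name fields, the "api" and "local" type strings, and the placeholder backends are all hypothetical; only base_url and the ValueError behaviour come from the description above.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

EmbedFn = Callable[[List[str]], List[List[float]]]


@dataclass
class DemoModelConfig:
    """Hypothetical stand-in; the real ModelConfig fields may differ."""

    type: str
    name: str
    base_url: Optional[str] = None


def make_embedding_function(cfg: DemoModelConfig) -> EmbedFn:
    """Pick a backend by model type and return a texts -> vectors callable."""
    if cfg.type == "api":
        if not cfg.base_url:
            raise ValueError(f"API model {cfg.name!r} requires base_url")
        # Placeholder: a real client would send the texts to cfg.base_url.
        return lambda texts: [[0.0] for _ in texts]
    if cfg.type == "local":
        # Placeholder for a local backend (assumed type name).
        return lambda texts: [[float(len(t))] for t in texts]
    raise ValueError(f"Unsupported model type: {cfg.type!r}")
```

Because every branch returns a callable with the same signature, the rest of the pipeline does not need to know which backend produced the vectors.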

get_or_create_embeddings(embedding_function, base_model_name, phrases)

Retrieve embeddings from cache or generate and cache them.

Checks the database cache for existing embeddings. For phrases not in cache, generates new embeddings using the provided function and stores them.

Parameters:

embedding_function (Callable[[List[str]], List[List[float]]]) – Function to generate embeddings for texts.
base_model_name (str) – Name of the embedding model for cache key.
phrases (list[str]) – List of text phrases to embed.

Returns:

A tuple containing:

Dictionary mapping text hashes to embedding arrays.
Computation time in seconds for new embeddings.

Return type:

tuple
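
The cache-or-compute flow described above can be sketched as follows. This is an illustration under stated assumptions, not the service's actual code: the in-memory _cache dictionary stands in for the embedding database, and the helper names text_hash and get_or_create are hypothetical.

```python
import hashlib
import time
from typing import Callable, Dict, List, Tuple

# Hypothetical in-memory stand-in for the embedding database cache.
_cache: Dict[Tuple[str, str], List[float]] = {}


def text_hash(text: str) -> str:
    """SHA-256 hex digest of the UTF-8 encoded text (encoding is assumed)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def get_or_create(
    embed: Callable[[List[str]], List[List[float]]],
    model_name: str,
    phrases: List[str],
) -> Tuple[Dict[str, List[float]], float]:
    """Return (hash -> embedding, seconds spent computing cache misses)."""
    hashes = [text_hash(p) for p in phrases]
    # Keep only cache misses; duplicate phrases collapse onto one hash.
    missing = {
        h: p for h, p in zip(hashes, phrases) if (model_name, h) not in _cache
    }
    elapsed = 0.0
    if missing:
        start = time.perf_counter()
        vectors = embed(list(missing.values()))
        elapsed = time.perf_counter() - start
        for h, vec in zip(missing, vectors):
            _cache[(model_name, h)] = vec
    return {h: _cache[(model_name, h)] for h in hashes}, elapsed
```

Only the phrases missing from the cache are sent to the embedding function, so a repeated call with the same inputs reports a computation time of zero.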

static get_text_hash(text)

Generate a SHA-256 hash for a given text.

Parameters: text (str) – The text to hash.
Returns: Hexadecimal string of the SHA-256 hash.
Return type: str
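
A hash of this kind can be computed with the standard library. The sketch below assumes the text is encoded as UTF-8 before hashing, which the documentation does not state; the function name text_sha256 is hypothetical.

```python
import hashlib


def text_sha256(text: str) -> str:
    """Return the SHA-256 digest of the text as a 64-character hex string."""
    # Assumption: UTF-8 encoding; a different encoding yields different hashes.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

The digest is deterministic and has a fixed length, which makes it usable as a cache key regardless of how long the original phrase is.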

Similarity Service

Similarity calculation service for ForzaEmbed.

This module provides the SimilarityService class that handles various similarity and distance metric calculations between embeddings. It supports cosine, dot product, euclidean, manhattan, and chebyshev metrics.

Example

Calculate similarity between theme and phrase embeddings:

from src.services.similarity_service import SimilarityService

similarities = SimilarityService.calculate_similarity(
    embed_themes, embed_phrases, "cosine"
)
validated = SimilarityService.validate_similarities(similarities, "cosine")

class src.services.similarity_service.SimilarityService

Bases: object

Handle similarity calculations and validation.

Provides static methods for computing various similarity metrics between embedding matrices and validating/normalizing the results.

static calculate_similarity(embed_themes, embed_phrases, metric)

Calculate similarity between theme embeddings and phrase embeddings.

Parameters:

embed_themes (ndarray) – Theme embeddings array of shape (n_themes, n_dims).
embed_phrases (ndarray) – Phrase embeddings array of shape (n_phrases, n_dims).
metric (str) – The similarity metric to use. One of ‘cosine’, ‘dot_product’, ‘euclidean’, ‘manhattan’, or ‘chebyshev’.

Returns:

Similarity matrix of shape (n_themes, n_phrases).

Raises:

ValueError – If an unknown similarity metric is specified.

Return type:

ndarray
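
To make the five metrics concrete, here is a minimal NumPy sketch that produces an (n_themes, n_phrases) matrix for each one. It is an assumption-laden illustration rather than the service's implementation: the function name pairwise_scores is hypothetical, and for euclidean, manhattan, and chebyshev it returns raw distances, whereas the real method may convert them to similarities.

```python
import numpy as np


def pairwise_scores(
    themes: np.ndarray, phrases: np.ndarray, metric: str
) -> np.ndarray:
    """Return an (n_themes, n_phrases) matrix for the given metric name."""
    if metric == "dot_product":
        return themes @ phrases.T
    if metric == "cosine":
        # Normalize rows to unit length; zero vectors would produce NaN here.
        t = themes / np.linalg.norm(themes, axis=1, keepdims=True)
        p = phrases / np.linalg.norm(phrases, axis=1, keepdims=True)
        return t @ p.T
    if metric not in ("euclidean", "manhattan", "chebyshev"):
        raise ValueError(f"Unknown similarity metric: {metric}")
    # Broadcast to (n_themes, n_phrases, n_dims) for the distance metrics.
    diff = themes[:, None, :] - phrases[None, :, :]
    if metric == "euclidean":
        return np.sqrt((diff ** 2).sum(axis=-1))
    if metric == "manhattan":
        return np.abs(diff).sum(axis=-1)
    return np.abs(diff).max(axis=-1)  # chebyshev
```

The broadcasting step allocates an intermediate array of size n_themes × n_phrases × n_dims, which is fine for a sketch but may be too large for big inputs.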

static validate_similarities(similarities, metric)

Validate and clean similarities based on the metric used.

Handles NaN and infinite values, then normalizes the similarity values to an appropriate range based on the metric type.

Parameters:

similarities (ndarray) – Raw similarity matrix to validate.
metric (str) – The similarity metric that was used. One of ‘cosine’, ‘dot_product’, ‘euclidean’, ‘manhattan’, or ‘chebyshev’.

Returns:

Cleaned and normalized similarity matrix with values in [0, 1].

Return type:

ndarray
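
The documentation does not say how each metric is mapped to [0, 1], so the following is only one plausible scheme. It assumes cosine values are shifted from [-1, 1] to [0, 1], other metrics are min-max scaled over their finite values, distance metrics are inverted so that smaller distances score higher, and NaN or infinite entries become 0.0. The function name clean_and_scale is hypothetical.

```python
import numpy as np

DISTANCE_METRICS = ("euclidean", "manhattan", "chebyshev")


def clean_and_scale(scores: np.ndarray, metric: str) -> np.ndarray:
    """Map raw scores to [0, 1]; non-finite entries become 0.0 (assumed)."""
    scores = np.asarray(scores, dtype=float)
    finite = np.isfinite(scores)
    out = np.zeros_like(scores)
    if not finite.any():
        return out
    vals = scores[finite]
    if metric == "cosine":
        # Cosine lies in [-1, 1]; shift and scale to [0, 1].
        out[finite] = np.clip((vals + 1.0) / 2.0, 0.0, 1.0)
        return out
    lo, hi = vals.min(), vals.max()
    if hi > lo:
        scaled = (vals - lo) / (hi - lo)
        if metric in DISTANCE_METRICS:
            # Smaller distance means more similar, so invert the scale.
            scaled = 1.0 - scaled
        out[finite] = scaled
    return out
```

Min-max scaling makes the output depend on the other values in the same matrix, so scores from different runs are not directly comparable under this scheme.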

Visualization Service

Visualization service for ForzaEmbed.

This module provides the VisualizationService class that handles dimensionality reduction (t-SNE, UMAP, PCA) and caching for embedding visualizations.

class src.services.visualization_service.VisualizationService(db)

Bases: object

Handle visualization tasks like UMAP, PCA and t-SNE coordinate generation.

Manages the computation and caching of projection coordinates for embedding visualizations.

db: The embedding database for caching coordinates.

__init__(db)

Initialize the VisualizationService.

Parameters: db (EmbeddingDatabase) – The embedding database for caching.

get_or_create_projections(embeddings, base_key, file_id, similarities)

Compute or retrieve projection coordinates (UMAP, t-SNE, PCA).

Checks the database cache for existing coordinates using method-specific keys.

Parameters:

embeddings (ndarray) – Embedding matrix of shape (n_samples, n_dims).
base_key (str) – Base cache key for the computation.
file_id (str) – Identifier for the file being visualized.
similarities (ndarray) – Similarity matrix for determining labels.
threshold – Similarity threshold for labeling points.

Returns:

Dictionary containing projection data for umap, tsne, and pca. Returns None if embeddings have <= 1 sample or on error.

Return type:

Dict[str, Any] | None
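
As an illustration of the compute-or-retrieve pattern for one of the three projections, the sketch below computes 2-D PCA coordinates with NumPy and caches them under a method-specific key. It is not the service's code: the in-memory _coords_cache stands in for the database, the key format f"{base_key}_pca" is a guess, and UMAP and t-SNE are omitted because they need additional libraries.

```python
from typing import Dict, Optional

import numpy as np

# Hypothetical in-memory stand-in for the coordinate cache in the database.
_coords_cache: Dict[str, np.ndarray] = {}


def pca_projection(embeddings: np.ndarray, base_key: str) -> Optional[np.ndarray]:
    """Return cached or freshly computed 2-D PCA coordinates, or None."""
    if embeddings.ndim != 2 or embeddings.shape[0] <= 1:
        return None  # a projection needs at least two samples
    key = f"{base_key}_pca"  # assumed method-specific key format
    if key not in _coords_cache:
        centered = embeddings - embeddings.mean(axis=0)
        # SVD-based PCA: rows of vt are the principal directions,
        # ordered by decreasing explained variance.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        _coords_cache[key] = centered @ vt[:2].T
    return _coords_cache[key]
```

A second call with the same base_key returns the stored coordinates without recomputing the decomposition.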