Services Module

The services module contains the business-logic services that coordinate the other components of ForzaEmbed: embedding generation and caching, similarity calculation, and t-SNE visualization.

Embedding Service

Embedding service for ForzaEmbed.

This module provides the EmbeddingService class that handles embedding generation and caching. It abstracts the different embedding clients and provides a unified interface for the processing pipeline.

Example

Generate embeddings using the service:

from src.services.embedding_service import EmbeddingService

service = EmbeddingService(db, config)
embed_func = service.get_embedding_function(model_config)
embeddings, elapsed = service.get_or_create_embeddings(embed_func, "model", texts)

class src.services.embedding_service.EmbeddingService(db, config)

Bases: object

Handle embedding generation and caching.

Provides a unified interface for generating embeddings using different backends (API, FastEmbed, Sentence Transformers, etc.) with automatic caching.

db

The embedding database for caching.

config

The application configuration.

multiprocessing_config

Multiprocessing settings from config.

__init__(db, config)

Initialize the EmbeddingService.

Parameters:
  • db – The embedding database for caching.

  • config – The application configuration.

get_embedding_function(model_config)

Create the appropriate embedding function based on model type.

Parameters:

model_config (ModelConfig) – Configuration for the embedding model.

Returns:

A callable that takes a list of texts and returns embeddings.

Raises:

ValueError – If the model type is unsupported, or if an API model lacks a base_url.

Return type:

Callable[[List[str]], List[List[float]]]
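The dispatch described above can be pictured with a small factory. This is only a sketch under stated assumptions: ModelConfigSketch, its field names (name, type, base_url), the type strings "api" and "local", and the placeholder backends are all hypothetical and are not taken from the ForzaEmbed source. The only behavior borrowed from the documentation is the ValueError for an unsupported type or an API model without a base_url.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

EmbedFn = Callable[[List[str]], List[List[float]]]


@dataclass
class ModelConfigSketch:
    """Hypothetical stand-in for ModelConfig; the real fields may differ."""
    name: str
    type: str
    base_url: Optional[str] = None


def make_embedding_function(model_config: ModelConfigSketch) -> EmbedFn:
    """Return a callable for the configured backend, or raise ValueError."""
    if model_config.type == "api":
        if not model_config.base_url:
            raise ValueError(f"API model {model_config.name!r} requires a base_url")
        # Placeholder: a real client would send the texts to base_url.
        return lambda texts: [[float(len(t))] for t in texts]
    if model_config.type == "local":
        # Placeholder for a local backend (FastEmbed, Sentence Transformers, ...).
        return lambda texts: [[float(len(t)), 0.0] for t in texts]
    raise ValueError(f"Unsupported model type: {model_config.type!r}")


embed = make_embedding_function(ModelConfigSketch(name="demo", type="local"))
vectors = embed(["ab", "c"])
```

Returning a plain callable keeps the caller independent of the backend, which matches the unified interface the class description mentions.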

get_or_create_embeddings(embedding_function, base_model_name, phrases)

Retrieve embeddings from cache or generate and cache them.

Checks the database cache for existing embeddings. For phrases not in cache, generates new embeddings using the provided function and stores them.

Parameters:
  • embedding_function (Callable[[List[str]], List[List[float]]]) – Function to generate embeddings for texts.

  • base_model_name (str) – Name of the embedding model for cache key.

  • phrases (list[str]) – List of text phrases to embed.

Returns:

A tuple containing:

  • Dictionary mapping text hashes to embedding arrays.

  • Computation time in seconds for new embeddings.
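The cache-then-generate flow can be sketched as follows. It is a minimal illustration, not the service's implementation: a plain dict stands in for the embedding database, the cache key layout (model name, text hash) is an assumption, and get_or_create and fake_embed are hypothetical names.

```python
import hashlib
import time
from typing import Callable, Dict, List, Tuple

EmbedFn = Callable[[List[str]], List[List[float]]]
Cache = Dict[Tuple[str, str], List[float]]


def text_hash(text: str) -> str:
    # SHA-256 of the UTF-8 bytes (the encoding is an assumption).
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def get_or_create(cache: Cache, embed_fn: EmbedFn, model_name: str,
                  phrases: List[str]) -> Tuple[Dict[str, List[float]], float]:
    """Return ({hash: embedding}, seconds spent generating new embeddings)."""
    by_hash = {text_hash(p): p for p in phrases}  # also removes duplicates
    missing = [(h, p) for h, p in by_hash.items() if (model_name, h) not in cache]
    elapsed = 0.0
    if missing:
        start = time.perf_counter()
        vectors = embed_fn([p for _, p in missing])
        elapsed = time.perf_counter() - start
        for (h, _), vec in zip(missing, vectors):
            cache[(model_name, h)] = vec
    return {h: cache[(model_name, h)] for h in by_hash}, elapsed


calls: List[List[str]] = []


def fake_embed(texts: List[str]) -> List[List[float]]:
    calls.append(list(texts))  # record each batch sent to the "backend"
    return [[float(len(t))] for t in texts]


cache: Cache = {}
first, _ = get_or_create(cache, fake_embed, "model", ["a", "bb"])
second, _ = get_or_create(cache, fake_embed, "model", ["a", "bb", "ccc"])
```

On the second call only "ccc" reaches fake_embed, because the other two phrases are already cached.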

static get_text_hash(text)

Generate a SHA-256 hash for a given text.

Parameters:

text (str) – The text to hash.

Returns:

Hexadecimal string of the SHA-256 hash.

Return type:

str
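For reference, a hash of this kind can be produced with the standard library alone. The sketch below assumes the text is encoded as UTF-8 before hashing, which the documentation does not state.

```python
import hashlib


def get_text_hash(text: str) -> str:
    """Return the SHA-256 digest of text as a 64-character hex string."""
    # Assumption: UTF-8 encoding; the real method may encode differently.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


digest = get_text_hash("Embedding service for ForzaEmbed.")
```

Because the digest depends only on the text, identical phrases always map to the same cache key.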

Similarity Service

Similarity calculation service for ForzaEmbed.

This module provides the SimilarityService class, which calculates similarity and distance metrics between embeddings. It supports the cosine, dot product, Euclidean, Manhattan, and Chebyshev metrics.

Example

Calculate similarity between theme and phrase embeddings:

from src.services.similarity_service import SimilarityService

similarities = SimilarityService.calculate_similarity(
    embed_themes, embed_phrases, "cosine"
)
validated = SimilarityService.validate_similarities(similarities, "cosine")

class src.services.similarity_service.SimilarityService

Bases: object

Handle similarity calculations and validation.

Provides static methods for computing various similarity metrics between embedding matrices and validating/normalizing the results.

static calculate_similarity(embed_themes, embed_phrases, metric)

Calculate similarity between theme embeddings and phrase embeddings.

Parameters:
  • embed_themes (ndarray) – Theme embeddings array of shape (n_themes, n_dims).

  • embed_phrases (ndarray) – Phrase embeddings array of shape (n_phrases, n_dims).

  • metric (str) – The similarity metric to use. One of ‘cosine’, ‘dot_product’, ‘euclidean’, ‘manhattan’, or ‘chebyshev’.

Returns:

Similarity matrix of shape (n_themes, n_phrases).

Raises:

ValueError – If an unknown similarity metric is specified.

Return type:

ndarray
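The five metrics can be written compactly with NumPy broadcasting. This is a sketch under assumptions rather than the library's code: pairwise_metric is a hypothetical name, and the three distance metrics are returned here as raw distances, whereas the real method may already convert them to similarities.

```python
import numpy as np

DISTANCE_METRICS = ("euclidean", "manhattan", "chebyshev")


def pairwise_metric(embed_themes, embed_phrases, metric):
    """Return an (n_themes, n_phrases) matrix for the requested metric."""
    a = np.asarray(embed_themes, dtype=float)
    b = np.asarray(embed_phrases, dtype=float)
    if metric == "dot_product":
        return a @ b.T
    if metric == "cosine":
        norms = np.linalg.norm(a, axis=1)[:, None] * np.linalg.norm(b, axis=1)[None, :]
        # Guard against zero-length vectors to avoid division by zero.
        return (a @ b.T) / np.where(norms == 0.0, 1.0, norms)
    if metric not in DISTANCE_METRICS:
        raise ValueError(f"Unknown similarity metric: {metric!r}")
    # Broadcast to shape (n_themes, n_phrases, n_dims).
    diff = a[:, None, :] - b[None, :, :]
    if metric == "euclidean":
        return np.sqrt((diff ** 2).sum(axis=2))
    if metric == "manhattan":
        return np.abs(diff).sum(axis=2)
    return np.abs(diff).max(axis=2)  # chebyshev


themes = np.array([[1.0, 0.0], [0.0, 1.0]])
phrases = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
cosine = pairwise_metric(themes, phrases, "cosine")
```

Here cosine has shape (2, 3), one row per theme and one column per phrase.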

static validate_similarities(similarities, metric)

Validate and clean similarities based on the metric used.

Handles NaN and infinite values, then normalizes the similarity values to an appropriate range based on the metric type.

Parameters:
  • similarities (ndarray) – Raw similarity matrix to validate.

  • metric (str) – The similarity metric that was used. One of ‘cosine’, ‘dot_product’, ‘euclidean’, ‘manhattan’, or ‘chebyshev’.

Returns:

Cleaned and normalized similarity matrix with values in [0, 1].

Return type:

ndarray
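One plausible way to clean and rescale a raw matrix into [0, 1] is sketched below. The specific mappings are assumptions chosen for illustration, since the documentation does not give the formulas: cosine is shifted from [-1, 1], distances use 1 / (1 + d), dot products are min-max scaled, and non-finite entries become 0. The name clean_similarities is hypothetical.

```python
import numpy as np


def clean_similarities(similarities, metric):
    """Replace NaN/inf entries and map the values into [0, 1]."""
    s = np.asarray(similarities, dtype=float)
    bad = ~np.isfinite(s)
    s = np.where(bad, 0.0, s)  # temporary fill so the arithmetic stays finite
    if metric == "cosine":
        out = (s + 1.0) / 2.0  # [-1, 1] -> [0, 1]
    elif metric in ("euclidean", "manhattan", "chebyshev"):
        out = 1.0 / (1.0 + s)  # distance 0 -> similarity 1
    elif metric == "dot_product":
        finite = s[~bad]
        if finite.size and finite.max() > finite.min():
            out = (s - finite.min()) / (finite.max() - finite.min())
        else:
            out = np.zeros_like(s)
    else:
        raise ValueError(f"Unknown similarity metric: {metric!r}")
    out = np.where(bad, 0.0, out)  # invalid entries count as "no similarity"
    return np.clip(out, 0.0, 1.0)


raw = np.array([[1.0, -1.0, float("nan")], [0.0, float("inf"), 0.5]])
cleaned = clean_similarities(raw, "cosine")
```

Treating invalid entries as 0 is a conservative choice: a NaN can never push a phrase above a similarity threshold.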

Visualization Service

Visualization service for ForzaEmbed.

This module provides the VisualizationService class that handles t-SNE coordinate generation and caching for embedding visualizations.

Example

Generate t-SNE visualization data:

from src.services.visualization_service import VisualizationService

service = VisualizationService(db)
tsne_data = service.get_or_create_tsne_data(
    embeddings, "key", "file_id", similarities, 0.5
)

class src.services.visualization_service.VisualizationService(db)

Bases: object

Handle visualization tasks like t-SNE coordinate generation.

Manages the computation and caching of t-SNE coordinates for embedding visualizations.

db

The embedding database for caching t-SNE coordinates.

__init__(db)

Initialize the VisualizationService.

Parameters:

db (EmbeddingDatabase) – The embedding database for caching.

get_or_create_tsne_data(embeddings, tsne_key, file_id, similarities, threshold)

Compute or retrieve t-SNE coordinates for a given combination.

Checks the database cache for existing t-SNE coordinates. If not found, computes new coordinates using sklearn’s TSNE implementation.

Parameters:
  • embeddings (ndarray) – Embedding matrix of shape (n_samples, n_dims).

  • tsne_key (str) – Cache key for the t-SNE computation.

  • file_id (str) – Identifier for the file being visualized.

  • similarities (ndarray) – Similarity matrix for determining labels.

  • threshold (float) – Similarity threshold for labeling points.

Returns:

Dictionary containing t-SNE visualization data with keys:

  • ‘x’: List of x-coordinates.

  • ‘y’: List of y-coordinates.

  • ‘labels’: List of threshold-based labels.

  • ‘similarities’: List of similarity scores.

  • ‘title’: Visualization title.

  • ‘threshold’: The threshold value used.

Returns None if embeddings have one sample or fewer, or if an error occurs.
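A compute-or-retrieve step of this shape might look like the sketch below, which uses scikit-learn's TSNE. Several details are assumptions made for illustration: a dict replaces the database cache, the per-point score is the maximum similarity over themes, the label strings and title format are invented, and error handling is omitted. The name tsne_data_sketch is hypothetical.

```python
import numpy as np
from sklearn.manifold import TSNE


def tsne_data_sketch(cache, embeddings, tsne_key, file_id, similarities, threshold):
    """Return cached t-SNE data for (tsne_key, file_id), computing it if needed."""
    cache_key = (tsne_key, file_id)
    if cache_key in cache:
        return cache[cache_key]
    x = np.asarray(embeddings, dtype=float)
    n_samples = x.shape[0]
    if n_samples <= 1:
        return None  # t-SNE needs at least two points
    # scikit-learn requires perplexity to be smaller than n_samples.
    perplexity = min(30.0, float(n_samples - 1))
    coords = TSNE(n_components=2, perplexity=perplexity,
                  random_state=0).fit_transform(x)
    sims = np.asarray(similarities, dtype=float)
    # Assumption: rows are themes, columns are samples; keep the best theme score.
    scores = sims.max(axis=0) if sims.ndim == 2 else sims
    data = {
        "x": coords[:, 0].tolist(),
        "y": coords[:, 1].tolist(),
        "labels": ["above" if s >= threshold else "below" for s in scores],
        "similarities": scores.tolist(),
        "title": f"t-SNE ({file_id})",
        "threshold": threshold,
    }
    cache[cache_key] = data
    return data


rng = np.random.default_rng(0)
demo_cache = {}
demo_embeddings = rng.normal(size=(12, 5))
demo_similarities = rng.uniform(size=(3, 12))
result = tsne_data_sketch(demo_cache, demo_embeddings, "key", "file_id",
                          demo_similarities, 0.5)
```

A second call with the same tsne_key and file_id returns the stored dictionary without rerunning t-SNE, which is the point of caching an expensive projection.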