JaVAD Main Module

Classes

Processor

class javad.Processor(model_name: str = 'balanced', checkpoint: str | None = None, step: float | None = None, onset: float = 0.0, offset: float = 0.0, padding: tuple = (0.0, 0.0), min_duration: float = 0.3, min_silence: float = 0.3, batch_size: int = 32, num_workers: int = 0, threshold: float | None = None, device: device | str = device(type='cpu'))

get_min_input() → int

Get the minimum input duration in samples.

intervals(audio: ndarray | Tensor, step: float | None = None) → List[Tuple[float, float]]

Process audio to find voice activity intervals.

This method analyzes the audio signal to detect voice activity and returns a list of time intervals where speech is present. The process includes:

1. Getting voice activity predictions
2. Converting predictions to initial intervals
3. Filtering out intervals that are too short
4. Padding remaining intervals
5. Handling onset/offset boundaries
6. Merging intervals that are too close together

Parameters:

audio (Union[np.ndarray, torch.Tensor]) – Input audio signal
step (Union[float, None], optional) – Step size for processing audio. If None, uses default configuration. Defaults to None.

Returns:

List of intervals where voice activity is detected. Each interval is a tuple of (start_time, end_time) in seconds.

Return type:

List[Tuple[float, float]]

Note

The returned intervals are affected by several configuration parameters:

- min_duration: Minimum valid interval duration
- padding: Amount of padding to add to intervals
- onset: Start time to ignore
- offset: End time to ignore
- min_silence: Minimum silence duration between intervals
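The post-processing steps listed above (filter, pad, handle boundaries, merge) can be sketched in plain Python. This is an illustration only, not javad's implementation: `postprocess_intervals` and its `duration` argument are hypothetical, and treating onset/offset as clipping bounds is an assumption about how "time to ignore" is applied.

```python
from typing import List, Tuple

Interval = Tuple[float, float]


def postprocess_intervals(
    intervals: List[Interval],
    duration: float,
    min_duration: float = 0.3,
    padding: Tuple[float, float] = (0.0, 0.0),
    onset: float = 0.0,
    offset: float = 0.0,
    min_silence: float = 0.3,
) -> List[Interval]:
    """Hypothetical helper mirroring the documented steps; not part of javad."""
    # Drop intervals shorter than min_duration.
    kept = [(s, e) for s, e in intervals if e - s >= min_duration]
    # Pad each remaining interval on both sides.
    padded = [(s - padding[0], e + padding[1]) for s, e in kept]
    # Assumption: onset/offset clip intervals to [onset, duration - offset].
    lo, hi = onset, duration - offset
    clipped = [(max(s, lo), min(e, hi)) for s, e in padded]
    clipped = [(s, e) for s, e in clipped if e > s]
    # Merge neighbours separated by less than min_silence.
    merged: List[Interval] = []
    for s, e in clipped:
        if merged and s - merged[-1][1] < min_silence:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged


raw = [(0.0, 0.1), (1.0, 2.0), (2.2, 3.0), (5.0, 6.0)]
print(postprocess_intervals(raw, duration=6.0, padding=(0.1, 0.1)))
```

In this run the 0.1 s interval is dropped as too short, the two middle intervals touch after padding and are merged, and the last interval is clipped to the end of the audio.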

logits(audio: ndarray | Tensor, step: float | None = None) → Tensor

Process audio data to generate logit predictions using a trained model. This method converts audio input to a spectrogram, normalizes it, splits it into overlapping windows, and processes these windows through the model to generate predictions. The predictions from overlapping windows are averaged to produce the final output.

Parameters:

audio (Union[np.ndarray, torch.Tensor]) – Input audio data as either numpy array or PyTorch tensor.
step (Optional[float]) – Step size in seconds for sliding window. If None, uses the configured default step size. Defaults to None.

Returns:

Averaged model predictions across all windows, with a shape matching the temporal dimension of the input spectrogram.

Return type:

torch.Tensor

Raises:

Warning – If the standard deviation of the spectrogram is zero, indicating potential issues with the input audio.

Notes

The input audio is converted to a log-mel spectrogram before processing
The spectrogram is normalized using mean and standard deviation
Processing is done in batches to handle resources efficiently
Overlapping predictions are averaged to smooth transitions
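The averaging of overlapping windows described in the notes can be illustrated with plain lists. This is a sketch under assumptions, not the library's code: `average_overlapping` is a hypothetical name, and the step is expressed in frames rather than seconds to keep the example short.

```python
from typing import List


def average_overlapping(
    window_preds: List[List[float]], step: int, total: int
) -> List[float]:
    """Average per-frame predictions from windows that start every `step` frames.

    Hypothetical helper for illustration; not part of javad.
    """
    sums = [0.0] * total
    counts = [0] * total
    for i, preds in enumerate(window_preds):
        start = i * step
        for j, p in enumerate(preds):
            idx = start + j
            if idx < total:
                sums[idx] += p
                counts[idx] += 1
    # Frames not covered by any window fall back to 0.0.
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


# Two windows of 4 frames, shifted by 2 frames, covering 6 frames in total.
print(average_overlapping([[1.0] * 4, [3.0] * 4], step=2, total=6))
# → [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
```

Frames covered by both windows receive the mean of the two predictions, which is what smooths the transition between neighbouring windows.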

predict(audio: ndarray | Tensor, step: float | None = None) → Tensor

Predict voice activity from audio signal. Converts logit values to boolean predictions.

static predictions_to_intervals(bool_array: Tensor, fps: int) → List[Tuple[float, float]]

Converts a boolean tensor array of predictions into a list of time intervals. This function identifies contiguous sequences of True values in the boolean array and converts them into time intervals based on the given frames per second (fps).

Parameters:

bool_array (torch.Tensor) – A 1D boolean tensor where True values represent active segments
fps (int) – Frames per second, used to convert frame indices to time in seconds

Returns:

A list of tuples, each containing (start_time, end_time) in seconds for one detected interval.

Return type:

List[Tuple[float, float]]
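The conversion from a boolean frame array to time intervals can be sketched as follows. This uses a plain Python sequence instead of a torch.Tensor, and `bools_to_intervals` is a hypothetical name. Treating the end of an interval as the index of the first inactive frame is an assumption about the boundary convention.

```python
from typing import List, Optional, Sequence, Tuple


def bools_to_intervals(flags: Sequence[bool], fps: int) -> List[Tuple[float, float]]:
    """Turn runs of True frames into (start_time, end_time) pairs in seconds.

    Hypothetical helper for illustration; not part of javad.
    """
    intervals: List[Tuple[float, float]] = []
    start: Optional[int] = None
    for i, active in enumerate(flags):
        if active and start is None:
            start = i  # a run of True values begins here
        elif not active and start is not None:
            intervals.append((start / fps, i / fps))
            start = None
    if start is not None:
        # The final run extends to the end of the array.
        intervals.append((start / fps, len(flags) / fps))
    return intervals


print(bools_to_intervals([False, True, True, False, True], fps=2))
# → [(0.5, 1.5), (2.0, 2.5)]
```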

preload_mel_filters(n_mels: int) → Tensor

Load mel filter bank matrices for a given number of mel bins.

Pipeline

class javad.Pipeline(model_name: str = 'balanced', checkpoint: str | None = None, mode: str = 'gradual', threshold: float | None = None, device: device | str = device(type='cpu'))

detect(chunk: List | ndarray | Tensor, min_duration: float = 0.0) → bool

Detect speech presence in the provided audio chunk. This method analyzes an audio chunk to detect speech segments and determines if any speech segment exceeds the minimum duration threshold.

Parameters:

chunk (Union[List, np.ndarray, torch.Tensor]) – Audio data chunk to analyze.
min_duration (float, optional) – Minimum duration in seconds for a speech segment to be considered valid. Defaults to 0.0. If 0.0, uses the pipeline’s default minimum duration.

Returns:

True if speech segments longer than min_duration are detected, False otherwise.

Return type:

bool

Notes

The method maintains a detection_carry state variable to handle speech segments that span multiple chunks
Speech segments are identified by analyzing state changes in model predictions
Duration is calculated based on the configured frames per second (fps)
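The idea of carrying detection state across chunks can be illustrated with a small run-length tracker. This is a sketch under assumptions, not the library's logic: the class name, the `carry` attribute, and the strict "longer than" comparison are all hypothetical stand-ins for the detection_carry behaviour described above.

```python
from typing import Sequence


class RunLengthDetector:
    """Hypothetical streaming detector; not part of javad."""

    def __init__(self, fps: int) -> None:
        self.fps = fps
        self.carry = 0  # active frames still open at the end of the previous chunk

    def detect(self, flags: Sequence[bool], min_duration: float) -> bool:
        run = self.carry
        found = False
        for active in flags:
            if active:
                run += 1
                # Duration in seconds is the frame count divided by fps.
                if run / self.fps > min_duration:
                    found = True
            else:
                run = 0
        self.carry = run  # remember an unfinished segment for the next chunk
        return found


detector = RunLengthDetector(fps=10)
print(detector.detect([False, False, True, True, True], min_duration=0.5))  # → False
print(detector.detect([True, True, True, False], min_duration=0.5))         # → True
```

The second call returns True only because the three active frames at the end of the first chunk are counted together with the three at the start of the second.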

intervals(chunk: List | ndarray | Tensor) → List[Tuple] | Dict

Process the chunk of data and return intervals based on predictions. This method processes input data chunks and returns time intervals based on the prediction mode. For “instant” mode, it directly converts predictions from the latest chunk to intervals. For “gradual” mode, it maintains and updates intervals across all chunks.

Parameters:

chunk (Union[List, numpy.ndarray, torch.Tensor]) – The chunk of data to process.

Returns:

The intervals based on predictions.

Return type:

Union[List[Tuple], Dict]

logits(chunk: List | ndarray | Tensor) → Tensor | Dict

Pushes a chunk of audio data through the model for prediction. This method processes audio chunks for prediction by:

1. Converting input to torch tensor if needed
2. Padding the chunk if it’s not divisible by hop length
3. Managing a rolling audio buffer
4. Computing log mel spectrogram
5. Normalizing the spectrogram
6. Running prediction
7. Tracking and aggregating predictions across chunks

Parameters:

chunk (Union[List, np.ndarray, torch.Tensor]) – Audio chunk to process. Can be a list, numpy array or torch tensor. Length must not exceed model’s window_size.

Returns:

If mode is “instant”: tensor of predictions for the current chunk.
If mode is “gradual”: dict mapping chunk numbers to mean predictions across all passes that included that chunk.

Return type:

Union[torch.Tensor, Dict[int, torch.Tensor]]

Raises:

ValueError – If chunk length is larger than model window size, or if a non-final chunk length is not divisible by hop length.
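The length check and hop-length padding described above can be sketched with plain lists. This is an illustration only: `pad_to_hop` is a hypothetical helper, zero-padding at the end is an assumption, and the `hop_length` and `window_size` values in the usage line are made up rather than taken from any javad model.

```python
from typing import List, Sequence


def pad_to_hop(chunk: Sequence[float], hop_length: int, window_size: int) -> List[float]:
    """Validate a chunk and zero-pad it to a multiple of hop_length.

    Hypothetical helper for illustration; not part of javad.
    """
    if len(chunk) > window_size:
        raise ValueError(
            f"chunk length {len(chunk)} exceeds window size {window_size}"
        )
    padded = list(chunk)
    remainder = len(padded) % hop_length
    if remainder:
        # Assumption: pad with trailing zeros up to the next hop boundary.
        padded.extend([0.0] * (hop_length - remainder))
    return padded


# Made-up sizes: 250 samples padded up to the next multiple of 160.
print(len(pad_to_hop([0.1] * 250, hop_length=160, window_size=1000)))  # → 320
```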

static merge_intervals(intervals: List[Tuple]) → List

Merges adjacent intervals in a list of tuples. This function takes a list of intervals (start, end) and merges any overlapping intervals into a single interval. Two intervals are considered overlapping if the start of one interval is within 0.01 of the end of another interval.

Parameters:

intervals (List[Tuple]) – List of tuples where each tuple contains start and end points of an interval.

Returns:

A new list containing merged intervals with no overlaps.

Return type:

List
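The merging rule, where a start within 0.01 of the previous end counts as overlapping, can be sketched as below. `merge_adjacent` is a hypothetical re-implementation for illustration, not the library's method; sorting the input first is an assumption.

```python
from typing import List, Tuple


def merge_adjacent(
    intervals: List[Tuple[float, float]], tol: float = 0.01
) -> List[Tuple[float, float]]:
    """Merge intervals whose gap is at most `tol`.

    Hypothetical helper for illustration; not part of javad.
    """
    merged: List[Tuple[float, float]] = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= tol:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


print(merge_adjacent([(0.0, 1.0), (1.005, 2.0), (2.5, 3.0)]))
# → [(0.0, 2.0), (2.5, 3.0)]
```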

normalize_spectrogram(spectrogram: Tensor) → Tensor

Normalizes the spectrogram using running mean and standard deviation.

Parameters:

spectrogram (torch.Tensor) – Input spectrogram tensor to be normalized.

Returns:

Normalized spectrogram tensor with zero mean and unit variance. If standard deviation is 0, returns original spectrogram unchanged.

Return type:

torch.Tensor

predict(chunk: List | ndarray | Tensor) → Tensor | Dict

Predicts whether audio chunks contain speech based on model predictions.

Parameters:

chunk (Union[List, np.ndarray, torch.Tensor]) – Input audio chunk to process. Can be a list, numpy array or PyTorch tensor.

Returns:

If mode is “instant”: Returns boolean tensor where True indicates speech was detected (predictions above threshold)
If mode is “gradual”: Returns dictionary mapping chunk numbers to boolean tensors indicating speech detection for each chunk

Return type:

Union[torch.Tensor, Dict]

Raises:

ValueError – If input chunk has invalid format or dimensions

predictions_to_intervals(chunk_num: int, predictions: Tensor) → List[Tuple[float, float]]

Convert binary predictions tensor into a list of time intervals.

Parameters:

chunk_num (int) – Index of the current chunk being processed
predictions (torch.Tensor) – Binary tensor containing predictions (0s and 1s) indicating presence/absence of target signal

Returns:

List of time intervals (start_time, end_time) where target signal is present. Times are in seconds relative to start of recording.

Return type:

List[Tuple[float, float]]

preload_mel_filters(n_mels: int) → Tensor

Load mel filter bank matrices for a given number of mel bins.

push(chunk: List | ndarray | Tensor) → Tensor | Dict

Pushes a chunk of audio data through the model for prediction. This method processes audio chunks for prediction by:

1. Converting input to torch tensor if needed
2. Padding the chunk if it’s not divisible by hop length
3. Managing a rolling audio buffer
4. Computing log mel spectrogram
5. Normalizing the spectrogram
6. Running prediction
7. Tracking and aggregating predictions across chunks

Parameters:

chunk (Union[List, np.ndarray, torch.Tensor]) – Audio chunk to process. Can be a list, numpy array or torch tensor. Length must not exceed model’s window_size.

Returns:

If mode is “instant”: tensor of predictions for the current chunk.
If mode is “gradual”: dict mapping chunk numbers to mean predictions across all passes that included that chunk.

Return type:

Union[torch.Tensor, Dict[int, torch.Tensor]]

Raises:

ValueError – If chunk length is larger than model window size, or if a non-final chunk length is not divisible by hop length.

reset()

Reset the pipeline to initial state.

update(chunk: List | ndarray | Tensor) → Tensor | Dict

Pushes a chunk of audio data through the model for prediction. This method processes audio chunks for prediction by:

1. Converting input to torch tensor if needed
2. Padding the chunk if it’s not divisible by hop length
3. Managing a rolling audio buffer
4. Computing log mel spectrogram
5. Normalizing the spectrogram
6. Running prediction
7. Tracking and aggregating predictions across chunks

Parameters:

chunk (Union[List, np.ndarray, torch.Tensor]) – Audio chunk to process. Can be a list, numpy array or torch tensor. Length must not exceed model’s window_size.

Returns:

If mode is “instant”: tensor of predictions for the current chunk.
If mode is “gradual”: dict mapping chunk numbers to mean predictions across all passes that included that chunk.

Return type:

Union[torch.Tensor, Dict[int, torch.Tensor]]

Raises:

ValueError – If chunk length is larger than model window size, or if a non-final chunk length is not divisible by hop length.

update_stats(spectrogram: Tensor)

Update running statistics (mean and standard deviation) of the spectrogram data. This method uses Welford’s online algorithm to compute running statistics of streaming spectrogram data.

Parameters:

spectrogram (torch.Tensor) – Input spectrogram tensor of shape (frequency_bins, time_frames)

Returns:

A tuple containing:

- mean (float): Updated running mean of the spectrogram
- std (torch.Tensor): Updated running standard deviation of the spectrogram, normalized by total number of frames and frequency bins

Return type:

tuple

Notes

The method tracks the total number of frames processed using self.frames_tracker and updates statistics incrementally using Welford’s method for numerical stability.
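A running mean and standard deviation over streaming data can be sketched in pure Python. This is not javad's code: it uses the batched form of Welford's update (the pairwise combination due to Chan et al.) on flat lists of values, and `RunningStats` with its attributes is a hypothetical name. The `normalize` method mirrors the documented behaviour of returning the input unchanged when the standard deviation is 0.

```python
import math
from typing import List, Sequence, Tuple


class RunningStats:
    """Hypothetical running statistics tracker; not part of javad."""

    def __init__(self) -> None:
        self.count = 0   # total number of values seen so far
        self.mean = 0.0
        self.m2 = 0.0    # sum of squared deviations from the running mean

    @property
    def std(self) -> float:
        # Population standard deviation over everything seen so far.
        return math.sqrt(self.m2 / self.count) if self.count else 0.0

    def update(self, values: Sequence[float]) -> Tuple[float, float]:
        n_b = len(values)
        if n_b == 0:
            return self.mean, self.std
        mean_b = sum(values) / n_b
        m2_b = sum((v - mean_b) ** 2 for v in values)
        n_a = self.count
        total = n_a + n_b
        delta = mean_b - self.mean
        # Combine the statistics of the old data and the new batch.
        self.mean += delta * n_b / total
        self.m2 += m2_b + delta * delta * n_a * n_b / total
        self.count = total
        return self.mean, self.std

    def normalize(self, values: Sequence[float]) -> List[float]:
        if self.std == 0:
            return list(values)  # leave the data unchanged
        return [(v - self.mean) / self.std for v in values]


stats = RunningStats()
stats.update([1.0, 2.0, 3.0])
mean, std = stats.update([4.0, 5.0, 6.0, 7.0, 8.0])
print(mean, round(std, 4))  # → 4.5 2.2913
```

Updating in batches avoids storing earlier frames while still giving the same mean and standard deviation as a single pass over all the data.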

Functions

initialize

javad.initialize(name: str = 'balanced') → Module

Initializes a model with the given name.

Parameters:

name (str) – The name of the model to initialize. Defaults to “balanced”. Available options are “tiny”, “balanced”, and “precise”.

Returns:

The initialized model.

Return type:

torch.nn.Module

from_pretrained

javad.from_pretrained(name: str = 'balanced', checkpoint: str | None = None) → Module

Initializes and loads a pre-trained model.

Parameters:

name (str) – The name of the model to initialize and load. Defaults to “balanced”. Available options are “tiny”, “balanced”, and “precise”.
checkpoint (str, optional) – The path to a checkpoint file to load. Defaults to None. If not None, the model will be loaded from the checkpoint file.

Returns:

The initialized and loaded model.

Return type:

torch.nn.Module

Export Functions

intervals_to_csv

javad.intervals_to_csv(intervals: List[Tuple[float, float]], csv_filename: str, delimiter=',')

Convert a list of speech intervals to CSV format.

Parameters:

intervals (List[Tuple[float, float]]) – List of voice segments as (start_time, end_time) pairs in seconds.
csv_filename (str) – Path to output CSV file.
delimiter (str, optional) – Field delimiter. Defaults to ','.

Example

>>> intervals_to_csv([(0.5, 1.2), (1.8, 3.4)], "output.csv")

intervals_to_rttm

javad.intervals_to_rttm(intervals: List[Tuple[float, float]], rttm_filename: str)

Convert a list of speech intervals to RTTM format.

Parameters:

intervals (List[Tuple[float, float]]) – List of voice segments as (start_time, end_time) pairs in seconds.
rttm_filename (str) – Path to output RTTM file.

Returns:

RTTM formatted string.

Return type:

str

Example

>>> rttm = intervals_to_rttm([(0.5, 1.2), (1.8, 3.4)], "output.rttm")
>>> print(rttm)
SPEAKER audio 1 0.5 0.7 <NA> <NA> speaker <NA>
SPEAKER audio 1 1.8 1.6 <NA> <NA> speaker <NA>
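The line layout in the example output (start time followed by duration, not end time) can be reproduced with a short formatter. This is a sketch only: `format_rttm`, `file_id` and `speaker` are hypothetical names, the field layout simply mirrors the example above, and the fixed three-decimal formatting is an assumption chosen to avoid floating-point noise such as 3.4 - 1.8 printing as 1.5999999999999999.

```python
from typing import List, Tuple


def format_rttm(
    intervals: List[Tuple[float, float]],
    file_id: str = "audio",
    speaker: str = "speaker",
) -> str:
    """Build RTTM-style lines from (start, end) pairs.

    Hypothetical helper for illustration; not part of javad.
    """
    lines = []
    for start, end in intervals:
        duration = end - start  # RTTM stores onset and duration
        lines.append(
            f"SPEAKER {file_id} 1 {start:.3f} {duration:.3f} <NA> <NA> {speaker} <NA>"
        )
    return "\n".join(lines)


print(format_rttm([(0.5, 1.2), (1.8, 3.4)]))
```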

intervals_to_textgrid

javad.intervals_to_textgrid(intervals: List[Tuple[float, float]], textgrid_filename: str, duration: float, tier_name: str = 'speech') → None

Export voice activity intervals to Praat TextGrid format.

Parameters:

intervals (List[Tuple[float, float]]) – List of (start, end) times in seconds.
textgrid_filename (str) – Output TextGrid file path.
duration (float) – Total duration of audio in seconds.
tier_name (str, optional) – Name of the interval tier. Defaults to “speech”.
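One reason the duration argument is needed is that a Praat interval tier covers the whole recording, so the gaps between speech intervals have to be filled with unlabeled intervals. The sketch below shows that gap-filling step only. It is not javad's implementation: `fill_tier` and its `label` argument are hypothetical, and the input is assumed to be non-overlapping.

```python
from typing import List, Tuple


def fill_tier(
    intervals: List[Tuple[float, float]], duration: float, label: str = "speech"
) -> List[Tuple[float, float, str]]:
    """Return contiguous (start, end, text) entries spanning [0, duration].

    Hypothetical helper for illustration; not part of javad.
    """
    tier: List[Tuple[float, float, str]] = []
    cursor = 0.0
    for start, end in sorted(intervals):
        if start > cursor:
            tier.append((cursor, start, ""))  # silence before this interval
        tier.append((start, end, label))
        cursor = end
    if cursor < duration:
        tier.append((cursor, duration, ""))  # trailing silence
    return tier


for entry in fill_tier([(0.5, 1.2), (1.8, 3.4)], duration=4.0):
    print(entry)
```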