Statistical Analysis

single-algebra provides statistical analysis tools for high-dimensional data, focusing on hypothesis testing, multiple testing correction, and effect size calculation. These tools help identify significant patterns and relationships in biological and other scientific data.

Hypothesis Testing

The library supports various statistical tests for comparing groups within matrix data:

Parametric Tests:
- Student's t-test (equal variance)
- Welch's t-test (unequal variance)
- Z-tests for proportions
Non-parametric Tests:
- Mann-Whitney U test (Wilcoxon rank-sum)
- Discrete distribution tests

Each test can be configured with different alternative hypotheses:

Two-sided tests (default)
One-sided tests (less than, greater than)
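
As an illustration of what the unequal-variance test computes, the following is a minimal, standard-library-only sketch of Welch's t statistic and the Welch-Satterthwaite degrees of freedom. The function names `mean_var` and `welch_t` are assumptions made for this example and are not single-algebra's API. Turning the statistic into a p-value additionally requires the Student's t CDF, which the standard library does not provide, so that step is left out here.

```rust
/// Sample mean and unbiased sample variance (n - 1 denominator).
/// Hypothetical helper for this sketch, not part of single-algebra.
fn mean_var(x: &[f64]) -> (f64, f64) {
    let n = x.len() as f64;
    let mean = x.iter().sum::<f64>() / n;
    let var = x.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / (n - 1.0);
    (mean, var)
}

/// Welch's t statistic and Welch-Satterthwaite degrees of freedom.
/// Returns None if either group has fewer than two observations
/// or if both groups have zero variance.
fn welch_t(a: &[f64], b: &[f64]) -> Option<(f64, f64)> {
    if a.len() < 2 || b.len() < 2 {
        return None;
    }
    let (mean_a, var_a) = mean_var(a);
    let (mean_b, var_b) = mean_var(b);
    let (n_a, n_b) = (a.len() as f64, b.len() as f64);

    // Per-group contributions to the squared standard error
    let se_a = var_a / n_a;
    let se_b = var_b / n_b;
    let se2 = se_a + se_b;
    if se2 == 0.0 {
        return None;
    }

    let t = (mean_a - mean_b) / se2.sqrt();
    let df = se2 * se2 / (se_a * se_a / (n_a - 1.0) + se_b * se_b / (n_b - 1.0));
    Some((t, df))
}
```

The sign of `t` follows the order of the arguments: a negative value means the first group has the smaller mean. A two-sided p-value uses `|t|`, while the one-sided alternatives use the lower or upper tail of the t distribution with `df` degrees of freedom.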

Multiple Testing Correction

When performing many statistical tests simultaneously (as in genomics), correction for multiple testing is essential. single-algebra implements several correction methods:

Bonferroni Correction: Most conservative approach, controls family-wise error rate (FWER)
Benjamini-Hochberg (BH): Controls false discovery rate (FDR), more powerful than Bonferroni
Benjamini-Yekutieli (BY): More conservative variant of BH that controls FDR under arbitrary dependence between tests
Holm-Bonferroni: Step-down method that controls FWER but is more powerful than standard Bonferroni
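Hochberg: Step-up method that controls FWER; at least as powerful as Holm-Bonferroni, but valid only for independent or positively dependent tests
Storey's q-value: Adaptive approach that estimates the proportion of true null hypotheses and uses it to produce less conservative FDR estimates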
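
To make the difference between FWER and FDR control concrete, here is a minimal sketch of the Bonferroni and Benjamini-Hochberg adjustments using only the standard library. The function names `bonferroni` and `benjamini_hochberg` are assumptions for this example, not the crate's own functions.

```rust
/// Bonferroni adjustment: multiply each p-value by the number of tests
/// and cap the result at 1. Hypothetical helper for illustration.
fn bonferroni(p: &[f64]) -> Vec<f64> {
    let m = p.len() as f64;
    p.iter().map(|&v| (v * m).min(1.0)).collect()
}

/// Benjamini-Hochberg adjusted p-values, returned in the input order.
/// Hypothetical helper for illustration.
fn benjamini_hochberg(p: &[f64]) -> Vec<f64> {
    let m = p.len();
    // Indices sorted by ascending p-value
    let mut order: Vec<usize> = (0..m).collect();
    order.sort_by(|&i, &j| p[i].total_cmp(&p[j]));

    let mut adjusted = vec![0.0; m];
    let mut running_min = 1.0_f64;
    // Walk from the largest p-value to the smallest, scaling by m / rank
    // and taking a running minimum so adjusted values stay monotone.
    for (rank0, &idx) in order.iter().enumerate().rev() {
        let scaled = p[idx] * m as f64 / (rank0 + 1) as f64;
        running_min = running_min.min(scaled);
        adjusted[idx] = running_min;
    }
    adjusted
}
```

For the p-values `[0.01, 0.04, 0.03, 0.005]`, Bonferroni gives `[0.04, 0.16, 0.12, 0.02]`, while BH gives approximately `[0.02, 0.04, 0.04, 0.02]`. At a 0.05 threshold, BH keeps all four features and Bonferroni keeps only two, which is the power difference described above.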

Effect Size Calculation

Beyond p-values, the library provides effect size measures to quantify the magnitude of differences:

Log2 Fold Change: Common in genomics for measuring expression differences
Cohen's d: Standardized mean difference
Hedges' g: Bias-corrected version of Cohen's d for small samples
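
The three measures can be sketched in a few lines of standard-library Rust. All function names here are assumptions for this example. The `pseudocount` parameter (added to both means to avoid dividing by zero) and the common approximation `1 - 3 / (4(n1 + n2) - 9)` for the small-sample correction are choices made for this sketch and may differ from what single-algebra does.

```rust
/// Arithmetic mean. Hypothetical helper for this sketch.
fn mean(x: &[f64]) -> f64 {
    x.iter().sum::<f64>() / x.len() as f64
}

/// Unbiased sample variance (n - 1 denominator).
fn sample_var(x: &[f64]) -> f64 {
    let m = mean(x);
    x.iter().map(|v| (v - m).powi(2)).sum::<f64>() / (x.len() as f64 - 1.0)
}

/// Log2 fold change of group `a` relative to group `b`.
/// `pseudocount` is an assumption of this sketch.
fn log2_fold_change(a: &[f64], b: &[f64], pseudocount: f64) -> f64 {
    ((mean(a) + pseudocount) / (mean(b) + pseudocount)).log2()
}

/// Cohen's d: difference in means divided by the pooled standard deviation.
/// Requires at least two observations per group and nonzero pooled variance.
fn cohens_d(a: &[f64], b: &[f64]) -> f64 {
    let (n_a, n_b) = (a.len() as f64, b.len() as f64);
    let pooled_var =
        ((n_a - 1.0) * sample_var(a) + (n_b - 1.0) * sample_var(b)) / (n_a + n_b - 2.0);
    (mean(a) - mean(b)) / pooled_var.sqrt()
}

/// Hedges' g: Cohen's d scaled by an approximate small-sample correction.
fn hedges_g(a: &[f64], b: &[f64]) -> f64 {
    let n = (a.len() + b.len()) as f64;
    let correction = 1.0 - 3.0 / (4.0 * n - 9.0);
    cohens_d(a, b) * correction
}
```

With `a = [2, 4, 6]` and `b = [1, 2, 3]` and a pseudocount of zero, the log2 fold change is 1 (the mean doubles), Cohen's d is about 1.265, and the correction factor of 0.8 shrinks it to a Hedges' g of about 1.012.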

Matrix-based Statistical Operations

All statistical functions are designed to work efficiently with matrix data:

Row-wise Testing: Apply tests to each row (e.g., genes across different conditions)
Column-wise Testing: Apply tests to each column (e.g., samples across different features)
Batch-aware Analysis: Group-based statistical calculations
Masked Analysis: Apply tests selectively to subsets of data
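
Row-wise testing can be pictured as follows: for every row, the values are split into two groups by column index and a two-sample statistic is applied to the pair. The sketch below uses a plain `Vec<Vec<f64>>` as a stand-in for a matrix type, and the names `row_wise` and `mean_diff` are assumptions for this example rather than the library's interface.

```rust
/// Apply a two-sample statistic to every row of a dense, row-major matrix.
/// `group_a` and `group_b` hold the column indices of the two groups.
/// Hypothetical helper; panics if a column index is out of range.
fn row_wise<F>(matrix: &[Vec<f64>], group_a: &[usize], group_b: &[usize], stat: F) -> Vec<f64>
where
    F: Fn(&[f64], &[f64]) -> f64,
{
    matrix
        .iter()
        .map(|row| {
            // Gather the values of each group for this row
            let a: Vec<f64> = group_a.iter().map(|&j| row[j]).collect();
            let b: Vec<f64> = group_b.iter().map(|&j| row[j]).collect();
            stat(&a, &b)
        })
        .collect()
}

/// Example statistic: difference of group means.
fn mean_diff(a: &[f64], b: &[f64]) -> f64 {
    let mean = |x: &[f64]| x.iter().sum::<f64>() / x.len() as f64;
    mean(a) - mean(b)
}
```

Calling `row_wise(&matrix, &[0, 1], &[2, 3], mean_diff)` on a matrix with rows `[1, 2, 3, 4]` and `[10, 10, 0, 0]` yields `[-2.0, 10.0]`, one value per row. Column-wise testing follows the same pattern with the roles of rows and columns swapped, and a masked analysis would simply restrict the index lists to the selected subset.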

Result Types

Statistical results are returned in structured formats:

TestResult: Comprehensive result from a single statistical test, including:
- Test statistic
- P-value
- Effect size
- Confidence interval
- Degrees of freedom
- Standard error
- Additional test-specific metadata
MultipleTestResults: Collection of results from multiple tests, including:
- Raw p-values
- Adjusted p-values
- Effect sizes
- Test statistics
- Functions to identify significant features
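
To show how such a collection might be used, here is a purely hypothetical sketch of a multi-test result container with a helper that selects significant features. The type name `MultiResultSketch`, its field names, and the method `significant_indices` are all invented for this example. They mirror the list above but are not the definitions of `TestResult` or `MultipleTestResults` in single-algebra.

```rust
/// Hypothetical stand-in for a collection of per-feature test results.
/// Field names are illustrative only and do not reflect the crate's types.
#[derive(Debug, Clone)]
struct MultiResultSketch {
    p_values: Vec<f64>,
    adjusted_p_values: Vec<f64>,
    effect_sizes: Vec<f64>,
    statistics: Vec<f64>,
}

impl MultiResultSketch {
    /// Indices of features whose adjusted p-value is strictly below `alpha`.
    fn significant_indices(&self, alpha: f64) -> Vec<usize> {
        self.adjusted_p_values
            .iter()
            .enumerate()
            .filter_map(|(i, &p)| if p < alpha { Some(i) } else { None })
            .collect()
    }
}
```

Filtering on the adjusted rather than the raw p-values is what ties the result type to the correction step: the same container can be thresholded at different `alpha` levels without rerunning the tests.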

Differential Expression Analysis

The library also offers a high-level API for differential expression analysis:

MatrixStatTests trait: Provides methods for t-tests, Mann-Whitney tests, and comprehensive differential expression analysis
Integrated Workflow: Combines test execution, multiple testing correction, and effect size calculation

Performance Considerations

Parallel Implementation: Statistical tests are parallelized using Rayon for efficient processing of large matrices
Sparse Matrix Optimization: Tests account for the sparsity of data common in biological datasets
Memory Efficiency: Algorithms designed to minimize temporary allocations
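
One way sparsity can be exploited, shown here only as an illustration and not as a description of the crate's internals, is to compute a row's mean and variance from its stored nonzero values plus the count of implicit zeros, without ever materializing the dense row. The function name `sparse_mean_var` is an assumption for this sketch.

```rust
/// Mean and unbiased variance of a length-`n` vector, given only its
/// stored nonzero entries. Hypothetical helper for illustration.
/// Returns None if `n < 2` or more entries are stored than `n` allows.
fn sparse_mean_var(nonzeros: &[f64], n: usize) -> Option<(f64, f64)> {
    if n < 2 || nonzeros.len() > n {
        return None;
    }
    let n_f = n as f64;
    // Implicit zeros add nothing to the sum
    let mean = nonzeros.iter().sum::<f64>() / n_f;

    // Squared deviations of the stored entries
    let ss_nonzero: f64 = nonzeros.iter().map(|v| (v - mean).powi(2)).sum();
    // Each implicit zero deviates from the mean by exactly `mean`
    let n_zero = (n - nonzeros.len()) as f64;
    let var = (ss_nonzero + n_zero * mean * mean) / (n_f - 1.0);
    Some((mean, var))
}
```

For stored values `[2, 4]` in a row of length 4 (dense form `[2, 4, 0, 0]`), this gives a mean of 1.5 and a variance of 11/3, matching the dense computation while touching only two values. The work scales with the number of nonzeros rather than the row length, which matters when most entries are zero.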

The statistical functionality in single-algebra is modular: users can choose the tests and correction methods that suit their analysis requirements and data characteristics.
