Statistical Analysis

single-algebra provides statistical analysis tools for high-dimensional data, with a focus on hypothesis testing, multiple testing correction, and effect size calculation. These tools help identify significant patterns and relationships in biological and other scientific data.

Hypothesis Testing

The library supports various statistical tests for comparing groups within matrix data:

  • Parametric Tests:

    • Student's t-test (equal variance)
    • Welch's t-test (unequal variance)
    • Z-tests for proportions
  • Non-parametric Tests:

    • Mann-Whitney U test (Wilcoxon rank-sum)
    • Discrete distribution tests
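
To make the difference between the equal-variance and unequal-variance t-tests concrete, here is a minimal sketch of the Welch statistic and its Welch-Satterthwaite degrees of freedom. It is written against plain slices and is not the single-algebra API; the function names `mean_var` and `welch_t` are hypothetical. Converting the statistic to a p-value would additionally need the Student t distribution, which is omitted here.

```rust
/// Sample mean and unbiased sample variance (n - 1 denominator).
/// Hypothetical helper, not part of single-algebra.
fn mean_var(x: &[f64]) -> (f64, f64) {
    let n = x.len() as f64;
    let mean = x.iter().sum::<f64>() / n;
    let var = x.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / (n - 1.0);
    (mean, var)
}

/// Welch's t statistic and Welch-Satterthwaite degrees of freedom.
/// Returns None if a group has fewer than two observations or if
/// both groups have zero variance (the statistic is undefined).
fn welch_t(a: &[f64], b: &[f64]) -> Option<(f64, f64)> {
    if a.len() < 2 || b.len() < 2 {
        return None;
    }
    let (mean_a, var_a) = mean_var(a);
    let (mean_b, var_b) = mean_var(b);
    let (n_a, n_b) = (a.len() as f64, b.len() as f64);

    // Per-group squared standard errors; unlike Student's t-test,
    // the variances are not pooled.
    let (se2_a, se2_b) = (var_a / n_a, var_b / n_b);
    let se2 = se2_a + se2_b;
    if se2 == 0.0 {
        return None;
    }

    let t = (mean_a - mean_b) / se2.sqrt();
    let df = se2 * se2 / (se2_a * se2_a / (n_a - 1.0) + se2_b * se2_b / (n_b - 1.0));
    Some((t, df))
}
```

With equal group sizes and equal variances the degrees of freedom approach those of Student's t-test (n_a + n_b - 2); as the variances diverge, the Welch degrees of freedom shrink.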

Each test can be configured with different alternative hypotheses:

  • Two-sided tests (default)
  • One-sided tests (less than, greater than)
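
The effect of the alternative hypothesis can be illustrated with a z-test for two proportions. The sketch below is an illustration under stated assumptions, not the library's implementation: the `Alternative` enum, `two_proportion_z`, and `p_value_from_z` are hypothetical names, and the normal CDF uses the Abramowitz-Stegun polynomial approximation of erf (absolute error around 1.5e-7) because Rust's standard library has no erf.

```rust
/// Hypothetical enum mirroring the three alternatives listed above.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Alternative {
    TwoSided,
    Less,
    Greater,
}

/// Abramowitz-Stegun 7.1.26 approximation of the error function.
fn erf_approx(x: f64) -> f64 {
    let sign = if x < 0.0 { -1.0 } else { 1.0 };
    let x = x.abs();
    let t = 1.0 / (1.0 + 0.3275911 * x);
    let poly = t
        * (0.254829592
            + t * (-0.284496736
                + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
    sign * (1.0 - poly * (-x * x).exp())
}

/// Standard normal cumulative distribution function.
fn normal_cdf(z: f64) -> f64 {
    0.5 * (1.0 + erf_approx(z / std::f64::consts::SQRT_2))
}

/// Pooled two-proportion z statistic for x1 successes out of n1
/// versus x2 successes out of n2. Yields NaN if the pooled
/// proportion is exactly 0 or 1.
fn two_proportion_z(x1: f64, n1: f64, x2: f64, n2: f64) -> f64 {
    let pooled = (x1 + x2) / (n1 + n2);
    let se = (pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2)).sqrt();
    (x1 / n1 - x2 / n2) / se
}

/// Convert a z statistic to a p-value under the chosen alternative.
fn p_value_from_z(z: f64, alternative: Alternative) -> f64 {
    match alternative {
        Alternative::Less => normal_cdf(z),
        Alternative::Greater => 1.0 - normal_cdf(z),
        Alternative::TwoSided => 2.0 * (1.0 - normal_cdf(z.abs())),
    }
}
```

The same statistic gives different p-values depending on the alternative: a one-sided test in the direction of the observed effect yields half the two-sided p-value.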

Multiple Testing Correction

When performing many statistical tests simultaneously (as in genomics), correction for multiple testing is essential. single-algebra implements several correction methods:

  • Bonferroni Correction: Controls the family-wise error rate (FWER); the most conservative of the methods listed here
  • Benjamini-Hochberg (BH): Controls the false discovery rate (FDR) and is typically more powerful than Bonferroni
  • Benjamini-Yekutieli (BY): Conservative variant of BH that controls FDR under arbitrary dependence between tests
  • Holm-Bonferroni: Step-down method that controls FWER and is uniformly more powerful than standard Bonferroni
  • Hochberg: Step-up method for controlling FWER; more powerful than Holm-Bonferroni but assumes independent or positively dependent tests
  • Storey's q-value: Adaptive FDR approach that estimates the proportion of true null hypotheses
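
To show how these corrections differ in practice, the following sketch computes adjusted p-values for three of the methods above (Bonferroni, Holm-Bonferroni, and Benjamini-Hochberg) on plain slices. The function names are hypothetical and this is not single-algebra's implementation; it only illustrates the textbook definitions.

```rust
/// Indices that sort the p-values in ascending order.
fn ascending_order(p: &[f64]) -> Vec<usize> {
    let mut order: Vec<usize> = (0..p.len()).collect();
    order.sort_by(|&i, &j| p[i].total_cmp(&p[j]));
    order
}

/// Bonferroni: multiply every p-value by the number of tests, cap at 1.
fn bonferroni(p: &[f64]) -> Vec<f64> {
    let m = p.len() as f64;
    p.iter().map(|&x| (x * m).min(1.0)).collect()
}

/// Holm-Bonferroni (step-down): the k-th smallest p-value (1-based)
/// is scaled by (m - k + 1), then a running maximum enforces monotonicity.
fn holm(p: &[f64]) -> Vec<f64> {
    let m = p.len();
    let mut adjusted = vec![0.0; m];
    let mut running_max = 0.0_f64;
    for (rank, &idx) in ascending_order(p).iter().enumerate() {
        let scaled = (p[idx] * (m - rank) as f64).min(1.0);
        running_max = running_max.max(scaled);
        adjusted[idx] = running_max;
    }
    adjusted
}

/// Benjamini-Hochberg (step-up): the k-th smallest p-value is scaled
/// by m / k, then a running minimum from the largest p-value downward
/// enforces monotonicity.
fn benjamini_hochberg(p: &[f64]) -> Vec<f64> {
    let m = p.len();
    let order = ascending_order(p);
    let mut adjusted = vec![0.0; m];
    let mut running_min = 1.0_f64;
    for rank in (0..m).rev() {
        let idx = order[rank];
        let scaled = p[idx] * m as f64 / (rank as f64 + 1.0);
        running_min = running_min.min(scaled);
        adjusted[idx] = running_min;
    }
    adjusted
}
```

For the raw p-values `[0.01, 0.04, 0.03, 0.005]`, Bonferroni gives `[0.04, 0.16, 0.12, 0.02]`, Holm gives `[0.03, 0.06, 0.06, 0.02]`, and Benjamini-Hochberg gives `[0.02, 0.04, 0.04, 0.02]`, which shows the ordering from most to least conservative.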

Effect Size Calculation

Beyond p-values, the library provides effect size measures to quantify the magnitude of differences:

  • Log2 Fold Change: Common in genomics for measuring expression differences
  • Cohen's d: Standardized mean difference
  • Hedges' g: Bias-corrected effect size for small samples
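
The three measures can be written down in a few lines. The sketch below uses hypothetical function names and two assumptions that the text above does not specify: the fold change is computed from group means with an optional pseudocount, and the small-sample correction uses the common approximation J = 1 - 3 / (4(n_a + n_b) - 9). It is not the library's implementation.

```rust
fn mean(x: &[f64]) -> f64 {
    x.iter().sum::<f64>() / x.len() as f64
}

/// Unbiased sample variance (n - 1 denominator).
fn sample_var(x: &[f64]) -> f64 {
    let m = mean(x);
    x.iter().map(|v| (v - m).powi(2)).sum::<f64>() / (x.len() as f64 - 1.0)
}

/// Log2 fold change of group a relative to group b.
/// The pseudocount (an assumption of this sketch) avoids taking the
/// logarithm of zero when a group mean is zero.
fn log2_fold_change(a: &[f64], b: &[f64], pseudocount: f64) -> f64 {
    ((mean(a) + pseudocount) / (mean(b) + pseudocount)).log2()
}

/// Cohen's d: mean difference divided by the pooled standard deviation.
fn cohens_d(a: &[f64], b: &[f64]) -> f64 {
    let (n_a, n_b) = (a.len() as f64, b.len() as f64);
    let pooled_var =
        ((n_a - 1.0) * sample_var(a) + (n_b - 1.0) * sample_var(b)) / (n_a + n_b - 2.0);
    (mean(a) - mean(b)) / pooled_var.sqrt()
}

/// Hedges' g: Cohen's d shrunk by an approximate small-sample correction.
fn hedges_g(a: &[f64], b: &[f64]) -> f64 {
    let n = (a.len() + b.len()) as f64;
    cohens_d(a, b) * (1.0 - 3.0 / (4.0 * n - 9.0))
}
```

For groups `[2, 4, 6, 8, 10]` and `[1, 2, 3, 4, 5]`, the log2 fold change without a pseudocount is 1.0, Cohen's d is 1.2, and Hedges' g is about 1.084; the correction matters most when the groups are small.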

Matrix-based Statistical Operations

All statistical functions are designed to work efficiently with matrix data:

  • Row-wise Testing: Apply tests to each row (e.g., genes across different conditions)
  • Column-wise Testing: Apply tests to each column (e.g., samples across different features)
  • Batch-aware Analysis: Group-based statistical calculations
  • Masked Analysis: Apply tests selectively to subsets of data
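
As an illustration of row-wise testing with a group mask, the sketch below applies an arbitrary two-group statistic to every row of a dense matrix. The representation (`Vec<Vec<f64>>` with rows as features and a boolean mask over the columns) and the names `split_by_group`, `row_wise`, and `mean_difference` are assumptions made for this example; the library's own matrix types and method signatures may differ.

```rust
/// Split one row into two groups according to a boolean column mask.
fn split_by_group(row: &[f64], in_group_a: &[bool]) -> (Vec<f64>, Vec<f64>) {
    let mut a = Vec::new();
    let mut b = Vec::new();
    for (&value, &is_a) in row.iter().zip(in_group_a) {
        if is_a {
            a.push(value);
        } else {
            b.push(value);
        }
    }
    (a, b)
}

/// Apply a two-group statistic to every row (e.g. every gene).
/// Column-wise testing can be obtained by transposing the matrix first.
fn row_wise<F>(matrix: &[Vec<f64>], in_group_a: &[bool], stat: F) -> Vec<f64>
where
    F: Fn(&[f64], &[f64]) -> f64,
{
    matrix
        .iter()
        .map(|row| {
            let (a, b) = split_by_group(row, in_group_a);
            stat(&a, &b)
        })
        .collect()
}

/// Example statistic: difference of group means.
fn mean_difference(a: &[f64], b: &[f64]) -> f64 {
    let mean = |x: &[f64]| x.iter().sum::<f64>() / x.len() as f64;
    mean(a) - mean(b)
}
```

Passing the statistic as a closure keeps the traversal independent of the test being run, so a t statistic or rank-based statistic could be substituted for `mean_difference` without changing `row_wise`.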

Result Types

Statistical results are returned in structured formats:

  • TestResult: Comprehensive result from a single statistical test, including:

    • Test statistic
    • P-value
    • Effect size
    • Confidence interval
    • Degrees of freedom
    • Standard error
    • Additional test-specific metadata
  • MultipleTestResults: Collection of results from multiple tests, including:

    • Raw p-values
    • Adjusted p-values
    • Effect sizes
    • Test statistics
    • Functions to identify significant features
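
To make the shape of these results easier to picture, here is a hypothetical sketch of structs carrying the fields listed above. The struct names carry a `Sketch` suffix to make clear they are not the library's definitions; the field names, field types, and the `significant_indices` method are assumptions, and the test-specific metadata is left out.

```rust
/// Hypothetical stand-in for a single-test result.
#[allow(dead_code)]
#[derive(Debug, Clone)]
struct TestResultSketch {
    statistic: f64,
    p_value: f64,
    effect_size: Option<f64>,
    confidence_interval: Option<(f64, f64)>,
    degrees_of_freedom: Option<f64>,
    standard_error: Option<f64>,
}

/// Hypothetical stand-in for a collection of per-feature results.
/// All vectors are assumed to be indexed by feature.
#[allow(dead_code)]
#[derive(Debug, Clone)]
struct MultipleTestResultsSketch {
    statistics: Vec<f64>,
    p_values: Vec<f64>,
    adjusted_p_values: Vec<f64>,
    effect_sizes: Vec<f64>,
}

impl MultipleTestResultsSketch {
    /// Indices of features whose adjusted p-value is at most `alpha`.
    fn significant_indices(&self, alpha: f64) -> Vec<usize> {
        self.adjusted_p_values
            .iter()
            .enumerate()
            .filter_map(|(i, &p)| if p <= alpha { Some(i) } else { None })
            .collect()
    }
}
```

Filtering on the adjusted rather than the raw p-values is what ties the significance call to the chosen multiple testing correction.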

Differential Expression Analysis

A high-level API for conducting differential expression analysis:

  • MatrixStatTests trait: Provides methods for t-tests, Mann-Whitney tests, and comprehensive differential expression analysis
  • Integrated Workflow: Combines test execution, multiple testing correction, and effect size calculation

Performance Considerations

  • Parallel Implementation: Statistical tests are parallelized using Rayon for efficient processing of large matrices
  • Sparse Matrix Optimization: Tests account for the sparsity of data common in biological datasets
  • Memory Efficiency: Algorithms designed to minimize temporary allocations
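
The idea behind parallel row-wise processing can be sketched without external crates. The example below deliberately swaps Rayon for `std::thread::scope` so that it stays dependency-free; it is not how the library is implemented, and `par_row_map` is a hypothetical name. Each thread receives a contiguous chunk of rows and writes into a preallocated output slice, so no per-row result vectors are allocated.

```rust
use std::thread;

/// Apply a per-row statistic in parallel using scoped threads.
/// Rows are split into contiguous chunks, one chunk per thread.
fn par_row_map<F>(matrix: &[Vec<f64>], n_threads: usize, stat: F) -> Vec<f64>
where
    F: Fn(&[f64]) -> f64 + Sync,
{
    // Single output allocation, filled in place by the workers.
    let mut out = vec![0.0; matrix.len()];
    if matrix.is_empty() {
        return out;
    }
    let n = n_threads.max(1);
    let chunk = (matrix.len() + n - 1) / n; // ceiling division, at least 1
    let stat = &stat;

    thread::scope(|s| {
        // Matching chunks of input rows and output slots never overlap,
        // so each thread can hold an exclusive borrow of its slots.
        for (rows, slots) in matrix.chunks(chunk).zip(out.chunks_mut(chunk)) {
            s.spawn(move || {
                for (row, slot) in rows.iter().zip(slots.iter_mut()) {
                    *slot = stat(row.as_slice());
                }
            });
        }
    });
    out
}
```

Because every row is processed independently, the result does not depend on the number of threads; with Rayon the same pattern is usually expressed with parallel iterators instead of manual chunking.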

The statistical functionality in single-algebra is modular: users select the tests and correction methods that suit their analysis requirements and data characteristics.