hic3defdr.analysis.analysis module

class hic3defdr.analysis.analysis.AnalyzingHiC3DeFDR

Bases: object

Mixin class containing analysis functions for HiC3DeFDR.

bh()

Applies BH-FDR control to p-values across all chromosomes to obtain q-values.

Should only be run after p-values have been computed for all chromosomes.
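
As a rough illustration of why every chromosome must have p-values first, the sketch below pools p-values and applies the Benjamini-Hochberg adjustment to the pooled list. It is a minimal pure-Python version of the standard procedure, not the code used by bh(); the function name bh_qvalues and the example p-values are hypothetical.

```python
def bh_qvalues(pvalues):
    """Return Benjamini-Hochberg q-values in the same order as pvalues."""
    n = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that each q-value is the minimum
    # of p * n / rank over its own rank and all larger ranks.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        qvalues[i] = running_min
    return qvalues


# Hypothetical p-values from two chromosomes, pooled before adjustment.
pvalues_by_chrom = {"chr1": [0.01, 0.04], "chr2": [0.03, 0.005]}
pooled = [p for chrom in sorted(pvalues_by_chrom) for p in pvalues_by_chrom[chrom]]
qvalues = bh_qvalues(pooled)
```

Because the adjustment depends on the total number of tests and on each p-value's rank within the pooled list, adjusting one chromosome at a time would generally give different q-values.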

classify(chrom=None, fdr=0.05, cluster_size=3, n_threads=-1)

Classifies significantly differential pixels according to which condition they are strongest in.

Parameters:
  • chrom (str) – The chromosome to classify significantly differential pixels on. Pass None to run for all chromosomes in series.
  • fdr (float or list of float) – The FDR threshold used to identify clusters of significantly differential pixels via self.threshold(). Pass a list to do a sweep in series.
  • cluster_size (int or list of int) – The cluster size threshold used to identify clusters of significantly differential pixels via self.threshold(). Pass a list to do a sweep in series.
  • n_threads (int) – The number of threads (technically GIL-avoiding child processes) to use to process multiple chromosomes in parallel. Pass -1 to use as many threads as there are CPUs. Pass 0 to process the chromosomes serially.
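
The idea of classifying a pixel by the condition it is strongest in can be sketched as follows. This is only an illustration under assumptions: the function classify_pixels, the condition names, and the use of a per-condition mean value as the measure of strength are all hypothetical and are not taken from the library.

```python
def classify_pixels(significant_pixels, condition_means):
    """Group pixels by the condition whose mean value is largest there.

    condition_means maps a condition name to a dict of pixel -> mean value.
    Ties go to the condition that appears first in condition_means.
    """
    classes = {cond: [] for cond in condition_means}
    for pixel in significant_pixels:
        strongest = max(condition_means,
                        key=lambda cond: condition_means[cond][pixel])
        classes[strongest].append(pixel)
    return classes


# Hypothetical significant pixels, given as (row, col), and per-condition means.
pixels = [(10, 12), (10, 13), (40, 45)]
means = {
    "ES": {(10, 12): 5.0, (10, 13): 6.0, (40, 45): 1.0},
    "NPC": {(10, 12): 2.0, (10, 13): 7.5, (40, 45): 3.0},
}
classes = classify_pixels(pixels, means)
```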

collect(fdr=0.05, cluster_size=3, n_threads=-1)

Collects information on thresholded and classified differential interactions into a single TSV output file.

Parameters:
  • fdr (float or list of float) – The FDR threshold used to identify clusters of significantly differential pixels via self.threshold(). Pass a list to do a sweep in series.
  • cluster_size (int or list of int) – The cluster size threshold used to identify clusters of significantly differential pixels via self.threshold(). Pass a list to do a sweep in series.
  • n_threads (int) – The number of threads (technically GIL-avoiding child processes) to use to process multiple chromosomes in parallel. Pass -1 to use as many threads as there are CPUs. Pass 0 to process the chromosomes serially.

estimate_disp(estimator='qcml', frac=None, auto_frac_factor=15.0, weighted_lowess=True, n_threads=-1)

Estimates dispersion parameters.

Parameters:
  • estimator ('cml', 'qcml', 'mme', or a function) – Pass ‘cml’, ‘qcml’, or ‘mme’ to use conditional maximum likelihood (CML), quantile-adjusted CML (qCML), or method of moments estimation (MME), respectively, to estimate the dispersion within each bin. Pass a function that takes in a (pixels, replicates) shaped array of data and returns a dispersion value to use that function instead.
  • frac (float, optional) – The lowess smoothing fraction to use when fitting the distance vs dispersion trend. Pass None to choose a value automatically.
  • auto_frac_factor (float) – When frac is None, this factor scales the automatically determined fraction parameter.
  • weighted_lowess (bool) – Whether or not to use a weighted lowess fit when fitting the smoothed dispersion curve.
  • n_threads (int) – The number of threads (technically GIL-avoiding child processes) to use to process multiple distance scales in parallel. Pass -1 to use as many threads as there are CPUs. Pass 0 to process the distance scales serially.
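
To make the custom estimator option concrete, here is a minimal sketch of a function with the documented shape: it takes a (pixels, replicates) table and returns one dispersion value. It assumes the negative binomial parameterization var = mean + disp * mean**2 and averages simple per-pixel moment estimates; this is an illustrative simplification, not the library's 'mme' implementation, and the name mme_dispersion is hypothetical. Plain nested lists stand in for the array.

```python
def mme_dispersion(data):
    """Method-of-moments dispersion for a (pixels, replicates) table.

    Assumes var = mean + disp * mean**2, so each pixel yields the estimate
    (var - mean) / mean**2. The per-pixel estimates are averaged and the
    result is floored at zero.
    """
    estimates = []
    for row in data:
        values = [float(v) for v in row]
        n = len(values)
        if n < 2:
            continue  # a sample variance needs at least two replicates
        mean = sum(values) / n
        if mean <= 0:
            continue  # an all-zero pixel carries no dispersion information
        var = sum((v - mean) ** 2 for v in values) / (n - 1)
        estimates.append((var - mean) / mean ** 2)
    if not estimates:
        return 0.0
    return max(sum(estimates) / len(estimates), 0.0)


# Hypothetical counts: two pixels, two replicates each.
counts = [[8, 16], [10, 30]]
disp = mme_dispersion(counts)
```

A function of this shape could then presumably be supplied as the estimator argument in place of one of the named methods.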

lrt(chrom=None, refit_mu=True, n_threads=-1, verbose=True)

Runs the likelihood ratio test to test for differential interactions.

Parameters:
  • chrom (str) – The name of the chromosome to run the LRT for. Pass None to run for all chromosomes in series.
  • refit_mu (bool) – Pass True to refit the mean parameters in the NB models being compared in the LRT. Pass False to use the means across replicates directly, which is simpler and slightly faster but technically violates the assumptions of the LRT.
  • n_threads (int) – The number of threads (technically GIL-avoiding child processes) to use to process multiple chromosomes in parallel. Pass -1 to use as many threads as there are CPUs. Pass 0 to process the chromosomes serially.
  • verbose (bool) – Pass False to silence reporting of progress to stderr.
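
The general shape of a likelihood ratio test between two negative binomial (NB) models can be sketched for a single pixel as below. The sketch assumes a known dispersion, ignores normalization factors, uses plain sample means as the mean parameters, and compares the statistic to a chi-square distribution with one degree of freedom. The function names are hypothetical and this is not the library's implementation.

```python
import math


def nb_logpmf(k, mu, disp):
    """Log-probability of count k under an NB with mean mu > 0 and
    variance mu + disp * mu**2."""
    r = 1.0 / disp
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(r / (r + mu)) + k * math.log(mu / (r + mu)))


def lrt_pvalue(counts_a, counts_b, disp):
    """P-value for 'one shared mean' (null) vs 'one mean per condition'.

    Assumes every mean is positive, so at least one count per condition
    must be nonzero.
    """
    counts_a, counts_b = list(counts_a), list(counts_b)
    mu_a = sum(counts_a) / len(counts_a)
    mu_b = sum(counts_b) / len(counts_b)
    both = counts_a + counts_b
    mu_null = sum(both) / len(both)
    ll_alt = (sum(nb_logpmf(k, mu_a, disp) for k in counts_a)
              + sum(nb_logpmf(k, mu_b, disp) for k in counts_b))
    ll_null = sum(nb_logpmf(k, mu_null, disp) for k in both)
    # Guard against tiny negative values caused by floating-point roundoff.
    stat = max(2.0 * (ll_alt - ll_null), 0.0)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical replicate counts for one pixel in two conditions.
p_same = lrt_pvalue([10, 12], [10, 12], disp=0.01)
p_diff = lrt_pvalue([10, 12], [100, 120], disp=0.01)
```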

prepare_data(chrom=None, norm='conditional_mor', n_bins=-1, n_threads=-1, verbose=True)

Prepares raw and normalized data for analysis.

Parameters:
  • chrom (str) – The name of the chromosome to prepare raw data for. Pass None to run for all chromosomes in series.
  • norm (str) –

    The method to use to account for differences in sequencing depth. Valid options are:

    • simple_scaling: scale each replicate to equal total depth
    • median_of_ratios: use median of ratios normalization, ignoring pixels at which any replicate has a zero
    • conditional_scaling: apply simple scaling independently at each distance scale
    • conditional_mor: apply median of ratios independently at each distance scale
  • n_bins (int, optional) – Number of distance bins to use during scaling normalization if norm is one of the conditional options. Pass 0 or None to match pixels by exact distance. Pass -1 to use a reasonable default: 1/5 of self.dist_thresh_max.
  • n_threads (int) – The number of threads (technically GIL-avoiding child processes) to use to process multiple chromosomes in parallel. Pass -1 to use as many threads as there are CPUs. Pass 0 to process the chromosomes serially.
  • verbose (bool) – Pass False to silence reporting of progress to stderr.
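
The median_of_ratios option described above can be illustrated with a small pure-Python sketch of the standard median of ratios calculation, including the rule that pixels with a zero in any replicate are ignored. The function name and the example counts are hypothetical, the conditional variants (which would apply this separately within each distance bin) are not shown, and the library's own implementation may differ in detail.

```python
import math
from statistics import median


def median_of_ratios_size_factors(counts):
    """Size factor per replicate for a list of pixels x replicates counts.

    Assumes at least one pixel has nonzero counts in every replicate.
    """
    n_reps = len(counts[0])
    # Ignore pixels at which any replicate has a zero.
    usable = [row for row in counts if all(v > 0 for v in row)]
    ratios = [[] for _ in range(n_reps)]
    for row in usable:
        # Geometric mean across replicates serves as the reference value.
        geo_mean = math.exp(sum(math.log(v) for v in row) / n_reps)
        for j, v in enumerate(row):
            ratios[j].append(v / geo_mean)
    return [median(r) for r in ratios]


# Hypothetical counts: the second replicate is sequenced twice as deeply.
counts = [[10, 20], [30, 60], [0, 5], [8, 16]]
size_factors = median_of_ratios_size_factors(counts)
# Dividing by the size factors puts the replicates on a common scale.
normalized = [[v / s for v, s in zip(row, size_factors)] for row in counts]
```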

run_to_qvalues(norm='conditional_mor', n_bins_norm=-1, estimator='qcml', frac=None, auto_frac_factor=15.0, weighted_lowess=True, refit_mu=True, n_threads=-1, verbose=True)

Shortcut method to run the analysis to q-values.

Parameters:
  • norm (str) –

    The method to use to account for differences in sequencing depth. Valid options are:

    • simple_scaling: scale each replicate to equal total depth
    • median_of_ratios: use median of ratios normalization, ignoring pixels at which any replicate has a zero
    • conditional_scaling: apply simple scaling independently at each distance scale
    • conditional_mor: apply median of ratios independently at each distance scale
  • n_bins_norm (int, optional) – Number of distance bins to use during scaling normalization if norm is one of the conditional options. Pass 0 or None to match pixels by exact distance. Pass -1 to use a reasonable default: 1/5 of self.dist_thresh_max.
  • estimator ('cml', 'qcml', 'mme', or a function) – Pass ‘cml’, ‘qcml’, or ‘mme’ to use conditional maximum likelihood (CML), quantile-adjusted CML (qCML), or method of moments estimation (MME), respectively, to estimate the dispersion within each bin. Pass a function that takes in a (pixels, replicates) shaped array of data and returns a dispersion value to use that function instead.
  • frac (float, optional) – The lowess smoothing fraction to use when fitting the distance vs dispersion trend. Pass None to choose a value automatically.
  • auto_frac_factor (float) – When frac is None, this factor scales the automatically determined fraction parameter.
  • weighted_lowess (bool) – Whether or not to use a weighted lowess fit when fitting the smoothed dispersion curve.
  • refit_mu (bool) – Pass True to refit the mean parameters in the NB models being compared in the LRT. Pass False to use the means across replicates directly, which is simpler and slightly faster but technically violates the assumptions of the LRT.
  • n_threads (int) – The number of threads (technically GIL-avoiding child processes) to use to process multiple chromosomes in parallel. Pass -1 to use as many threads as there are CPUs. Pass 0 to process the chromosomes serially.
  • verbose (bool) – Pass False to silence reporting of progress to stderr.
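
Judging from the parameter lists on this page, the shortcut plausibly corresponds to calling prepare_data(), estimate_disp(), lrt(), and bh() in that order, with n_bins_norm passed on as n_bins. The sketch below spells out that assumed sequence. The step order is an inference from the signatures rather than something stated in the docstring, and RecordingStub is a hypothetical stand-in that only records calls so the example runs without the library.

```python
def run_to_qvalues_by_hand(h, norm="conditional_mor", n_bins_norm=-1,
                           estimator="qcml", frac=None, auto_frac_factor=15.0,
                           weighted_lowess=True, refit_mu=True, n_threads=-1,
                           verbose=True):
    """Call the individual steps on h in the assumed order."""
    h.prepare_data(norm=norm, n_bins=n_bins_norm, n_threads=n_threads,
                   verbose=verbose)
    h.estimate_disp(estimator=estimator, frac=frac,
                    auto_frac_factor=auto_frac_factor,
                    weighted_lowess=weighted_lowess, n_threads=n_threads)
    h.lrt(refit_mu=refit_mu, n_threads=n_threads, verbose=verbose)
    h.bh()


class RecordingStub:
    """Hypothetical stand-in that records method calls; not part of hic3defdr."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Only reached for attributes that do not exist, i.e. the step methods.
        def method(**kwargs):
            self.calls.append((name, kwargs))
        return method


stub = RecordingStub()
run_to_qvalues_by_hand(stub, n_threads=0)
call_order = [name for name, _ in stub.calls]
```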

threshold(chrom=None, fdr=0.05, cluster_size=3, n_threads=-1)

Thresholds and clusters significantly differential pixels.

Should only be run after q-values have been obtained.

Parameters:
  • chrom (str) – The name of the chromosome to threshold. Pass None to threshold all chromosomes in series.
  • fdr (float or list of float) – The FDR to threshold on. Pass a list to do a sweep in series.
  • cluster_size (int or list of int) – Clusters smaller than this size will be filtered out. Pass a list to do a sweep in series.
  • n_threads (int) – The number of threads (technically GIL-avoiding child processes) to use to process multiple chromosomes in parallel. Pass -1 to use as many threads as there are CPUs. Pass 0 to process the chromosomes serially.
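
The thresholding and clustering step can be pictured with the following sketch: keep pixels whose q-value falls below the FDR threshold, group adjacent significant pixels into clusters, and drop clusters smaller than cluster_size. The choice of 4-connectivity, the strict q < fdr comparison, and the dict-based data layout are assumptions made for illustration, and threshold_and_cluster is a hypothetical name; the library's own clustering rule may differ.

```python
from collections import deque


def threshold_and_cluster(qvalues, fdr=0.05, cluster_size=3):
    """Cluster significant pixels; qvalues maps (row, col) -> q-value."""
    significant = {pixel for pixel, q in qvalues.items() if q < fdr}
    seen = set()
    clusters = []
    for start in sorted(significant):
        if start in seen:
            continue
        # Breadth-first search over significant pixels that share an edge.
        seen.add(start)
        queue = deque([start])
        cluster = set()
        while queue:
            row, col = queue.popleft()
            cluster.add((row, col))
            for neighbor in ((row + 1, col), (row - 1, col),
                             (row, col + 1), (row, col - 1)):
                if neighbor in significant and neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        # Clusters smaller than cluster_size are filtered out.
        if len(cluster) >= cluster_size:
            clusters.append(cluster)
    return clusters


# Hypothetical q-values for a handful of pixels.
qvalues = {(0, 0): 0.01, (0, 1): 0.02, (1, 1): 0.03, (5, 5): 0.01, (3, 3): 0.5}
clusters = threshold_and_cluster(qvalues, fdr=0.05, cluster_size=3)
# A sweep over several thresholds, analogous to passing a list for fdr.
sweep = {fdr: threshold_and_cluster(qvalues, fdr=fdr) for fdr in (0.01, 0.05)}
```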