Data filtering

Points are filtered out of Hi-C datasets by HiC3DeFDR in three stages:

1. Initial data import

This filtering step is performed during the prepare_data() step.

Our motivation is to include as many points as possible. In particular, we do not filter out points that have zero in one replicate, because dropping them leads to underestimation of variance/dispersion.

We

  • exclude points beyond HiC3DeFDR.dist_thresh_max
  • exclude points that have zero in all reps (“pixel union strategy” implemented in hic3defdr.util.matrices.sparse_union())
  • exclude points in rows that failed balancing (decided by HiC3DeFDR.bias_thresh)

Points present in the data files row_<chrom>.npy, col_<chrom>.npy, raw_<chrom>.npy, and scaled_<chrom>.npy reflect points that survive this filtering step.
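As a rough illustration of how the three exclusions above combine, here is a minimal NumPy sketch. It is not the library's implementation. The helper name initial_filter, the array layouts, and the use of bin units for distance are all assumptions made for this example. The rule used for "failed balancing" (bias factor below bias_thresh or above its reciprocal in any replicate) is also an assumption.

```python
import numpy as np

def initial_filter(row, col, raw, bias, dist_thresh_max, bias_thresh):
    """Hypothetical helper: boolean mask of points that survive stage 1.

    row, col : int arrays of shape (n_points,), upper-triangular bin indices
    raw      : array of shape (n_points, n_reps) of raw counts
    bias     : array of shape (n_bins, n_reps) of balancing bias factors
    """
    # keep points that are at most dist_thresh_max bins from the diagonal
    near = (col - row) <= dist_thresh_max
    # keep points that are nonzero in at least one replicate
    nonzero = (raw > 0).any(axis=1)
    # assumed rule: a bin failed balancing if its bias factor lies outside
    # [bias_thresh, 1 / bias_thresh] in any replicate
    bad_bin = ((bias < bias_thresh) | (bias > 1.0 / bias_thresh)).any(axis=1)
    balanced = ~bad_bin[row] & ~bad_bin[col]
    return near & nonzero & balanced

# toy data: 5 candidate points, 2 replicates, 5 bins
row = np.array([0, 0, 1, 3, 0])
col = np.array([1, 3, 2, 4, 4])
raw = np.array([[5, 0], [0, 0], [3, 4], [1, 2], [7, 7]])
bias = np.ones((5, 2))
bias[2, 0] = 0.05  # bin 2 fails balancing in the first replicate

mask = initial_filter(row, col, raw, bias, dist_thresh_max=3, bias_thresh=0.1)
print(mask)
```

In this toy example the point (0, 1) is kept even though it is zero in one replicate, while (0, 3) is dropped for being zero in all replicates, (1, 2) for touching a bin that failed balancing, and (0, 4) for exceeding the distance threshold.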

2. disp_idx

This filter is computed during the prepare_data() step, but is not used until the estimate_disp() step.

Our motivation is to avoid fitting dispersion to points where we think dispersion estimation will be very hard. This includes points very close to the diagonal, where Hi-C gets crazy, and points with very low coverage (low mean across reps).

We

  • exclude points within HiC3DeFDR.dist_thresh_min
  • exclude points whose mean across reps is below HiC3DeFDR.mean_thresh

disp_idx_<chrom>.npy is a boolean vector aligned to <row/col>_<chrom>.npy that is True for all points that survive this filtering step.
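The two exclusions above can be sketched as a single boolean mask. This is only an illustration under stated assumptions: the helper name compute_disp_idx is hypothetical, distance is measured in bins, the mean is taken over the scaled values, and points exactly at either threshold are kept.

```python
import numpy as np

def compute_disp_idx(row, col, scaled, dist_thresh_min, mean_thresh):
    """Hypothetical helper: mask of points used for dispersion fitting.

    row, col : int arrays of shape (n_points,)
    scaled   : array of shape (n_points, n_reps) of scaled values
    """
    dist = col - row
    mean = scaled.mean(axis=1)
    # assumed boundary handling: keep points at or above both thresholds
    return (dist >= dist_thresh_min) & (mean >= mean_thresh)

# toy data: 4 points that survived the initial filter, 2 replicates
row = np.array([0, 0, 0, 1])
col = np.array([1, 4, 6, 7])
scaled = np.array([[10.0, 12.0], [0.5, 1.0], [8.0, 6.0], [3.0, 7.0]])

disp_idx = compute_disp_idx(row, col, scaled, dist_thresh_min=4, mean_thresh=5)
print(disp_idx)
```

The result has the same length as row and col, so it can be used directly to index any array aligned to them, for example scaled[disp_idx].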

3. loop_idx

This filter is computed during the prepare_data() step, but is not used until the bh() step.

This filter is only computed and used if HiC3DeFDR.loop_patterns is not None.

Our motivation is to reduce the number of hypotheses that we test when performing multiple testing correction via BH-FDR. Since we are only interested in finding differential loops, we can choose to only test the hypotheses that correspond to positions where there are loops.

We

  • exclude points that are not in a loop as defined by HiC3DeFDR.loop_patterns

loop_idx_<chrom>.npy is a boolean vector with one entry for each position where disp_idx_<chrom>.npy is True. An entry is True if the corresponding point survives this filtering step.
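Because loop_idx is aligned to the points selected by disp_idx rather than to the full row/col vectors, the indexing can be easy to get wrong. The sketch below illustrates the alignment only. The helper name compute_loop_idx is hypothetical, and representing loops as a set of (row, col) pixels is an assumption made for this example, not a description of how HiC3DeFDR.loop_patterns is actually parsed.

```python
import numpy as np

def compute_loop_idx(row, col, disp_idx, loop_pixels):
    """Hypothetical helper: mask over the disp_idx-selected points.

    loop_pixels : set of (row, col) tuples considered to be inside a loop
    """
    # only points that survived disp_idx get an entry in loop_idx
    kept = zip(row[disp_idx], col[disp_idx])
    return np.array([(int(r), int(c)) in loop_pixels for r, c in kept],
                    dtype=bool)

# toy data: 4 points, 3 of which survive disp_idx
row = np.array([0, 0, 0, 1])
col = np.array([1, 4, 6, 7])
disp_idx = np.array([False, True, True, True])
loop_pixels = {(0, 6), (2, 9)}  # assumed loop representation

loop_idx = compute_loop_idx(row, col, disp_idx, loop_pixels)

# map back to the full-length row/col vectors with nested boolean indexing
full_idx = np.zeros(len(row), dtype=bool)
full_idx[disp_idx] = loop_idx
print(loop_idx, full_idx)
```

Here loop_idx has length disp_idx.sum(), not len(row). Selecting the tested points from a full-length array therefore takes two steps, for example row[disp_idx][loop_idx].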