hic3defdr data layout

In brief, the hic3defdr data layout is like a COO-format sparse matrix where the data vector is actually a rectangle, storing parallel data vectors for multiple replicates in the same data structure. This allows hic3defdr to combine the advantages of sparse matrix formats (like COO) together with the applications of analyzing data across replicates (like differential loop calling).

Like COO, the hic3defdr data layout keeps track of a row and col vector for each chromosome. These vectors are stored on disk as <outdir>/row_<chrom>.npy and <outdir>/col_<chrom>.npy, respectively.

In contrast to COO, where the data vector (parallel to row and col) is just a vector, in the hic3defdr data layout the data can be a rectangular matrix whose rows correspond to pixels (same length as row and col) and whose columns correspond to replicates or conditions. Each stage of the hic3defdr pipeline writes its output (in the form of this rectangle) to disk as <outdir>/<stage>_<chrom>.npy. The hic3defdr data layout is designed so that multiple “stages” of data processing can re-use the same row and col vectors, making it easy to trace pixel values across stages as well as across replicates.

One important complication is that certain steps of data processing may filter out pixels from the pipeline. This means that the number of pixels (number of rows in the rectangular data matrix) may be smaller for the output of later pipeline steps. Since these matrices have fewer rows, they don’t align with the row and col vectors, which are always the same length. To address this problem, boolean index vectors stored on disk as e.g. <outdir>/disp_idx_<chrom>.npy are aligned with row and col and are True at all pixels that are kept during filtering. This means that row[disp_idx] is aligned with rectangular matrices after the disp_idx filtering step, such as <outdir>/disp_<chrom>.npy. Finally, these indices can be chained, so that row[disp_idx][loop_idx] is aligned with rectangular matrices after the loop_idx filtering step, such as <outdir>/qvalues_<chrom>.npy.

A complete table of all the outputs, their expected shapes, and what boolean indices are needed to align them to row and col is provided in the README section “Intermediates and final output files”.