hic3defdr data layout
In brief, the hic3defdr data layout is like a COO-format sparse matrix in which the data vector is replaced by a rectangle that stores parallel data vectors for multiple replicates in a single data structure. This lets hic3defdr combine the advantages of sparse matrix formats (like COO) with analyses that span replicates (like differential loop calling).
Like COO, the hic3defdr data layout keeps track of a row and col vector for
each chromosome. These vectors are stored on disk as <outdir>/row_<chrom>.npy
and <outdir>/col_<chrom>.npy, respectively.
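As a rough illustration of the file naming described above, the following sketch loads the row and col vectors for one chromosome with NumPy. The helper name load_row_col, the chromosome name chr1, and the toy values are assumptions made for this example and are not part of hic3defdr; the files are written to a temporary directory only so that the sketch runs on its own.

```python
import os
import tempfile

import numpy as np


def load_row_col(outdir, chrom):
    """Hypothetical helper: load the row and col vectors for one chromosome."""
    row = np.load(os.path.join(outdir, "row_%s.npy" % chrom))
    col = np.load(os.path.join(outdir, "col_%s.npy" % chrom))
    return row, col


# Write toy vectors to a temporary directory (stand-in for a real <outdir>).
outdir = tempfile.mkdtemp()
np.save(os.path.join(outdir, "row_chr1.npy"), np.array([0, 0, 1, 2]))
np.save(os.path.join(outdir, "col_chr1.npy"), np.array([1, 2, 2, 3]))

row, col = load_row_col(outdir, "chr1")

# row and col are parallel: entry i of each describes the same pixel.
assert len(row) == len(col)
```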
In contrast to COO, where the data vector (parallel to row and col) is
just a vector, in the hic3defdr data layout the data can be a rectangular
matrix whose rows correspond to pixels (same length as row and col) and
whose columns correspond to replicates or conditions. Each stage of the
hic3defdr pipeline writes its output (in the form of this rectangle) to disk as
<outdir>/<stage>_<chrom>.npy. The hic3defdr data layout is designed so that
multiple “stages” of data processing can re-use the same row and col
vectors, making it easy to trace pixel values across stages as well as across
replicates.
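The relationship between row, col, and the rectangular data matrix can be sketched with toy arrays. Everything below (the values, the number of replicates, and the helper dense_for_replicate) is made up for illustration; it only assumes that the rectangle has one row per pixel and one column per replicate, as described above.

```python
import numpy as np

# Toy stand-ins for the row and col vectors of one chromosome (4 pixels).
row = np.array([0, 0, 1, 2])
col = np.array([1, 2, 2, 3])

# Toy stand-in for one stage's output: 4 pixels x 3 replicates.
data = np.array([
    [5.0, 6.0, 4.0],
    [1.0, 0.0, 2.0],
    [7.0, 8.0, 9.0],
    [3.0, 3.0, 2.0],
])


def dense_for_replicate(row, col, data, rep_idx, size):
    """Hypothetical helper: rebuild a dense matrix for a single replicate.

    Assumes each (row, col) pair appears at most once.
    """
    dense = np.zeros((size, size))
    dense[row, col] = data[:, rep_idx]
    return dense


# The i-th row of the rectangle holds every replicate's value for the
# pixel at (row[i], col[i]).
pixel_values = data[2]  # all replicates at pixel (1, 2)

# Selecting one column recovers an ordinary COO data vector.
dense_rep1 = dense_for_replicate(row, col, data, rep_idx=1, size=4)
```

Because every stage shares the same row and col vectors, a second stage's rectangle could be indexed with the same pixel position to compare values across stages.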
One important complication is that certain steps of data processing may filter
out pixels from the pipeline. This means that the number of pixels (number of
rows in the rectangular data matrix) may be smaller for the output of later
pipeline steps. Since these matrices have fewer rows, they no longer align with
the row and col vectors, whose length never changes. To address this
problem, boolean index vectors stored on disk as e.g.
<outdir>/disp_idx_<chrom>.npy are aligned with row and col and are True at
all pixels that are kept during filtering. This means that row[disp_idx] is
aligned with rectangular matrices after the disp_idx filtering step, such as
<outdir>/disp_<chrom>.npy. Finally, these indices can be chained, so that
row[disp_idx][loop_idx] is aligned with rectangular matrices after the
loop_idx filtering step, such as <outdir>/qvalues_<chrom>.npy.
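The chaining of boolean indices described above can be sketched with toy arrays. All values and shapes here are invented for illustration (the real shapes are listed in the README table); the sketch only assumes that disp_idx is aligned with row and col and that loop_idx is aligned with row[disp_idx]. The last step, scattering filtered values back to full length with np.flatnonzero, is one possible convenience and is not something the source says hic3defdr itself does.

```python
import numpy as np

# Toy full-length row and col vectors (6 pixels).
row = np.array([0, 0, 1, 2, 3, 4])
col = np.array([1, 2, 2, 3, 4, 5])

# Toy boolean indices: disp_idx is aligned with row/col (keeps 4 pixels);
# loop_idx is aligned with row[disp_idx] (keeps 3 of those 4).
disp_idx = np.array([True, False, True, True, False, True])
loop_idx = np.array([True, True, False, True])

# Toy values standing in for a stage written after the loop_idx filter.
qvalues = np.array([0.01, 0.2, 0.5])

# Chain the indices to recover the coordinates of each filtered pixel.
row_loop = row[disp_idx][loop_idx]
col_loop = col[disp_idx][loop_idx]
assert len(row_loop) == len(qvalues)

# Optional: map the filtered values back onto the full-length pixel list,
# leaving NaN wherever a pixel was filtered out.
kept_positions = np.flatnonzero(disp_idx)[loop_idx]
full_q = np.full(len(row), np.nan)
full_q[kept_positions] = qvalues
```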
A complete table of all the outputs, their expected shapes, and what boolean
indices are needed to align them to row and col is provided in the README
section “Intermediates and final output files”.