Using sparse_union() to load a rectangular data matrix

Motivation

The hic3defdr ecosystem revolves around a specific data layout. hic3defdr automatically imports your input npz’s into this layout when running the pipeline. But what if you want to convert your npz’s into hic3efdr’s data layout without running the pipeline?

The key ingredient that will allow us to do this is the function hic3defdr.util.matrices.sparse_union(), which will be demonstrated below.

Walkthrough

In an interactive shell, import hic3defdr.util.matrices.sparse_union():

>>> import numpy as np
>>> import scipy.sparse as sparse
>>> from hic3defdr.util.matrices import sparse_union

The kwargs on this function mostly serve to help filter out pixels to reduce the total number of pixels. The most important kwarg for this is dist_thresh, which simply drops all pixels with interaction distances longer than dist_thresh. This threshold is on by default and is highly recommended.

The remaining kwargs work together to throw out pixels that have low mean values across replicates after normalization (by bias and size_factors if these are passed). This filtering is off by default (mean_thresh=0.0) and is not recommended. This means that you don’t need to worry about passing bias or size_factors to this function even if you have these values available.

Create some test npz’s from dense matrices with zeros in different positions:

>>> rep1 = np.array([[0., 0., 3., 1.],
...                  [0., 6., 5., 0.],
...                  [0., 0., 0., 2.],
...                  [0., 0., 0., 7.]])
>>> rep2 = np.array([[0., 1., 3., 2.],
...                  [0., 0., 0., 0.],
...                  [0., 0., 4., 2.],
...                  [0., 0., 0., 3.]])
>>> sparse.save_npz('rep1.npz', sparse.csr_matrix(rep1))
>>> sparse.save_npz('rep2.npz', sparse.csr_matrix(rep2))

Use sparse_union to identify the union pixel set:

>>> rep_npzs = ['rep1.npz', 'rep2.npz']
>>> row, col = sparse_union(rep_npzs, dist_thresh=2)
>>> list(zip(row, col))
[(0, 1), (0, 2), (1, 1), (1, 2), (2, 2), (2, 3), (3, 3)]

Notice that pixels (0, 0) and (1, 3) are not in the union pixel set. This is because these pixels are zero in both replicates.

Also notice that pixel (0, 3) is also not in the union pixel set. This is because its distance (3 - 0 = 3) is greater than the dist_thresh we passed to sparse_union().

Finally, we can construct the data matrix:

>>> data = np.zeros((len(row), len(rep_npzs)))
>>> for i in range(len(rep_npzs)):
...     data[:, i] = sparse.load_npz(rep_npzs[i]).tocsr()[row, col]
>>> data
array([[0., 1.],
       [3., 3.],
       [6., 0.],
       [5., 0.],
       [0., 4.],
       [2., 2.],
       [7., 3.]])

If we want to know the interaction distance for each pixel (each row of data), we can calculate:

>>> dist = col - row
>>> dist
array([1, 2, 0, 1, 0, 1, 0], dtype=int32)

To clean up, we will delete the npz’s we created.:

>>> import os
>>> for f in rep_npzs:
...     os.remove(f)