hic3defdr.util.cluster_table module

hic3defdr.util.cluster_table.add_columns_to_cluster_table(cluster_table, name_pattern, row, col, data, labels=None, reducer='mean', chrom=None)

Adds new data columns to an existing cluster table by evaluating a sparse dataset specified by row, col, data at the pixels in each cluster and combining the resulting values using reducer.

This function operates in-place.

Parameters:

cluster_table (pd.DataFrame) – Must contain a “cluster” column. If the values in this column are strings, they will be converted to lists of lists of int in-place.
name_pattern (str) – The name of the column to fill in. If data contains more than one column, multiple columns will be added; in that case, include exactly one %s in name_pattern, and the i-th new column will be called name_pattern % labels[i].
row, col, data – Sparse-format data used to determine the value to fill in for each cluster for each new column. row and col must be parallel to the first dimension of data. If data is two-dimensional, you must pass labels to label the columns and include a %s in name_pattern.
labels (list of str, optional) – If data is two-dimensional, pass a list of strings labeling the columns of data.
reducer ({'mean', 'max', 'min'}) – The function to use to combine the values for the pixels in each cluster.
chrom (str, optional) – If the cluster_table contains data from multiple chromosomes, pass the name of the chromosome that row, col, data correspond to and only clusters for that chromosome will have their new column created/updated. If the cluster_table contains data from only one chromosome, pass None to update all clusters in the cluster_table.
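
The column naming described for name_pattern and labels is ordinary Python %-formatting. The short sketch below is illustrative only and does not call hic3defdr; the variable new_columns is a name chosen for this example.

```python
# Illustration only: how a name_pattern containing exactly one %s
# expands over the labels of a two-column data array.
name_pattern = '%s_mean'
labels = ['rep1', 'rep2']

# the i-th new column is named name_pattern % labels[i]
new_columns = [name_pattern % label for label in labels]
print(new_columns)  # ['rep1_mean', 'rep2_mean']
```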

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from hic3defdr.util.cluster_table import clusters_to_table, \
...     add_columns_to_cluster_table
>>> # basic test: clusters all on one chromosome
>>> clusters = [[(1, 2), (1, 1)], [(4, 4),  (3, 4)]]
>>> res = 10000
>>> df = clusters_to_table(clusters, 'chrX', res)
>>> row, col = zip(*sum(clusters, []))
>>> data = np.array([[1, 2],
...                  [3, 4],
...                  [5, 6],
...                  [7, 8]], dtype=float)
>>> add_columns_to_cluster_table(df, '%s_mean', row, col, data,
...                              labels=['rep1', 'rep2'])
>>> df.iloc[0, :]
us_chrom                    chrX
us_start                   10000
us_end                     20000
ds_chrom                    chrX
ds_start                   10000
ds_end                     30000
cluster_size                   2
cluster         [[1, 2], [1, 1]]
rep1_mean                      2
rep2_mean                      3
Name: chrX:10000-20000_chrX:10000-30000, dtype: object
>>> # advanced test: two chromosomes
>>> df1 = clusters_to_table(clusters, 'chr1', res)
>>> df2 = clusters_to_table(clusters, 'chr2', res)
>>> df = pd.concat([df1, df2], axis=0)
>>> # add chr1 info
>>> add_columns_to_cluster_table(df, '%s_mean', row, col, data,
...                              labels=['rep1', 'rep2'], chrom='chr1')
>>> # chr1 cluster has data filled in
>>> df.loc[df.index[0], ['rep1_mean', 'rep2_mean']]
rep1_mean    2
rep2_mean    3
Name: chr1:10000-20000_chr1:10000-30000, dtype: object
>>> # chr2 cluster has nans
>>> df.loc[df.index[2], ['rep1_mean', 'rep2_mean']]
rep1_mean    NaN
rep2_mean    NaN
Name: chr2:10000-20000_chr2:10000-30000, dtype: object
>>> # add chr2 info, with different data (reversed row order)
>>> add_columns_to_cluster_table(df, '%s_mean', row, col, data[::-1, :],
...                              labels=['rep1', 'rep2'], chrom='chr2')
>>> # now the chr2 clusters have data
>>> df.loc[df.index[2], ['rep1_mean', 'rep2_mean']]
rep1_mean    6
rep2_mean    7
Name: chr2:10000-20000_chr2:10000-30000, dtype: object
>>> # edge case: data is a vector
>>> df = clusters_to_table(clusters, 'chrX', res)
>>> add_columns_to_cluster_table(df, 'value', row, col, data[:, 0])
>>> df.iloc[0, :]
us_chrom                    chrX
us_start                   10000
us_end                     20000
ds_chrom                    chrX
ds_start                   10000
ds_end                     30000
cluster_size                   2
cluster         [[1, 2], [1, 1]]
value                          2
Name: chrX:10000-20000_chrX:10000-30000, dtype: object
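
To make the "evaluate at each pixel, then reduce" step concrete, here is a minimal pure-Python sketch of the idea. It is not the library's implementation: reduce_clusters and REDUCERS are hypothetical names, and the sketch assumes every cluster pixel is present in the sparse data (how the real function treats missing pixels is not shown here).

```python
from statistics import mean

# Hypothetical mapping from reducer name to function (mirrors the
# documented choices 'mean', 'max', 'min').
REDUCERS = {'mean': mean, 'max': max, 'min': min}


def reduce_clusters(clusters, row, col, data, reducer='mean'):
    """Return one reduced value per cluster (illustration only)."""
    # Map each (row, col) pixel to its value for constant-time lookup.
    # Assumption: every pixel in every cluster appears in (row, col).
    lookup = {(r, c): v for r, c, v in zip(row, col, data)}
    reduce_fn = REDUCERS[reducer]
    return [reduce_fn([lookup[tuple(pixel)] for pixel in cluster])
            for cluster in clusters]


clusters = [[(1, 2), (1, 1)], [(4, 4), (3, 4)]]
row, col = zip(*sum(clusters, []))
rep1 = [1.0, 3.0, 5.0, 7.0]  # first column of `data` in the examples above

print(reduce_clusters(clusters, row, col, rep1))         # [2.0, 6.0]
print(reduce_clusters(clusters, row, col, rep1, 'max'))  # [3.0, 7.0]
```

The first value, 2.0, agrees with the rep1_mean shown for the first cluster in the examples above.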

hic3defdr.util.cluster_table.clusters_to_table(clusters, chrom, res)

Creates a DataFrame which tabulates cluster information.

The DataFrame’s first column (and index) will be a “loop_id” in the form “chr:start-end_chr:start-end”. Its other columns will be “us_chrom”, “us_start”, “us_end”, and “ds_chrom”, “ds_start”, “ds_end”, representing the BED-style chromosome, start coordinate, and end coordinate of the upstream (“us”, smaller coordinate values) and downstream (“ds”, larger coordinate values) anchors of the loop, respectively. These anchors together form a rectangular “bounding box” that completely encloses the significant pixels in the cluster. The DataFrame will also have a “cluster_size” column representing the total number of significant pixels in the cluster. Finally, the exact indices of the significant pixels in the cluster will be recorded in a “cluster” column in a JSON-like format (using only square brackets).

Parameters:

clusters (list of list of tuple) – The outer list is a list of clusters. Each cluster is a list of (i, j) tuples marking the position of significant points which belong to that cluster.
chrom (str) – The name of the chromosome these clusters are on.
res (int) – The resolution of the contact matrix referred to by the row and column indices in clusters, in units of base pairs.

Returns:	The table of loop information.
Return type:	pd.DataFrame

Examples

>>> from hic3defdr.util.cluster_table import clusters_to_table
>>> clusters = [[(1, 2), (1, 1)], [(4, 4),  (3, 4)]]
>>> df = clusters_to_table(clusters, 'chrX', 10000)
>>> df.iloc[0, :]
us_chrom                    chrX
us_start                   10000
us_end                     20000
ds_chrom                    chrX
ds_start                   10000
ds_end                     30000
cluster_size                   2
cluster         [[1, 2], [1, 1]]
Name: chrX:10000-20000_chrX:10000-30000, dtype: object
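
The bounding-box arithmetic can be sketched as follows. The formulas are inferred from the example output above (start = smallest index times res, end = largest index plus one, times res), not taken from the library source, and bounding_box is a hypothetical helper name.

```python
def bounding_box(cluster, chrom, res):
    """Compute BED-style anchors for one cluster (illustration only)."""
    rows, cols = zip(*cluster)
    # Upstream anchor spans the row indices; downstream spans the columns.
    # Ends are exclusive, so the largest bin index is incremented by one.
    us_start, us_end = min(rows) * res, (max(rows) + 1) * res
    ds_start, ds_end = min(cols) * res, (max(cols) + 1) * res
    loop_id = '%s:%i-%i_%s:%i-%i' % (
        chrom, us_start, us_end, chrom, ds_start, ds_end)
    return {
        'loop_id': loop_id,
        'us_start': us_start, 'us_end': us_end,
        'ds_start': ds_start, 'ds_end': ds_end,
        'cluster_size': len(cluster),
    }


box = bounding_box([(1, 2), (1, 1)], 'chrX', 10000)
print(box['loop_id'])  # chrX:10000-20000_chrX:10000-30000
```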

hic3defdr.util.cluster_table.load_cluster_table(table_filename)

Loads a cluster table from a TSV file on disk to a DataFrame.

This function will ensure that the “cluster” column of the DataFrame is converted from a string representation to a list of list of int to simplify downstream processing.

See the example below for details on how this function assumes the cluster table was saved.

Parameters:	table_filename (str) – String reference to the location of the TSV file.
Returns:	The loaded cluster table.
Return type:	pd.DataFrame

Examples

>>> from tempfile import TemporaryFile
>>> from hic3defdr.util.cluster_table import clusters_to_table, \
...     load_cluster_table
>>> clusters = [[(1, 2), (1, 1)], [(4, 4),  (3, 4)]]
>>> df = clusters_to_table(clusters, 'chrX', 10000)
>>> f = TemporaryFile(mode='w+')  # simulates a file on disk
>>> df.to_csv(f, sep='\t')
>>> position = f.seek(0)
>>> loaded_df = load_cluster_table(f)
>>> df.equals(loaded_df)
True
>>> loaded_df['cluster'].iloc[0]  # positional access; the index holds loop_id strings
[[1, 2], [1, 1]]
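
The string-to-list conversion of the “cluster” column can be illustrated with the standard json module. This is a sketch under the assumption that the stored string is valid JSON made only of square brackets and integers, as the clusters_to_table description suggests; it does not show how hic3defdr itself performs the conversion.

```python
import json

cluster = [(1, 2), (1, 1)]

# Serialize: tuples become lists, so only square brackets appear.
as_string = json.dumps([list(pixel) for pixel in cluster])
print(as_string)  # [[1, 2], [1, 1]]

# Parse the string back into a list of lists of int.
parsed = json.loads(as_string)
print(parsed[0])  # [1, 2]
```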

hic3defdr.util.cluster_table.sort_cluster_table(cluster_table)

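Sorts the rows of a cluster table in the expected order: chromosomes in natural order (for example, chr2 before chr11, and numbered chromosomes before chrX), then by coordinate within each chromosome.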

This function does not operate in-place.

We expect this to get a lot easier after this pandas issue is fixed: https://github.com/pandas-dev/pandas/issues/3942

Parameters:	cluster_table (pd.DataFrame) – The cluster table to sort. Must have all the expected columns.
Returns:	The sorted cluster table.
Return type:	pd.DataFrame

Examples

>>> import pandas as pd
>>> from hic3defdr.util.cluster_table import clusters_to_table, \
...     sort_cluster_table
>>> clusters = [[(4, 4),  (3, 4)], [(1, 2), (1, 1)]]
>>> res = 10000
>>> df1 = clusters_to_table(clusters, 'chr1', res)
>>> df2 = clusters_to_table(clusters, 'chr2', res)
>>> df3 = clusters_to_table(clusters, 'chr11', res)
>>> df4 = clusters_to_table(clusters, 'chrX', res)
>>> df = pd.concat([df4, df3, df2, df1], axis=0)
>>> sort_cluster_table(df).index
Index(['chr1:10000-20000_chr1:10000-30000',
       'chr1:30000-50000_chr1:40000-50000',
       'chr2:10000-20000_chr2:10000-30000',
       'chr2:30000-50000_chr2:40000-50000',
       'chr11:10000-20000_chr11:10000-30000',
       'chr11:30000-50000_chr11:40000-50000',
       'chrX:10000-20000_chrX:10000-30000',
       'chrX:30000-50000_chrX:40000-50000'],
      dtype='object', name='loop_id')
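
The ordering shown above can be reproduced with a natural-sort key on the loop_id strings. The sketch below is only an illustration of that ordering, not the method sort_cluster_table uses internally; chrom_key and loop_id_key are hypothetical helpers, and chromosome names are assumed to carry a "chr" prefix.

```python
import re


def chrom_key(chrom):
    """Sort numbered chromosomes numerically, before named ones like X."""
    name = chrom[3:] if chrom.startswith('chr') else chrom
    return (0, int(name), '') if name.isdigit() else (1, 0, name)


def loop_id_key(loop_id):
    """Key on upstream chromosome, then on the four coordinates."""
    match = re.match(
        r'([^:]+):(\d+)-(\d+)_([^:]+):(\d+)-(\d+)$', loop_id)
    us_chrom, us_start, us_end, _, ds_start, ds_end = match.groups()
    return (chrom_key(us_chrom), int(us_start), int(us_end),
            int(ds_start), int(ds_end))


# Same loop IDs as the example above, in the unsorted input order.
loop_ids = [
    '%s:%s' % (c, rest.replace('CHR', c))
    for c in ['chrX', 'chr11', 'chr2', 'chr1']
    for rest in ['30000-50000_CHR:40000-50000', '10000-20000_CHR:10000-30000']
]
sorted_ids = sorted(loop_ids, key=loop_id_key)
print(sorted_ids[0])   # chr1:10000-20000_chr1:10000-30000
print(sorted_ids[-1])  # chrX:30000-50000_chrX:40000-50000
```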