Input Format
Both Hi-C and ChIP-seq profiles are required to run optimalTAD.
Hi-C data
Hi-C contact matrices should be iteratively corrected and stored in one of the following formats:
.cool (.mcool). Cool is a widely used binary Hi-C storage format introduced by the Mirny lab.
import cooler
c = cooler.Cooler('Filename.cool')
c.info
{'bin-size': 20000,
'bin-type': 'fixed',
'creation-date': '2020-06-02T14:03:32.896753',
'format': 'HDF5::Cooler',
'format-url': 'https://github.com/mirnylab/cooler',
'format-version': 3,
'generated-by': 'cooler-0.8.7',
'genome-assembly': 'unknown',
'metadata': {},
'nbins': 6024,
'nchroms': 7,
'nnz': 1903526,
'storage-mode': 'symmetric-upper',
'sum': 16003605}
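Since both .cool and .mcool files are accepted, it may help to see how the two are addressed in cooler. The sketch below is not part of optimalTAD: `cooler_uri` is a hypothetical helper, and it assumes the usual multi-resolution layout in which each matrix of an .mcool file sits in a group named `resolutions/<bin size>` and is selected with the `::` separator.

```python
from pathlib import Path


def cooler_uri(path, resolution=None):
    """Build the string that cooler.Cooler() takes (hypothetical helper).

    A single-resolution .cool file is addressed by its path alone.
    A multi-resolution .mcool file needs the resolution group appended
    after the "::" separator.
    """
    path = str(path)
    if Path(path).suffix.lower() == ".mcool":
        if resolution is None:
            raise ValueError("an .mcool file needs an explicit resolution")
        return f"{path}::resolutions/{int(resolution)}"
    return path


print(cooler_uri("Filename.cool"))
print(cooler_uri("Filename.mcool", resolution=20000))

# With cooler installed, the result can be passed to the constructor:
#   import cooler
#   c = cooler.Cooler(cooler_uri("Filename.mcool", resolution=20000))
```

The resolution passed here should match one of the bin sizes actually stored in the file; 20000 is used only because it is the bin size in the example output above.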
.hdf5. The algorithm also supports .hdf5 matrices, but these files must follow a specific structure. The Hi-C data of each chromosome must be stored under a key named after that chromosome (‘2L’, ‘2R’, etc.). Chromosome names must be listed in the chromosomeLabels key, and the matrix resolution must be given in the resolution key. Here is an example:
import h5py
f = h5py.File("Filename.hdf5", 'r')
f.keys()
<KeysViewHDF5 ['chr2L', 'chr2R', 'chr3L', 'chr3R', 'chr4', 'chrX', 'chromosomeIndex', 'chromosomeLabels', 'chromosomeStarts', 'genome', 'positionIndex', 'resolution']>
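The key layout described above can be checked before running the tool. The following is a minimal sketch, not optimalTAD's own validation: `missing_hdf5_keys` is a hypothetical helper that only compares key names, and the list of chromosome names is supplied by the caller, because how chromosomeLabels is encoded inside the file is not specified here.

```python
# Keys that the layout described above requires in every file.
REQUIRED_KEYS = ("chromosomeLabels", "resolution")


def missing_hdf5_keys(keys, chromosomes):
    """Return the expected keys that are absent from `keys`.

    `keys` is any iterable of key names (for example f.keys() of an open
    h5py.File); `chromosomes` lists the chromosome names expected as keys.
    """
    present = set(keys)
    expected = list(REQUIRED_KEYS) + list(chromosomes)
    return [key for key in expected if key not in present]


# Key names copied from the example listing above.
keys = ['chr2L', 'chr2R', 'chr3L', 'chr3R', 'chr4', 'chrX',
        'chromosomeIndex', 'chromosomeLabels', 'chromosomeStarts',
        'genome', 'positionIndex', 'resolution']

print(missing_hdf5_keys(keys, ["chr2L", "chr2R", "chrX"]))  # prints []
print(missing_hdf5_keys(keys, ["chrY"]))                    # prints ['chrY']
```

With h5py, the same check would be called as `missing_hdf5_keys(f.keys(), chromosomes)` on the open file `f`.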
ChIP-seq data
.bedgraph. optimalTAD supports the standard bedGraph format with 4 columns: chromName, chromStart, chromEnd, dataValue. Columns must be separated by a single space (' ').
import pandas as pd
data = pd.read_csv("Filename.bedgraph", sep = ' ', header = None, names=['Chr', 'Start', 'End', 'Score'])
data.head()
     Chr  Start     End     Score
0  chr2L      0   20000  1.904566
1  chr2L  20000   40000  2.963382
2  chr2L  40000   60000  2.944759
3  chr2L  60000   80000  4.394352
4  chr2L  80000  100000  3.742936
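Because the separator must be a space rather than the tab found in many bedGraph files, a quick format check can catch problems early. The sketch below uses only the standard library; `read_bedgraph` is a hypothetical helper, and the rule that chromEnd must be greater than chromStart is an assumption of this example, not a documented optimalTAD requirement.

```python
import io


def read_bedgraph(handle):
    """Parse space-separated bedGraph lines into (chrom, start, end, value) tuples."""
    records = []
    for lineno, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split(" ")
        if len(fields) != 4:
            # A tab-separated line ends up here, since it has no spaces to split on.
            raise ValueError(
                f"line {lineno}: expected 4 space-separated columns, got {len(fields)}"
            )
        chrom, start, end, value = fields
        start, end = int(start), int(end)
        if end <= start:  # assumption: intervals must be non-empty
            raise ValueError(f"line {lineno}: chromEnd must be greater than chromStart")
        records.append((chrom, start, end, float(value)))
    return records


# The first two rows of the example above, as they would appear in the file.
sample = io.StringIO("chr2L 0 20000 1.904566\nchr2L 20000 40000 2.963382\n")
records = read_bedgraph(sample)
print(records[0])  # prints ('chr2L', 0, 20000, 1.904566)
```

For a file on disk, the same function can be called as `read_bedgraph(open("Filename.bedgraph"))`.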
.bw. bigWig files are also supported (the .BigWig extension is accepted as well).