BAM

Reference

BAMReader

Class for reading sorted and indexed BAM files.

PivotRegion

Class for storing and calculating methylation entropy, epipolymorphism and PDR stats.

Usage

BSXplorer offers BAMReader to convert BAM files into methylation reports or calculate such methylation statistics as methylation entropy [1], epipolymorphism [2] or PDR [3].

Note

Input BAM file must be sorted and have index .bai file.

Report mode

For reading BAM files and converting them into methylation report format set BAMReader use BAMReader.report_iter(). This method returns iterator over BAM file and yields data batches in UniversalBatch format.

import bsxplorer as bsx

reader = bsx.BAMReader(
    bam_filename="path/to/bamfile.bam",
    index_filename="path/to/bamfile.bam.bai",
)

# Simple iteration over BAM file
for batch in reader.report_iter():
    do_something(batch)

# Batch can be saved into file
with bsx.UniversalWriter("path/to/output.txt", report_type="bismark") as writer:
    for batch in reader.report_iter():
        writer.write(batch)

For full cytosine methylation report output, cytosine data file needs to be prepared with SequenceFile.preprocess_cytosines(). Path to prepocessed cytosine file should be specified with cytosine_file parameter in BAMReader constructor.

cytosine_filepath = "path/to/cytosines.parquet"
bsx.SequenceFile("path/to/sequence.fasta").preprocess_cytosines(cytosine_filepath)

reader = bsx.BAMReader(
    bam_filename="path/to/bamfile.bam",
    index_filename="path/to/bamfile.bam.bai",
    cytosine_file=cytosine_filepath
)

If there is no need to analyze full BAM file, but only some genomic regions, user can specify the regions parameter in BAMReader constructor with DataFrame generated with Genome methods.

regions = bsx.Genome.from_gff("path/to/annot.gff").gene_body(min_length=2000)

reader = bsx.BAMReader(
    bam_filename="path/to/bamfile.bam",
    index_filename="path/to/bamfile.bam.bai",
    regions=regions
)

for batch in reader:
    # The reader will return batch for each region individually
    ...

Note

If genomic regions in annotation file overlap, resulting batch data will be duplicated.

For QC calculation, set qc parameter to True. After reading BAM file QC stats can be plotted with BAMReader.plot_qc() method.

Resulting image:

images/bam/QC_example.png

Stats mode

In “stats” mode BSXplorer can calculate such methylation statistics as:

  • Methylation entropy [1] which is calculated as the entropy of epialleles originating from a single genomic locus.

    \[ME = \frac{1}{L}\sum_i{-\frac{n_i}{N}\cdot Log\frac{n_i}{N}}\]

    \(L\) – window length, \(n_i\) – number of pattern occurencies in window, \(N\) – total number of different methylation patterns in window.

  • Epipolymorphism [2] also captures the amount of heterogeneity of DNA methylation.

    \[EPM = 1 - \sum_i{(\frac{n_i}{N})^2}\]

    \(n_i\) – number of pattern occurencies in window, \(N\) – total number of different methylation patterns in window.

  • PDR (Proportion of discordant reads) [3] – a fraction of reads carrying cytosines in discordant methylation states (i.e. containing both methylated and unmethylated cytosines in a single read) with respect to all reads mapped to a cytosines.

    \[PDR = \frac{d}{c+d}\]

    \(d\) – number of discordant reads, \(c\) – number of concordant reads.

To calculate these methylation statistics, method BAMReader.stats_iter() need to be called to get the iterator. Iterator returns PivotRegion. Methods PivotRegion.methylation_entropy(), PivotRegion.epipolymorphism() and PivotRegion.PDR() can be called to get ME, EPM and PDR values. See documentation for detailed explanation of these methods.

Example for ME:

reader = bsx.BAMReader(
        bam_filename="path/to/bamfile.bam",
        index_filename="path/to/bamfile.bam.bai",
    )

for batch in reader.stats_iter():
    positions, value = batch.methylation_entropy()