BAM
Reference
Class for reading sorted and indexed BAM files. |
|
Class for storing and calculating methylation entropy, epipolymorphism and PDR stats. |
Usage
BSXplorer offers BAMReader
to convert BAM files into methylation
reports or calculate such methylation statistics as methylation entropy [1],
epipolymorphism [2] or PDR [3].
Note
Input BAM file must be sorted and have index .bai file.
Report mode
For reading BAM files and converting them into methylation report format
set BAMReader
use BAMReader.report_iter()
. This method returns
iterator over BAM file and yields data batches in UniversalBatch
format.
import bsxplorer as bsx
reader = bsx.BAMReader(
bam_filename="path/to/bamfile.bam",
index_filename="path/to/bamfile.bam.bai",
)
# Simple iteration over BAM file
for batch in reader.report_iter():
do_something(batch)
# Batch can be saved into file
with bsx.UniversalWriter("path/to/output.txt", report_type="bismark") as writer:
for batch in reader.report_iter():
writer.write(batch)
For full cytosine methylation report output, cytosine data file needs to be
prepared with SequenceFile.preprocess_cytosines()
. Path to prepocessed
cytosine file should be specified with cytosine_file parameter in
BAMReader
constructor.
cytosine_filepath = "path/to/cytosines.parquet"
bsx.SequenceFile("path/to/sequence.fasta").preprocess_cytosines(cytosine_filepath)
reader = bsx.BAMReader(
bam_filename="path/to/bamfile.bam",
index_filename="path/to/bamfile.bam.bai",
cytosine_file=cytosine_filepath
)
If there is no need to analyze full BAM file, but only some genomic regions,
user can specify the regions parameter in BAMReader
constructor
with DataFrame generated with Genome
methods.
regions = bsx.Genome.from_gff("path/to/annot.gff").gene_body(min_length=2000)
reader = bsx.BAMReader(
bam_filename="path/to/bamfile.bam",
index_filename="path/to/bamfile.bam.bai",
regions=regions
)
for batch in reader:
# The reader will return batch for each region individually
...
Note
If genomic regions in annotation file overlap, resulting batch data will be duplicated.
For QC calculation, set qc parameter to True. After reading BAM file QC stats
can be plotted with BAMReader.plot_qc()
method.
Resulting image:
![images/bam/QC_example.png](images/bam/QC_example.png)
Stats mode
In “stats” mode BSXplorer
can calculate such methylation statistics
as:
Methylation entropy [1] which is calculated as the entropy of epialleles originating from a single genomic locus.
\[ME = \frac{1}{L}\sum_i{-\frac{n_i}{N}\cdot Log\frac{n_i}{N}}\]\(L\) – window length, \(n_i\) – number of pattern occurencies in window, \(N\) – total number of different methylation patterns in window.
Epipolymorphism [2] also captures the amount of heterogeneity of DNA methylation.
\[EPM = 1 - \sum_i{(\frac{n_i}{N})^2}\]\(n_i\) – number of pattern occurencies in window, \(N\) – total number of different methylation patterns in window.
PDR (Proportion of discordant reads) [3] – a fraction of reads carrying cytosines in discordant methylation states (i.e. containing both methylated and unmethylated cytosines in a single read) with respect to all reads mapped to a cytosines.
\[PDR = \frac{d}{c+d}\]\(d\) – number of discordant reads, \(c\) – number of concordant reads.
To calculate these methylation statistics, method BAMReader.stats_iter()
need to be
called to get the iterator. Iterator returns PivotRegion
. Methods
PivotRegion.methylation_entropy()
, PivotRegion.epipolymorphism()
and
PivotRegion.PDR()
can be called to get ME, EPM and PDR values. See documentation
for detailed explanation of these methods.
Example for ME:
reader = bsx.BAMReader(
bam_filename="path/to/bamfile.bam",
index_filename="path/to/bamfile.bam.bai",
)
for batch in reader.stats_iter():
positions, value = batch.methylation_entropy()