Recent Posts
Calculating Mapping Statistics from a SAM/BAM file using SAMtools and awk
A BAM
file is the binary version of a SAM
file, a tab-delimited text file that contains sequence alignment data. Mapping tools, such as Bowtie 2
and BWA
, generate SAM
files as output when aligning sequence reads to large reference sequences. The head
of a SAM
file takes the following form:
@HD VN:1.5 SO:coordinate
@SQ SN:ref LN:45
r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
r003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0;
r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
The header section is denoted by the character @
followed by one of the two-letter header record type codes. The table below describes the available predefined tags in the header section of a SAM
file:
Tag | Description |
---|---|
@HD |
The header line. The first line if present. |
@SQ |
Reference sequence dictionary. The order of @SQ lines defines the alignment sorting order. |
@RG |
Read group. Unordered multiple @RG lines are allowed. |
@PG |
Program |
@CO |
One-line text comment. Unordered multiple @CO lines are allowed. |
Simulating Sequence Reads from a Reference Genome with wgsim
wgsim
is a tool within the SAMtools software package that allows the simulation of FASTQ reads from a FASTA reference. It can simulate diploid genomes with single nucleotide polymorphisms (SNP) and insertion/deletion (indels), and create reads with uniform substitution sequencing errors. It is particularly useful when a FASTA file needs to be included in a mapping pipeline that expects FASTQ files (e.g. variant calling with closed genomes).
Warning and Error Handling with tryCatch()
A few weeks ago, I worked on an implementation of Fisher’s exact test in R. The script expects a data frame with rows representing the various cases/phenotype of my bacterium, and columns corresponding to the presence or absence of certain genes as detected by SRST2. Fisher’s exact test, which is said to work well with small sample sizes, examines the association between two categorical variables (e.g. whether the presence or absence of a gene is linked to the manifestation of a phenotype). Implementing it in R was a matter of calling the fisher.test()
function on a 2x2 contingency table called count.df
to generate the p-values:
How to Run a Process as a Background Task
To my disappointment, much of my experience in wrangling DNA sequences involves short bacterial genomes. My workstation, which features a 4.0-GHz 8-core processor and 32 Gb RAM, had been sufficient for my purposes, but only until I encountered the problem of aligning 147 whole-genome sequences prior to tree building. Even though MAFFT allows multithreading, my workstation would have taken weeks to complete the process. For fear of not graduating in time, using our project-based server suddenly became compulsory.
Standard Deviation and Variance
One of the tools that we discussed in our Data Analytics class last week was canonical correlation analysis (CCA). I won’t delve into CCA as I haven’t fully understood it yet, but my limited apprehension of this paper by Sherry and Henson (2005) tells me that it examines the relationship between two variable sets (usually in the form of predictor and response variables) using their shared variance. When I tried to define variance on my own without referring to a search engine, the only definition I came up with is that variance is the square of standard deviation (SD). Evidently, my understanding of the concept is insufficient, prompting me to brush up on the topic some people would categorize as basic.