Sarah Penir

I'm a knowledge hoarder. This is a self-serving space that compiles some of the things I've learned over time, mostly about coding, bioinformatics, and statistics.

Calculating Mapping Statistics from a SAM/BAM file using SAMtools and awk

November 19, 2017

A BAM file is the binary version of a SAM file, a tab-delimited text file that contains sequence alignment data. Mapping tools, such as Bowtie 2 and BWA, generate SAM files as output when aligning sequence reads to large reference sequences. The head of a SAM file takes the following form:

@HD	VN:1.5	SO:coordinate
@SQ	SN:ref	LN:45
r001	99	ref	7	30	8M2I4M1D3M	=	37	39	TTAGATAAAGGATACTG	*
r002	0	ref	9	30	3S6M1P1I4M	*	0	0	AAAAGATAAGGATA	*
r003	0	ref	9	30	5S6M	*	0	0	GCCTAAGCTAA	*	SA:Z:ref,29,-,6H5M,17,0;
r004	0	ref	16	30	6M14N5M	*	0	0	ATAGCTTCAGC	*

The header section is denoted by the character @ followed by one of the two-letter header record type codes. The table below describes the available predefined tags in the header section of a SAM file:

Tag	Description
`@HD`	The header line. The first line if present.
`@SQ`	Reference sequence dictionary. The order of `@SQ` lines defines the alignment sorting order.
`@RG`	Read group. Unordered multiple `@RG` lines are allowed.
`@PG`	Program
`@CO`	One-line text comment. Unordered multiple `@CO` lines are allowed.

Simulating Sequence Reads from a Reference Genome with wgsim

November 5, 2017

wgsim is a tool within the SAMtools software package that allows the simulation of FASTQ reads from a FASTA reference. It can simulate diploid genomes with single nucleotide polymorphisms (SNP) and insertion/deletion (indels), and create reads with uniform substitution sequencing errors. It is particularly useful when a FASTA file needs to be included in a mapping pipeline that expects FASTQ files (e.g. variant calling with closed genomes).

Warning and Error Handling with tryCatch()

October 22, 2017

A few weeks ago, I worked on an implementation of Fisher’s exact test in R. The script expects a data frame with rows representing the various cases/phenotype of my bacterium, and columns corresponding to the presence or absence of certain genes as detected by SRST2. Fisher’s exact test, which is said to work well with small sample sizes, examines the association between two categorical variables (e.g. whether the presence or absence of a gene is linked to the manifestation of a phenotype). Implementing it in R was a matter of calling the fisher.test() function on a 2x2 contingency table called count.df to generate the p-values:

How to Run a Process as a Background Task

October 8, 2017

To my disappointment, much of my experience in wrangling DNA sequences involves short bacterial genomes. My workstation, which features a 4.0-GHz 8-core processor and 32 Gb RAM, had been sufficient for my purposes, but only until I encountered the problem of aligning 147 whole-genome sequences prior to tree building. Even though MAFFT allows multithreading, my workstation would have taken weeks to complete the process. For fear of not graduating in time, using our project-based server suddenly became compulsory.

Standard Deviation and Variance

October 2, 2017

One of the tools that we discussed in our Data Analytics class last week was canonical correlation analysis (CCA). I won’t delve into CCA as I haven’t fully understood it yet, but my limited apprehension of this paper by Sherry and Henson (2005) tells me that it examines the relationship between two variable sets (usually in the form of predictor and response variables) using their shared variance. When I tried to define variance on my own without referring to a search engine, the only definition I came up with is that variance is the square of standard deviation (SD). Evidently, my understanding of the concept is insufficient, prompting me to brush up on the topic some people would categorize as basic.