Simulating Sequence Reads from a Reference Genome with wgsim
wgsim
is a tool within the SAMtools software package that allows the simulation of FASTQ reads from a FASTA reference. It can simulate diploid genomes with single nucleotide polymorphisms (SNP) and insertion/deletion (indels), and create reads with uniform substitution sequencing errors. It is particularly useful when a FASTA file needs to be included in a mapping pipeline that expects FASTQ files (e.g. variant calling with closed genomes).
The various flags used by wgsim
are as follows:
-e FLOAT base error rate [0.02]
-d INT outer distance between the two ends [500]
-s INT standard deviation [50]
-N INT number of read pairs [1000000]
-1 INT length of the first read [70]
-2 INT length of the second read [70]
-r FLOAT rate of mutations [0.001]
-R FLOAT fraction of indels [0.15]
-X FLOAT probability an indel is extended [0.30]
-S INT seed for random generator [-1]
-A FLOAT disgard if the fraction of ambiguous bases higher than FLOAT [0.05]
-h haplotype mode
This is how I used wgsim
on the reference sequences that I wanted to include in my variant calling pipeline:
wgsim -1 300 -2 300 -r 0 -R 0 -X 0 -e 0 reference.fasta reference_1.fastq reference_2.fastq
wgsim -1 300 -2 300
: Instructs wgsim to create paired-end reads of length 300 bp. The use of 300/300 read length is consistent with my other samples’ read lengths (sequenced using MiSeq Reagent Kits v3 2x300).-e 0 -r 0 -R 0 -X 0
: No errors, mutations, or indels were introduced.reference.fasta
: The reference genomereference_1.fastq reference_2.fastq
: File names of the output simulated reads
Leave a Comment