Variant calling:
There are three 3 steps to call SNP with ANSGD
0.Filteration
When the input is BAM file
angsd -bam ... parseArgs_bambi.cpp: bam reader: -bam/-b (null) (list of BAM/CRAM files) -i (null) (Single BAM/CRAM file) -r (null) Supply a single region in commandline (see examples below) -rf (null) Supply multiple regions in a file (see examples below) -remove_bads 1 Discard ‘bad‘ reads, (flag >=256) -uniqueOnly 0 Discards reads that doesn‘t map uniquely -show 0 Mimic ‘samtools mpileup‘ also supply -ref fasta for printing reference column -minMapQ 0 Discard reads with mapping quality below -minQ 13 Discard bases with base quality below -trim 0 Number of based to discard at both ends of the reads -trim 0 Number of based to discard at 5‘ ends of the reads -trim 0 Number of based to discard at 3‘ ends of the reads -only_proper_pairs 1 Only use reads where the mate could be mapped -C 0 adjust mapQ for excessive mismatches (as SAMtools), supply -ref -baq 0 adjust qscores around indels (1=normal baq 2= extended(as SAMtools)), supply -ref -redo-baq 0 (recompute baq, instead of using BQ tag) -checkBamHeaders 1 Exit if difference in BAM headers -doCheck 1 Keep going even if datafile is not suffixed with .bam/.cram -downSample 0.000000 Downsample to the fraction of original data -nReads 50 Number of reads to pop from each BAM/CRAMs -minChunkSize 250 Minimum size of chunk sent to analyses --ignore-RG 1 (dev only) +RG (null) Readgroups to include in analysis(can be filename) Examples for region specification: chr: Use entire chromosome: chr chr:start- Use region from start to end of chr chr:-stop Use region from beginning of chromosome: chr to stop chr:start-stop Use region from start to stop from chromosome: chr chr:site Use single site on chromosome: chr
When the input is GL file:
multiReader.cpp: -nLines 50 (Number of lines to read) -beagle (null) (Beagle Filename (can be .gz)) -vcf-GL (null) (vcf Filename (can be bcf compressed or uncompressed)) -vcf-PL (null) (vcf Filename (can be bcf compressed or uncompressed)) -vcf-GP (null) (vcf Filename (can be bcf compressed or uncompressed))(*not used) -glf (null) (glf Filename (can be .gz)) -pileup (null) (pileup Filename (can be .gz)) -intName 1 (Assume First column is chr_position) -isSim 0 (Simulated data assumes ancestral is A) -nInd 0 (Number of individuals) -minQ 13 (minimum base quality; only used in pileupreader) -fai (null) (fai file) -minQ 13 (minimum base quality; only used in pileupreader)
1.Calculate genotype likelihood
Using GL method:
angsd -GL ... -GL=0: 1: SAMtools 2: GATK 3: SOAPsnp 4: SYK 5: phys 6: Super simple sample an allele type GL. (1.0,0.5,0.0) 7: outgroup gls -trim 0 (zero means no trimming) -tmpdir angsd_tmpdir/ (used by SOAPsnp) -errors (null) (used by SYK) -minInd 0 (0 indicates no filtering) Filedumping: -doGlf 0 1: binary glf (10 log likes) .glf.gz 2: beagle likelihood file .beagle.gz 3: binary 3 times likelihood .glf.gz 4: text version (10 log likes) .glf.gz
1 angsd -b $BamList 2 -ref $REF -out 3_variant_calling/angsd_test/prefix 3 -uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -trim 0 -C 50 -baq 1 4 -minMapQ 20 -minQ 20 -minInd 0.7*nInd -setMinDepth 4 -setMaxDepth 30 -doCounts 1 5 -GL 2 -doGlf 2 6 -P 6 \
7 -rf chr.list \
2.Identify major and minor alleles.
angsd -doMajorMinor ... -doMajorMinor 0 1: Infer major and minor from GL 2: Infer major and minor from allele counts 3: use major and minor from a file (requires -sites file.txt) 4: Use reference allele as major (requires -ref) 5: Use ancestral allele as major (requires -anc) -rmTrans: remove transitions 0 -skipTriallelic 0
3.Estimate allele frequencies.
angsd -doMaf ... -doMaf 0 (Calculate persite frequencies ‘.mafs.gz‘) 1: Frequency (fixed major and minor) 2: Frequency (fixed major unknown minor) 4: Frequency from genotype probabilities 8: AlleleCounts based method (known major minor) NB. Filedumping is supressed if value is negative -doPost 0 (Calculate posterior prob 3xgprob) 1: Using frequency as prior 2: Using uniform prior 3: Using SFS as prior (still in development) 4: Using reference panel as prior (still in development), requires a site file with chr pos major minor af ac an Filters: -minMaf -1.000000 (Remove sites with MAF below) -SNP_pval 1.000000 (Remove sites with a pvalue larger) -rmTriallelic 0.000000 (Remove sites with a pvalue lower) Extras: -ref (null) (Filename for fasta reference) -anc (null) (Filename for fasta ancestral) -eps 0.001000 [Only used for -doMaf &8] -beagleProb 0 (Dump beagle style postprobs) -indFname (null) (file containing individual inbreedcoeficients) -underFlowProtect 0 (file containing individual inbreedcoeficients) NB These frequency estimators requires major/minor -doMajorMinor
4.SNP
There are two main ways to call SNPs using ANGSD with these options:
-minMaf 0.000000 (Remove sites with MAF below) -SNP_pval 1.000000 (Remove sites with a pvalue larger)
bam file input:
angsd -b bam.list -ref $REF -out Results/cohort -uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -trim 0 -C 50 -baq 1 -minMapQ 20 -minQ 20 -minInd 0.75*nInd -setMinDepth 4 -setMaxDepth 30 -doCounts 1 -GL 1 -doGlf 2 -doMajorMinor 1 -doMaf 1 -minMaf 0.05 -SNP_pval 1e-6
GL file input;
angsd -glf cohort.glf -nInd 127 -fai ref_genome_index -out Results/Cohort \ -doMajorMinor 1 -doMaf 1 -skipTriallelic 1 \ -SNP_pval 1e-6
PCA:
The covariance matrix can also be inferred from genotype likelihoods using PCAangsd. PCAngsd takes as input genotype likelihoods in beagle format
1.Obtain covariance matrix:
python ~/Software/pcangsd/pcangsd.py -beagle Results/MME_ANGSD_PCA_LDpruned.beagle.gz -o Results/PCAngsd_LDpruned_covmat
2.PCA analysis using R.
Admixture:
Using NGSadmix inferring admixture proportions from genotype likelihoods
NGSadmix -likes $BASEDIR‘/Results/MME_ANGSD_PCA_LDpruned.beagle.gz/ -K 2 -o $BASEDIR‘/Results/MME_LDpruned_ngsAdmix_K2_out‘
Fst calculation:
1. Estimate of the Site Frequency Spectrum
Using bam files of each population
2. Sompute per-site FST indexes
3. Sliding-window Fst
Nucleotide diversity(theta, Tajima‘s D, PI):
1. compute the allele frequency posterior probabilities and associated statistics (-doThetas) using the SFS as prior information (Don‘t forget the outgroup genome as an ancestor)
2. perform a sliding windows analysis
3. Calculate PI