Low coverage sequencing data analysis with ANSGD

Variant calling:

There are three 3 steps to call SNP with ANSGD

0.Filteration

When the input is BAM file

Low coverage sequencing data analysis with ANSGD
angsd -bam
...
parseArgs_bambi.cpp: bam reader:
    -bam/-b        (null)    (list of BAM/CRAM files)
    -i        (null)    (Single BAM/CRAM file)
    -r        (null)    Supply a single region in commandline (see examples below)
    -rf        (null)    Supply multiple regions in a file (see examples below)
    -remove_bads    1    Discard bad reads, (flag >=256) 
    -uniqueOnly    0    Discards reads that doesnt map uniquely
    -show        0    Mimic samtools mpileup also supply -ref fasta for printing reference column
    -minMapQ    0    Discard reads with mapping quality below
    -minQ        13    Discard bases with base quality below
    -trim        0    Number of based to discard at both ends of the reads
    -trim        0    Number of based to discard at 5 ends of the reads
    -trim        0    Number of based to discard at 3 ends of the reads
    -only_proper_pairs 1    Only use reads where the mate could be mapped
    -C        0    adjust mapQ for excessive mismatches (as SAMtools), supply -ref
    -baq        0    adjust qscores around indels (1=normal baq 2= extended(as SAMtools)), supply -ref
    -redo-baq        0 (recompute baq, instead of using BQ tag)
    -checkBamHeaders 1    Exit if difference in BAM headers
    -doCheck    1    Keep going even if datafile is not suffixed with .bam/.cram
    -downSample    0.000000    Downsample to the fraction of original data
    -nReads        50    Number of reads to pop from each BAM/CRAMs
    -minChunkSize    250    Minimum size of chunk sent to analyses
    --ignore-RG    1    (dev only)
    +RG    (null)    Readgroups to include in analysis(can be filename)

Examples for region specification:
        chr:        Use entire chromosome: chr
        chr:start-    Use region from start to end of chr
        chr:-stop    Use region from beginning of chromosome: chr to stop
        chr:start-stop    Use region from start to stop from chromosome: chr
        chr:site    Use single site on chromosome: chr
BAM input

When the input is GL file:

Low coverage sequencing data analysis with ANSGD
multiReader.cpp:
        -nLines 50      (Number of lines to read)
        -beagle (null)  (Beagle Filename (can be .gz))
        -vcf-GL (null)  (vcf Filename (can be bcf compressed or uncompressed))
        -vcf-PL (null)  (vcf Filename (can be bcf compressed or uncompressed))
        -vcf-GP (null)  (vcf Filename (can be bcf compressed or uncompressed))(*not used)
        -glf    (null)  (glf Filename (can be .gz))
        -pileup (null)  (pileup Filename (can be .gz))
        -intName 1      (Assume First column is chr_position)
        -isSim  0       (Simulated data assumes ancestral is A)
        -nInd   0               (Number of individuals)
        -minQ   13      (minimum base quality; only used in pileupreader)
        -fai    (null)  (fai file)
        -minQ   13      (minimum base quality; only used in pileupreader)
GL input

 

1.Calculate genotype likelihood

Using GL method:

Low coverage sequencing data analysis with ANSGD
angsd -GL
...
-GL=0: 
    1: SAMtools
    2: GATK
    3: SOAPsnp
    4: SYK
    5: phys
    6: Super simple sample an allele type GL. (1.0,0.5,0.0)
    7: outgroup gls
    -trim        0        (zero means no trimming)
    -tmpdir        angsd_tmpdir/    (used by SOAPsnp)
    -errors        (null)        (used by SYK)
    -minInd        0        (0 indicates no filtering)

Filedumping:
    -doGlf    0
    1: binary glf (10 log likes)    .glf.gz
    2: beagle likelihood file    .beagle.gz
    3: binary 3 times likelihood    .glf.gz
    4: text version (10 log likes)    .glf.gz
Tool option
1 angsd -b $BamList 2         -ref $REF -out 3_variant_calling/angsd_test/prefix 3         -uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -trim 0 -C 50 -baq 1 4         -minMapQ 20 -minQ 20 -minInd 0.7*nInd -setMinDepth 4 -setMaxDepth 30 -doCounts 1 5         -GL 2 -doGlf 2 6         -P 6 \
7 -rf chr.list \

 

2.Identify major and minor alleles. 

Low coverage sequencing data analysis with ANSGD
angsd -doMajorMinor
...
        -doMajorMinor   0
        1: Infer major and minor from GL
        2: Infer major and minor from allele counts
        3: use major and minor from a file (requires -sites file.txt)
        4: Use reference allele as major (requires -ref)
        5: Use ancestral allele as major (requires -anc)
        -rmTrans: remove transitions 0
        -skipTriallelic 0
Tool option

 

3.Estimate allele frequencies.

Low coverage sequencing data analysis with ANSGD
angsd -doMaf
...
-doMaf  0 (Calculate persite frequencies .mafs.gz)
        1: Frequency (fixed major and minor)
        2: Frequency (fixed major unknown minor)
        4: Frequency from genotype probabilities
        8: AlleleCounts based method (known major minor)
        NB. Filedumping is supressed if value is negative
-doPost 0       (Calculate posterior prob 3xgprob)
        1: Using frequency as prior
        2: Using uniform prior
        3: Using SFS as prior (still in development)
        4: Using reference panel as prior (still in development), requires a site file with chr pos major minor af ac an
Filters:
        -minMaf         -1.000000       (Remove sites with MAF below)
        -SNP_pval       1.000000        (Remove sites with a pvalue larger)
        -rmTriallelic   0.000000        (Remove sites with a pvalue lower)
Extras:
        -ref    (null)  (Filename for fasta reference)
        -anc    (null)  (Filename for fasta ancestral)
        -eps    0.001000 [Only used for -doMaf &8]
        -beagleProb     0 (Dump beagle style postprobs)
        -indFname       (null) (file containing individual inbreedcoeficients)
        -underFlowProtect       0 (file containing individual inbreedcoeficients)
NB These frequency estimators requires major/minor -doMajorMinor
angsd -doMaf

 

4.SNP

There are two main ways to call SNPs using ANGSD with these options:

        -minMaf         0.000000        (Remove sites with MAF below)
        -SNP_pval       1.000000        (Remove sites with a pvalue larger)

bam file input:

angsd -b bam.list -ref $REF -out Results/cohort         -uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -trim 0 -C 50 -baq 1         -minMapQ 20 -minQ 20 -minInd 0.75*nInd -setMinDepth 4 -setMaxDepth 30 -doCounts 1         -GL 1 -doGlf 2 -doMajorMinor 1 -doMaf 1 -minMaf 0.05 -SNP_pval 1e-6

 

GL file input;

angsd -glf cohort.glf -nInd 127 -fai ref_genome_index -out Results/Cohort \  
        -doMajorMinor 1 -doMaf 1 -skipTriallelic 1 \  
        -SNP_pval 1e-6

 

PCA:

The covariance matrix can also be inferred from genotype likelihoods using PCAangsd. PCAngsd takes as input genotype likelihoods in beagle format

1.Obtain covariance matrix:

python ~/Software/pcangsd/pcangsd.py -beagle Results/MME_ANGSD_PCA_LDpruned.beagle.gz -o Results/PCAngsd_LDpruned_covmat

2.PCA analysis using R.

Admixture:

Using NGSadmix inferring admixture proportions from genotype likelihoods

NGSadmix -likes $BASEDIR/Results/MME_ANGSD_PCA_LDpruned.beagle.gz/ -K 2 -o $BASEDIR/Results/MME_LDpruned_ngsAdmix_K2_out

 

Fst calculation:

1. Estimate of the Site Frequency Spectrum

Using bam files of each population

2. Sompute per-site FST indexes

3. Sliding-window Fst

Nucleotide diversity(theta, Tajima‘s D, PI):

1. compute the allele frequency posterior probabilities and associated statistics (-doThetas) using the SFS as prior information (Don‘t forget the outgroup genome as an ancestor)

2. perform a sliding windows analysis

3. Calculate PI

Low coverage sequencing data analysis with ANSGD

上一篇:rabbmq突然无法访问排查


下一篇:表格操作