pyAmpli: an amplicon-based variant filter pipeline for targeted enriched resequencing data
Cite and read our software note on BMC Bioinformatics
Matthias Beyens, Nele Boeckx, Guy Van Camp, Ken Op de Beeck and Geert Vandeweyer.
pyAmpli: an amplicon-based variant filter pipeline for targeted enriched resequencing data (2017),
18:554, DOI:10.1186/s12859-017-1985-1, BMC Bioinformatics
Questions or ideas? Feel free to contact me: matthias dot beyens at uantwerpen dot be
Content
1) Installation 2) Getting started on example data 3) Getting started on your own data 4) What it does? 5) References
Installation
Fully functional on Ubuntu 14.04.4 LTS platform with Python v2.7.6
git clone https://github.com/MBeyens/pyAmpli
cd pyAmpli
python setup.py install
# if permission issues: python setup.py install --user
# or run with super user
If running setup.py without errors, the required dependencies should be installed correctly (if not check dependencies.md for manual installation).
pyAmpli.py -h
If this gives problems finding the binary (rare cases), please append the pyAmpli binary to your systems PATH by doing so
export PATH=/ABSOLUTE/PATH/WHERE/YOU/COMPILED/PYAMPLI/pyampli:$PATH
Getting started on example data
Create example directory
mkdir example_data
Download example data and copy it in example_data. You do not need to provide any BAM index files, pyAmpli will do the indexing for you!
https://tinyurl.com/m6u7qr9
Note 1
Be sure you have specified the absolute path of the hg19 genome in the configuration file (config.yaml) before running the example. You can download the full hg19 genome from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/
Note 2
If not using the full hg19 genome sequence, please download the hg19_chr1.fa.gz from the example data repository and unzip this file. This hg19 human genome contains only chr1 sequences! Be sure to specify the absolute path in the configuration file (config.yaml).
gunzip hg19_chr1.fa.gz
Run pyAmpli example as follows
cd example_data
pyAmpli.py somatic \
-c example_config.yaml \
-bn normal_sample_chr1.bam \
-bt tumor_sample_chr1.bam \
-v somatic_variants_chr1.vcf \
-d amplicon_design_chr1.bed \
-od example_data_output/
Getting started on your own data
Data preparation
Adjust configuration file (bin/config.yaml)
REQUIRED
reference: hg19.fasta # /ABSOLUTE/PATH/TO/YOUR/REFERENCE.FILE.FASTA
OPTIONAL
cores: 4 # Running on 4 cores
change default parameters # Optimal for somatic filtering
Filter modes
usage: pyAMPLI [-h] {germline,somatic} ...
positional arguments:
{germline,somatic} commands
germline Input arguments for germline amplicon filter
somatic Input arguments for somatic amplicon filter
Somatic modus (default parameters can be found in configuration file)
usage: pyAMPLI somatic [-h] -bn BAM_NORMAL -bt BAM_TUMOR -v VCF -d DESIGN
[-f FILENAME] [-od OUTDIR]
optional arguments:
-h, --help show this help message and exit
-bn BAM_NORMAL, --bam_normal BAM_NORMAL
Binary alignment file, BAM file of the normal sample. Index located in the
same directory
-bt BAM_TUMOR, --bam_tumor BAM_TUMOR
Binary alignment file, BAM file of the tumor sample. Index located in the
same directory
-v VCF, --vcf VCF Variant calling file, VCF file
-d DESIGN, --design DESIGN
Probe/amplicon design file. Contains field, is this
order: Contact your manufacturer for details
-f FILENAME, --filename FILENAME
Output file name
-od OUTDIR, --outdir OUTDIR
Output directory
-c CONFIG, --config CONFIG
Location of user-supplied configuration file
Germline modus (adjust parameters in configuration file)
usage: pyAMPLI germline [-h] -b BAM -v VCF -d DESIGN [-f FILENAME]
[-od OUTDIR]
optional arguments:
-h, --help show this help message and exit
-b BAM, --bam BAM Binary alignment file, BAM file
-v VCF, --vcf VCF Variant calling file, VCF file
-d DESIGN, --design DESIGN
Probe/amplicon design file. Contains field, is this
order: Contact your manufacturer for details
-f FILENAME, --filename FILENAME
Output file name
-od OUTDIR, --outdir OUTDIR
Output directory
-c CONFIG, --config CONFIG
Location of user-supplied configuration file
What it does?
See article for more detailed explanation and tool validation.
Variant categories
Variants are evaluated for each of the following criteria, in the order given here, and assigned to the first matching category. When all criteria are passed, a variant is classified as high quality, corresponding to the label AmpPass.
DepthFail | variants with low read evidence
In a first step, variants with insufficient coverage by genuine read-pairs are flagged as low read evidence variants, and not subjected to further variant filtering. FILTER field flags DepthFail and DepthFailTumor/Normal are set, respectively in germline and somatic modus. Users have the flexibility to define their own DepthFail cut-off by adjusting the min_depth_normal and/or min_depth_tumor values in the configuration file.
OneAmpPass | variants with panel design limitations
Variants covered by and present in a single theoretical amplicon, as a design limitation, might be more prone for systemic enrichment artefacts. As we have insufficient information to evaluate the reliability of these variants, they are flagged as OneAmpPass, and not subjected to further filtering.
LowAmpFail | variants with low amount of covered amplicons
When variants are covered by multiple theoretical amplicons, we can infer variant reliability based on the number of amplicons containing the variant. Variants covered by more than two overlapping theoretical amplicons are flagged as LowAmpFail if the alternative allele is present in reads corresponding to less than three of these amplicons. Variants covered by just two amplicons are handled separately as LowAmpFail if present in reads corresponding to only one of both amplicons.
MatchAmpPass | variants with low amount of covered amplicons
Variants covered by just two theoretical amplicons are handled separately as MatchAmpPass if the alternative allele is present in reads from both amplicons, to indicate the limited discriminative power.
PositionFail | positional biases
Variants only present in the first two positions of either 3’ or 5’ read ends are flagged as PositionFail. This enrichment artefact is typically seen in Haloplex gene panels, because fragments are reproducibly generated by restriction enzymes, which cut only recognized sequences and generate non-random fragments [Reference 1]. Users can adjust the min_read_pos (default 2) and min_read_pos_fraction (default 10) in the configuration file, i.e. variants will be flagged as PositionFail if more than 10% of the total reads contain the alternative allele in the first two positions of either 3’ or 5’ read ends.
NormalFail | low-fraction variants in normal samples
This filter is only applied in somatic mode and is more subjective to user settings. When considering paired tumor-normal samples, somatic variants are not expected to be present in the patient’s paired normal tissue sample. First, this can be indicative for a false-positive somatic variant in the tumor tissue sample, that is in fact a true-positive low-fraction germline variant in the normal sample. Secondly, it might be a systemic enrichment artefact that is more pronounced in the tumor sample and therefore called as somatic. Lastly, it could be a reliable somatic variant. This may be explained by field cancerization, which is the occurrence of genetic, epigenetic and biochemical aberrations in structurally intact cells in histologically normal tissue adjacent to cancerous lesions [Reference 2]. By default, somatic variants present in more than 1% of reads from the normal sample are flagged as NormalFail. To allow the effect of field cancerization, the user can adjust the threshold (min_frac) for flagging these variants in the configuration file.
AmpPass | threshold-passing variants
As mentioned above, variants passing all user-defined filters are flagged as high quality variants, using the AmpPass label.
References
- Samorodnitsky E, Jewell BM, Hagopian R, Miya J, Wing MR, Lyon E, Damodaran S, Bhatt D, Reeser JW, Datta J, Roychowdhury S. Evaluation of Hybridization Capture Versus Amplicon‐Based Methods for Whole‐Exome Sequencing. Hum Mutat. 2015;36:903-914.
- Slaughter DP, Southwick HW, Smejkal W. Field cancerization in oral stratified squamous epithelium: clinical implications of multicentric origin. Cancer. 1953;6:963–968.