Introducing intmap
What is intmap?
intmap is a Python package that provides several CLI modules for mapping the locations of genomic integration and translocation from NGS data. Such NGS data are almost invariably generated by either ligation-mediated (LM-)PCR or linear amplification-mediated (LAM-)PCR. While these PCR approaches are technically different, their final products are essentially the same: R1 NGS adapter – Integrant/Bait | Target/Prey – R2 NGS adapter. For both integration and translocation experiments, the integrant/bait sequences are known. The target/prey sequences, on the other hand, are not. The junction between the integrant/bait and target/prey sequences represents the position we are interested in mapping. This position represents the location in the genome that was targeted and cleaved by a viral integrase, a transposase, a CRISPR/Cas NPC, etc.
Until publication (in preparation), I’ll refrain from describing the inner workings of intmap in too much detail here. What follows are short demonstrations of intmap usage, intended to help users get started with the software.
The intmap repository can be found at https://github.com/gbedwell/intmap.
Installation
The following is the easiest way to install intmap. This will set up an intmap conda environment and install all dependencies within that environment.
git clone https://github.com/gbedwell/intmap.git
cd intmap
bash install.sh
The install script requires an initialized conda instance. Follow the instructions here to install conda, if it’s not already installed. OpenMP is also required to fully leverage FAISS’ parallelization capabilities (intmap will still run fine without it, just maybe a little slower). On Linux systems, OpenMP should be installed natively. For macOS, OpenMP can be installed using Homebrew:
brew install llvm libomp
For installation on macOS with OpenMP, replace bash install.sh above with bash install.sh --use-openmp.
Data
To run the following examples, you will need:
- The example_data directory from the GitHub repository.
- A Bowtie2 index of the human genome. I’m using an indexed version of hs1. The genome fasta file can be downloaded here. If you use a different genome build, the results might be slightly different, but everything should still run.
- A dummy virus/vector index. An HIV-1 genome index can be found in the example_data directory of the git repository. For the example data, what this index is doesn’t strictly matter, as there are no virus- or vector-derived contaminating sequences. Nevertheless, this is a required parameter for intmap.
- The dummy data. This is provided in the example_data directory of the git repository.
Running intmap
Fixed termini
Running intmap to map integration events/translocations in a system with predictable integrant/bait termini looks something like this:
intmap \
-r1 data/sample1_R1.fq.gz \
-r2 data/sample1_R2.fq.gz \
-ltr5 AGTCAGTGTGGA \
-ltr3 AAATCTCTAGCA \
-linker5 ATGAGCATTC \
-linker3 TACACGATTAC \
-nm fixed_example \
-bt2_idx_dir /Users/gbedwell/Documents/github/T2T_genome/indexes/bowtie2 \
-bt2_idx_name hs1 \
-v_idx_dir /Users/gbedwell/Documents/github/virus_genomes/HIV-1 \
-v_idx_name hiv1 > fixed_output.txt
All of these arguments are required. They define the data files, the sequences to look for on each read, the sample name, and the Bowtie2 index directories and index names to use for read alignment. ltr5 and ltr3 denote the two consecutive chunks of the integrant/bait search sequence. The sequence is split to allow more granular control of the error rate: the error rate in each chunk can be defined separately. In the given example, the actual search sequence is AGTCAGTGTGGAAAATCTCTAGCA. linker5 and linker3 operate in an identical way.
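To make the chunking concrete, here is a minimal Python sketch of how a two-chunk search sequence could be matched with a separate mismatch allowance for each chunk. It is an illustration only and not intmap's implementation: the function names and the max_mm5/max_mm3 parameters are hypothetical, and mismatches are counted as a simple Hamming distance with no indels.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatched positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def matches_chunks(read: str, chunk5: str, chunk3: str,
                   max_mm5: int = 0, max_mm3: int = 0) -> bool:
    """Check whether `read` starts with chunk5 followed by chunk3.

    Each chunk gets its own mismatch allowance (hypothetical parameters,
    not intmap arguments).
    """
    full_len = len(chunk5) + len(chunk3)
    if len(read) < full_len:
        return False
    part5 = read[:len(chunk5)]
    part3 = read[len(chunk5):full_len]
    return hamming(part5, chunk5) <= max_mm5 and hamming(part3, chunk3) <= max_mm3


# Chunk values taken from the example command.
ltr5 = "AGTCAGTGTGGA"
ltr3 = "AAATCTCTAGCA"

# The full search sequence is simply the two chunks joined end to end.
search_seq = ltr5 + ltr3
print(search_seq)  # AGTCAGTGTGGAAAATCTCTAGCA

# A read with one mismatch in the second chunk only.
read = ltr5 + "AAATCTCTAGCT" + "ACGTACGT"
print(matches_chunks(read, ltr5, ltr3))             # strict in both chunks
print(matches_chunks(read, ltr5, ltr3, max_mm3=1))  # one mismatch allowed in chunk 2
```

Splitting the sequence this way lets the chunk nearest the junction be matched more or less strictly than the rest, which is the kind of control the separate error rates are meant to provide.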
In addition to the required arguments, intmap has a myriad of optional arguments to fine-tune analysis parameters. Below, I will briefly describe some of the optional parameters that users might want to play around with the most.
- intmap is written to be highly parallelizable. By default, however, intmap will only utilize a single core. To fully leverage intmap’s built-in parallelization capabilities, the number of cores the program uses can be adjusted with nthr.
- The example data has 12-nucleotide barcodes on the linker end of each fragment. We know that these barcodes occur immediately before the sequence defined by linker5. We can therefore include the barcodes in the analysis by setting linker_umi_offset to 0 (telling the program that the UMIs are located immediately after the linker5 sequence) and linker_umi_len to 12 (defining the length of the UMIs).
- We know that the example data used here were generated in silico. Many of the experimental artifacts that appear in real-world data are therefore not present in these data. intmap’s default parameters are set to allow for “fuzzy” matching. That is, very similar reads, but not necessarily equivalent reads, are grouped together to mitigate artifactual differences stemming from sample handling/processing, sequencing, etc. For the idealized example data, we can turn these fuzzy matching parameters off. These parameters are: seq_sim (sequence similarity), len_diff (allowable length difference between fragments), umi_diff (allowable UMI distance), frag_ratio (the count ratio between fragments), min_count (the required number of sites at a given position to define that site as “abundant”), and count_fc (the fold-change required between adjacent abundant sites to call them the same). To turn these parameters off, we will set seq_sim to 1, len_diff and umi_diff to 0, and the others to a very large value.
- Lastly, we can tell intmap how to handle multimapping reads. The no_mm flag will tell intmap to completely ignore multimapping reads, but we don’t necessarily want to do that here. Instead, we know that the example data might contain clonal fragments, or a high percentage of fragments generated from a single integration/translocation event. In an attempt to identify those clonal sites as completely as possible, we will set the flag reassign_mm. This flag will tell the program to:
  - Compare multimapping fragments to uniquely mapping fragments and reassign any multimapping fragment that has a suitable match with a uniquely mapping fragment to the uniquely mapped position.
  - Cluster multimapping reads, retain groupings that contain more than mm_group_threshold * 100% of all multimapping reads (default 0.002), and reassign all reads in that group to one of the mapped positions.
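The group-retention rule in the second reassignment step can be sketched in a few lines of Python. This is only an illustration of the threshold arithmetic, assuming the clusters have already been formed: the retain_groups function, the cluster labels, and the read counts are all made up, and how intmap actually clusters multimapping reads is not shown here.

```python
def retain_groups(group_sizes: dict, mm_group_threshold: float = 0.002) -> dict:
    """Keep groups holding more than mm_group_threshold of all multimapping reads.

    `group_sizes` maps a (hypothetical) cluster label to its read count.
    """
    total = sum(group_sizes.values())
    if total == 0:
        return {}
    return {
        label: n
        for label, n in group_sizes.items()
        if n / total > mm_group_threshold
    }


# Made-up cluster sizes: 1000 multimapping reads in total.
groups = {"cluster_a": 950, "cluster_b": 49, "cluster_c": 1}

kept = retain_groups(groups)
print(kept)
```

With the default threshold of 0.002, a group must hold more than 0.2% of all multimapping reads to be retained, so the single-read cluster (0.1% of the total) is dropped while the other two are kept.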
Incorporating all of these options, the intmap run command will look like this:
intmap \
-r1 data/sample1_R1.fq.gz \
-r2 data/sample1_R2.fq.gz \
-ltr5 AGTCAGTGTGGA \
-ltr3 AAATCTCTAGCA \
-linker5 ATGAGCATTC \
-linker3 TACACGATTAC \
-nm fixed_example \
-bt2_idx_dir /Users/gbedwell/Documents/github/T2T_genome/indexes/bowtie2 \
-bt2_idx_name hs1 \
-v_idx_dir /Users/gbedwell/Documents/github/virus_genomes/HIV-1 \
-v_idx_name hiv1 \
-nthr 4 \
-linker_umi_offset 0 \
-linker_umi_len 12 \
-seq_sim 1 \
-len_diff 0 \
-umi_diff 0 \
-frag_ratio 1000 \
-min_count 1000 \
-count_fc 1000 \
--reassign_mm > fixed_output.txt
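Because the index directories above are specific to one machine, it can be convenient to assemble the command programmatically. The following Python sketch builds the same argument list from dictionaries. The build_intmap_cmd helper and the /path/to/... placeholders are hypothetical; only the option names are taken from the command above, and the sketch prints the command rather than running it.

```python
import shlex


def build_intmap_cmd(options: dict, flags: list) -> list:
    """Assemble an intmap argument list (hypothetical helper).

    `options` holds single-dash options with values; `flags` holds
    double-dash switches that take no value.
    """
    cmd = ["intmap"]
    for name, value in options.items():
        cmd += [f"-{name}", str(value)]
    cmd += [f"--{flag}" for flag in flags]
    return cmd


options = {
    "r1": "data/sample1_R1.fq.gz",
    "r2": "data/sample1_R2.fq.gz",
    "ltr5": "AGTCAGTGTGGA",
    "ltr3": "AAATCTCTAGCA",
    "linker5": "ATGAGCATTC",
    "linker3": "TACACGATTAC",
    "nm": "fixed_example",
    "bt2_idx_dir": "/path/to/bowtie2/indexes",  # placeholder: your genome index directory
    "bt2_idx_name": "hs1",
    "v_idx_dir": "/path/to/virus/indexes",      # placeholder: your virus/vector index directory
    "v_idx_name": "hiv1",
    "nthr": 4,
    "linker_umi_offset": 0,
    "linker_umi_len": 12,
    "seq_sim": 1,
    "len_diff": 0,
    "umi_diff": 0,
    "frag_ratio": 1000,
    "min_count": 1000,
    "count_fc": 1000,
}

cmd = build_intmap_cmd(options, flags=["reassign_mm"])
print(shlex.join(cmd))
```

Assuming intmap is on the PATH of the active conda environment, the resulting list could be passed to subprocess.run with stdout redirected to a file to reproduce the shell command.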
Truncated termini
Some systems, like AAV, do not generate predictable ends. Instead, the terminus of the integrant is often a truncated derivative of the full-length terminus from the viral genome. In the case of AAV, the situation is further complicated by the presence of two distinct orientations of the terminal repeat. intmap can easily handle these situations:
intmap \
-r1 data/sample3_R1.fq.gz \
-r2 data/sample3_R2.fq.gz \
-ltr5 AGGAACCCCTAGTGATGGAGTTGGC \
-ltr3 CACTCCCTCTCTGCGCGCTCGCTCGCTCACTGAGGCCGGGCGACCAAAGGTCGCCCGACGCCCGGGCTTTGCCCGGGCGGCCTCAGTGAGCGAGCGAGCGCGCAGAGAGGGAGTGGCCAA \
-ltr3_alt CACTCCCTCTCTGCGCGCTCGCTCGCTCACTGAGGCGCCCGGGCGAAACGCCCGGGCTGGTCGCCCGTTTCGGGCGACCGCCTCAGTGAGCGAGCGAGCGCGCAGAGAGGGAGTGGCCAA \
-linker5 ATGAGCATTC \
-linker3 TACACGATTAC \
-nm truncated_example \
-nthr 4 \
-bt2_idx_dir /Users/gbedwell/Documents/github/T2T_genome/indexes/bowtie2 \
-bt2_idx_name hs1 \
-v_idx_dir /Users/gbedwell/Documents/github/virus_genomes/HIV-1 \
-v_idx_name hiv1 \
-linker_umi_offset 0 \
-linker_umi_len 12 \
-seq_sim 1 \
-len_diff 0 \
-umi_diff 0 \
-frag_ratio 1000 \
-min_count 1000 \
-count_fc 1000 \
--reassign_mm \
--ttr > truncated_output.txt
Almost all of the arguments given to intmap are the same as for the fixed-end example. The differences are in the ttr flag, which tells the program to look for truncated terminal repeats, and the inclusion of an ltr3_alt argument that defines an alternative orientation of the integrant/bait search sequence.
It is important to note that when ttr is set, the sequence given in ltr5 serves as an anchor sequence for the integrant/bait. This means that the ltr5 sequence should, in most cases, be known (e.g., the primer binding site used in library construction). Progressively truncated versions of the sequence(s) given in ltr3 and ltr3_alt (if included) are then searched for in the sequenced reads.
An option not included in the command given above is min_ttr_len. This parameter defines the shortest truncation product to look for. This value defaults to 10 nucleotides.
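As a rough illustration of what "progressively truncated versions" means, the Python sketch below enumerates candidate truncation products of a terminal sequence down to a minimum length. It assumes that truncation removes bases from the 3′ end of the sequence (the end furthest from the ltr5 anchor); that assumption, the truncated_variants function, and the shortened example sequence are mine and are not taken from intmap's code.

```python
def truncated_variants(seq: str, min_ttr_len: int = 10) -> list:
    """Return `seq` and its progressively shorter prefixes, longest first.

    Prefixes shorter than `min_ttr_len` are not generated. Truncating from
    the 3' end is an assumption made for this sketch.
    """
    if min_ttr_len < 1:
        raise ValueError("min_ttr_len must be at least 1")
    return [seq[:n] for n in range(len(seq), min_ttr_len - 1, -1)]


# First 20 nucleotides of the ltr3 sequence from the example, for brevity.
ltr3_start = "CACTCCCTCTCTGCGCGCTC"

variants = truncated_variants(ltr3_start, min_ttr_len=10)
for v in variants:
    print(len(v), v)
```

Under these assumptions, a 20-nucleotide sequence with min_ttr_len set to 10 yields 11 candidates, from the full-length sequence down to its first 10 nucleotides. Raising min_ttr_len reduces the number of candidates, and presumably also the chance that a very short truncation product matches by coincidence.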
Multiple analyses
Most integration site experiments include more than one sample. In these cases, it would be convenient to be able to analyze those data in a single command. This can be done using intmap_multi. A brief description of the structure of the required setup file is given in the README in the git repo.
intmap_multi \
-s multi_setup.txt \
-n multi_args > multi_output.txt