
What is intmap?

intmap is a Python package that provides several CLI modules for mapping the locations of genomic integration and translocation from NGS data. Such NGS data are almost invariably generated by either ligation-mediated (LM-)PCR or linear amplification-mediated (LAM-)PCR. While these PCR approaches are technically different, their final products are essentially the same: R1 NGS adapter – Integrant/Bait | Target/Prey – R2 NGS adapter. For both integration and translocation experiments, the integrant/bait sequences are known. The target/prey sequences, on the other hand, are not. The junction between the integrant/bait and target/prey sequences represents the position we are interested in mapping. This position represents the location in the genome that was targeted and cleaved by a viral integrase, a transposase, a CRISPR/Cas NPC, etc.

Until publication (in preparation), I’ll refrain from describing the inner workings of intmap in too much detail here. What follows are short demonstrations of intmap usage, intended to help users get started with the software.

The intmap repository can be found at https://github.com/gbedwell/intmap.

Installation

The following is the easiest way to install intmap. This will set up an intmap conda environment and install all dependencies within that environment.

git clone https://github.com/gbedwell/intmap.git
cd intmap
bash install.sh

The install script requires an initialized conda instance. If conda is not already installed, follow the official conda installation instructions first. OpenMP is also required to fully leverage FAISS’ parallelization capabilities (intmap will still run fine without it, just maybe a little slower). On Linux systems, OpenMP should be installed natively. For macOS, OpenMP can be installed using Homebrew:

brew install llvm libomp

For installation on macOS with OpenMP, replace bash install.sh above with bash install.sh --use-openmp.
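As a rough pre-flight sketch, the snippet below checks that conda is on the PATH and prints which install command applies. The helper name install_cmd and its two arguments are hypothetical, made up for this illustration. Only the --use-openmp flag comes from the instructions above, and since that flag is only described for macOS, the sketch leaves every other platform on the plain command.

```shell
# Hypothetical helper (not part of intmap): choose the install command from
# the OS name (as reported by `uname -s`) and whether OpenMP was installed.
install_cmd() {
  os="$1"
  has_openmp="$2"
  if [ "$os" = "Darwin" ] && [ "$has_openmp" = "yes" ]; then
    echo "bash install.sh --use-openmp"
  else
    echo "bash install.sh"
  fi
}

# The install script needs an initialized conda instance, so check for it first.
if command -v conda >/dev/null 2>&1; then
  echo "conda found"
else
  echo "conda not found; install it before running install.sh"
fi

# Print the command for this machine, assuming OpenMP has not been set up.
install_cmd "$(uname -s)" "no"
```

Nothing here runs the installer; it only prints the command so it can be inspected first.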

Data

To run the following examples, you will need:

  1. The example_data directory from the GitHub repository.
  2. A Bowtie2 index of the human genome. I’m using an indexed version of hs1 (the UCSC name for the T2T-CHM13v2.0 assembly). The genome fasta file can be downloaded from the UCSC Genome Browser. If you use a different genome build, the results might be slightly different, but everything should still run.
  3. A dummy virus/vector index. An HIV-1 genome index can be found in the example_data directory of the git repository. For the example data, what this index is doesn’t strictly matter, as there are no virus- or vector-derived contaminating sequences. Nevertheless, this is a required parameter for intmap.
  4. The dummy data. This is provided in the example_data directory of the git repository.
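Before running anything, it can help to confirm that both Bowtie2 indexes are complete. The sketch below relies on the fact that bowtie2-build writes six files per index (with a .bt2 suffix, or .bt2l for large indexes). The helper name has_bt2_index and the /path/to/... directories are placeholders of my own choosing, not anything defined by intmap.

```shell
# An index is built with, e.g.:  bowtie2-build hs1.fa hs1
# (hs1.fa is a placeholder name for the downloaded FASTA file.)

# Hypothetical helper: succeed if directory $1 holds a complete Bowtie2
# index with base name $2.
has_bt2_index() {
  dir="$1"
  name="$2"
  for ext in bt2 bt2l; do
    ok=yes
    for part in 1 2 3 4 rev.1 rev.2; do
      if [ ! -f "$dir/$name.$part.$ext" ]; then
        ok=no
      fi
    done
    if [ "$ok" = yes ]; then
      return 0
    fi
  done
  return 1
}

# Placeholder locations; substitute your own index directories.
for pair in "/path/to/bowtie2:hs1" "/path/to/HIV-1:hiv1"; do
  idx_dir="${pair%%:*}"
  idx_name="${pair##*:}"
  if has_bt2_index "$idx_dir" "$idx_name"; then
    echo "$idx_name index found in $idx_dir"
  else
    echo "$idx_name index missing from $idx_dir"
  fi
done
```

The directory and base name checked here correspond to what is later passed as -bt2_idx_dir/-bt2_idx_name and -v_idx_dir/-v_idx_name.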

Running intmap

Fixed termini

Running intmap to map integration events/translocations in a system with predictable integrant/bait termini looks something like this:

intmap \
  -r1 data/sample1_R1.fq.gz \
  -r2 data/sample1_R2.fq.gz \
  -ltr5 AGTCAGTGTGGA \
  -ltr3 AAATCTCTAGCA \
  -linker5 ATGAGCATTC \
  -linker3 TACACGATTAC \
  -nm fixed_example \
  -bt2_idx_dir /Users/gbedwell/Documents/github/T2T_genome/indexes/bowtie2 \
  -bt2_idx_name hs1 \
  -v_idx_dir /Users/gbedwell/Documents/github/virus_genomes/HIV-1 \
  -v_idx_name hiv1 > fixed_output.txt

All of these arguments are required. They define the data files, the sequences to look for on each read, the sample name, and the Bowtie2 index directories to use for read alignment. ltr5 and ltr3 denote the 5′ and 3′ chunks, respectively, of the integrant/bait search sequence. These are split to facilitate more granular control of error rate: the error rates in each chunk can be defined separately. In the given example, the actual search sequence is AGTCAGTGTGGAAAATCTCTAGCA (ltr5 followed directly by ltr3). linker5 and linker3 operate in an identical way.
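To make the chunking concrete, here is a small sketch that joins the two chunks from the command above and reports the length of each full search sequence. The variable names are mine and are only for illustration; intmap does this joining internally.

```shell
# Chunk values copied from the example command above.
ltr5=AGTCAGTGTGGA
ltr3=AAATCTCTAGCA
linker5=ATGAGCATTC
linker3=TACACGATTAC

# The full search sequence is the 5' chunk followed directly by the 3' chunk.
ltr_search="${ltr5}${ltr3}"
linker_search="${linker5}${linker3}"

echo "LTR search sequence:    $ltr_search (${#ltr_search} nt)"
echo "linker search sequence: $linker_search (${#linker_search} nt)"
```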

In addition to the required arguments, intmap has many optional arguments for fine-tuning analysis parameters. Below, I will briefly describe the optional parameters that users are most likely to want to adjust.

  1. intmap is written to be highly parallelizable. By default, however, intmap will only utilize a single core. To fully leverage intmap’s built-in parallelization capabilities, the number of cores the program uses can be adjusted with nthr.

  2. The example data has 12-nucleotide barcodes on the linker-end of each fragment. We know that these barcodes occur immediately before the sequence defined by linker5. We can therefore include the barcodes in the analysis by setting linker_umi_offset to 0 (telling the program that the UMIs are located immediately after the linker5 sequence) and linker_umi_len to 12 (defining the length of the UMIs).

  3. We know that the example data used here were generated in silico. Many of the experimental artifacts that appear in real-world data are therefore not present in these data. intmap’s default parameters are set to allow for “fuzzy” matching. That is, very similar reads, but not necessarily equivalent reads, are grouped together to mitigate artifactual differences stemming from sample handling/processing, sequencing, etc. For the idealized example data, we can turn these fuzzy matching parameters off. These parameters are: seq_sim (sequence similarity), len_diff (allowable length difference between fragments), umi_diff (allowable UMI distance), frag_ratio (the count ratio between fragments), min_count (the required number of sites at a given position to define that site as “abundant”), and count_fc (the fold-change required between adjacent abundant sites to call them the same). To turn these parameters off, we will set seq_sim to 1, len_diff and umi_diff to 0, and the others to a very large value.

  4. Lastly, we can tell intmap how to handle multimapping reads. The no_mm flag will tell intmap to completely ignore multimapping reads, but we don’t necessarily want to do that here. Instead, we know that the example data might contain clonal fragments, or a high percentage of fragments generated from a single integration/translocation event. In an attempt to identify those clonal sites as completely as possible, we will set the flag reassign_mm. This flag will tell the program to:

    • Compare multimapping fragments to uniquely mapping fragments and reassign any multimapping fragment that has a suitable match with a uniquely mapping fragment to the uniquely mapped position.

    • Cluster multimapping reads, retain groupings that contain more than mm_group_threshold * 100% of all multimapping reads (default 0.002), and reassign all reads in that group to one of the mapped positions.
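Two of the numbers above are easier to reason about with a quick calculation. The sketch below is illustrative only and is not intmap code: mm_group_cutoff shows the group size implied by mm_group_threshold for a given number of multimapping reads, and umi_distance counts mismatches between two equal-length UMIs. Treating the UMI distance as a Hamming distance is my assumption here; the text above does not say which distance metric intmap uses. Both function names are hypothetical.

```shell
# Group size implied by mm_group_threshold (default 0.002, from the text above).
# A group is retained only if it holds more reads than this value.
# Arguments: total multimapping reads, optional threshold.
mm_group_cutoff() {
  awk -v n="$1" -v t="${2:-0.002}" 'BEGIN { print n * t }'
}

# ASSUMPTION: UMI distance modeled as Hamming distance (mismatch count
# between two equal-length UMIs). intmap's actual metric may differ.
umi_distance() {
  a="$1"
  b="$2"
  if [ "${#a}" -ne "${#b}" ]; then
    echo "UMIs must be the same length" >&2
    return 1
  fi
  awk -v a="$a" -v b="$b" 'BEGIN {
    d = 0
    for (i = 1; i <= length(a); i++)
      if (substr(a, i, 1) != substr(b, i, 1)) d++
    print d
  }'
}

# With 50,000 multimapping reads, a group needs more than this many reads.
mm_group_cutoff 50000

# Two 12-nt UMIs that differ at the last position.
umi_distance ACGTACGTACGT ACGTACGTACGA
```

Under this reading, setting umi_diff to 0 means only fragments with identical UMIs can be grouped, which matches the goal of switching fuzzy matching off for the idealized data.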

Incorporating all of these options, the intmap run command will look like this:

intmap \
  -r1 data/sample1_R1.fq.gz \
  -r2 data/sample1_R2.fq.gz \
  -ltr5 AGTCAGTGTGGA \
  -ltr3 AAATCTCTAGCA \
  -linker5 ATGAGCATTC \
  -linker3 TACACGATTAC \
  -nm fixed_example \
  -bt2_idx_dir /Users/gbedwell/Documents/github/T2T_genome/indexes/bowtie2 \
  -bt2_idx_name hs1 \
  -v_idx_dir /Users/gbedwell/Documents/github/virus_genomes/HIV-1 \
  -v_idx_name hiv1 \
  -nthr 4 \
  -linker_umi_offset 0 \
  -linker_umi_len 12 \
  -seq_sim 1 \
  -len_diff 0 \
  -umi_diff 0 \
  -frag_ratio 1000 \
  -min_count 1000 \
  -count_fc 1000 \
  --reassign_mm > fixed_output.txt

Truncated termini

Some systems, like AAV, do not generate predictable ends. Instead, the terminus of the integrant is often a truncated derivative of the full-length terminus from the viral genome. In the case of AAV, the situation is further complicated by the presence of two distinct orientations of the terminal repeat. intmap can easily handle these situations:

intmap \
  -r1 data/sample3_R1.fq.gz \
  -r2 data/sample3_R2.fq.gz \
  -ltr5 AGGAACCCCTAGTGATGGAGTTGGC \
  -ltr3 CACTCCCTCTCTGCGCGCTCGCTCGCTCACTGAGGCCGGGCGACCAAAGGTCGCCCGACGCCCGGGCTTTGCCCGGGCGGCCTCAGTGAGCGAGCGAGCGCGCAGAGAGGGAGTGGCCAA \
  -ltr3_alt CACTCCCTCTCTGCGCGCTCGCTCGCTCACTGAGGCGCCCGGGCGAAACGCCCGGGCTGGTCGCCCGTTTCGGGCGACCGCCTCAGTGAGCGAGCGAGCGCGCAGAGAGGGAGTGGCCAA \
  -linker5 ATGAGCATTC \
  -linker3 TACACGATTAC \
  -nm truncated_example \
  -nthr 4 \
  -bt2_idx_dir /Users/gbedwell/Documents/github/T2T_genome/indexes/bowtie2 \
  -bt2_idx_name hs1 \
  -v_idx_dir /Users/gbedwell/Documents/github/virus_genomes/HIV-1 \
  -v_idx_name hiv1 \
  -linker_umi_offset 0 \
  -linker_umi_len 12 \
  -seq_sim 1 \
  -len_diff 0 \
  -umi_diff 0 \
  -frag_ratio 1000 \
  -min_count 1000 \
  -count_fc 1000 \
  --reassign_mm \
  --ttr > truncated_output.txt

Almost all of the arguments given to intmap are the same as for the fixed-end example. The differences are in the ttr flag, which tells the program to look for truncated terminal repeats, and the inclusion of an ltr3_alt argument that defines an alternative orientation of the integrant/bait search sequence.

It is important to note that when ttr is set, the sequence given in ltr5 serves as an anchor sequence for the integrant/bait. This means that the ltr5 sequence should, in most cases, be known (e.g., the primer binding site used in library construction). Progressively truncated versions of the sequence(s) given in ltr3 and ltr3_alt (if included) are then searched for in the sequenced reads.

An option not included in the command given above is min_ttr_len. This parameter defines the shortest truncation product to look for. This value defaults to 10 nucleotides.
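To picture what "progressively truncated versions" means, here is a minimal sketch that lists every truncation of a sequence down to a minimum length. It assumes truncation removes bases from the 3′ end one at a time; that direction is my assumption, not something stated above, and the function name truncations and the toy sequence are made up for this illustration. intmap's actual search is likely more involved.

```shell
# Hypothetical helper: print a sequence and each progressively shorter
# version of it, stopping at the minimum length (default 10, matching the
# min_ttr_len default described above).
# ASSUMPTION: bases are removed from the 3' end.
truncations() {
  seq="$1"
  min_len="${2:-10}"
  len=${#seq}
  while [ "$len" -ge "$min_len" ]; do
    printf '%s\n' "$seq" | cut -c1-"$len"
    len=$((len - 1))
  done
}

# A toy 14-nt sequence, truncated down to 10 nt.
truncations ACGTACGTACGTAC 10
```

Raising the minimum length shortens the candidate list, at the cost of missing more heavily truncated termini.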

Multiple analyses

Most integration site experiments include more than one sample. In these cases, it would be convenient to be able to analyze those data in a single command. This can be done using intmap_multi. A brief description of the structure of the required setup file is given in the README in the git repo.

intmap_multi \
  -s multi_setup.txt \
  -n multi_args > multi_output.txt
