Generating random genome fragments in R
In the analysis of genomic data, a theoretical random distribution is often used as a reference point for feature enrichment or depletion. In principle, deriving the expected random distribution is not difficult. Given the genome size, $L_G$, the probability of integration into a particular region $i$ is simply $p_i = L_i/L_G$, where $L_i$ is the length of $i$. The number of integration sites $I_i$ observed in region $i$ out of $n$ total sites therefore follows a binomial distribution, $I_i \sim \mathrm{Bin}(n, p_i)$, with expected value $n p_i$.
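As a quick illustration of this model (the region and genome lengths below are made-up example values, not measurements):

```r
# Toy illustration of the binomial model for integration counts.
L_G <- 3.1e9   # genome length (bp); arbitrary example value
L_i <- 2.5e6   # length of region i (bp); arbitrary example value
n   <- 1e5     # total number of integration sites

p_i      <- L_i / L_G   # probability that a single site lands in region i
expected <- n * p_i     # E[I_i] under I_i ~ Bin(n, p_i)

# Simulate one realization of the count directly from the binomial:
set.seed(1)
I_i <- rbinom(n = 1, size = n, prob = p_i)
```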
In reality, however, the entire genome is rarely accessible by common next-generation sequencing technologies. While long-read sequencing is growing in both accessibility and throughput, it is still common for genomic experiments to be done using, e.g., 150 bp paired-end sequencing. Because of this, certain regions of the genome, such as highly repetitive regions, are effectively “invisible” due to the inability to confidently align short reads to them. In addition, different genome fragmentation strategies may influence the mappable regions of a genome in a given experiment. For example, when using a restriction enzyme (or a cocktail of restriction enzymes) to fragment the genome, regions deficient in the target recognition sequence(s) will have comparatively fewer mapped reads than other regions.
To accurately calculate $p_i$ for a given set of sequencing conditions, one really needs to know the effective genome length, $L_{G_{eff}}$. Because mappable regions of the genome can depend on things like fragmentation strategy, it is convenient to derive the mappable genome using simulated genomic fragments that match both the genome fragmentation method used and the sequencing approach.
This idea is not new. People doing integration site analyses have been using simulated random datasets for many years in one way or another. However, many of the scripts in use (at least those that I have seen and used) are rather slow and somewhat inflexible. After spending a substantial amount of time adapting and eventually wholly rewriting some existing Python scripts for random fragment generation, I decided to develop an R approach that would slot seamlessly into the Bioconductor framework. I’ve wrapped those functions up in the xInt package, which is still under development (as of 12/2022). This aspect of the package, however, is functional. I’ll briefly explain the workflow below.
Load packages
To get started, install xInt if it isn’t already.
devtools::install_github(repo = "gbedwell/xInt")
Next, load xInt and the BSgenome object that corresponds to the genome of interest. In this example, I’ll be using the new CHM13v2 T2T human genome build.
library(xInt)
library(BSgenome.Hsapiens.NCBI.T2T.CHM13v2.0)
Restriction enzyme cut positions
If fragmenting the genome by restriction digestion, the first step in generating random genome fragments is to extract the sequences of each chromosome.
chr.seqs <- get_chromosome_seqs( genome.obj = BSgenome.Hsapiens.NCBI.T2T.CHM13v2.0 )
The output of get_chromosome_seqs() is a list of DNAString objects, one for each chromosome in the target genome.
#> $`1`
#> 248387328-letter DNAString object
#> seq: CACCCTAAACCCTAACCCCTAACCCTAACCCTAACC...AGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTT
#>
#> $`2`
#> 242696752-letter DNAString object
#> seq: TAACCCTAACCCTAACCCTAACCCTAACCCTAACCC...TAGGGTTAGGGTTTAGGGGTTTAGGGTTAGGGTTAG
Next, you can use the digest() function to identify all of the possible fragmentation positions on both the forward and reverse strands of each chromosome. You must supply the recognition sequences of the enzymes being used as a character vector.
re.cuts <- digest( string.list = chr.seqs,
                   re.sites = c( "TTAA", "AGATCT" ) )
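Conceptually, finding cut positions amounts to pattern matching along each chromosome. A minimal base-R sketch of the idea on a toy sequence (an illustration only, not the actual digest() implementation, which operates on DNAString objects):

```r
# Toy chromosome sequence; the TTAA recognition site occurs at positions 4 and 13.
chr  <- "AGCTTAAGGCCGTTAAGG"
cuts <- as.integer(gregexpr("TTAA", chr)[[1]])
cuts
# Both TTAA and AGATCT are palindromic, so their reverse-strand matches
# coincide with the forward-strand matches. For non-palindromic sites, the
# reverse complement of the recognition sequence would be searched as well.
```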
Random integration site positions
Now you want to generate random positions-of-interest throughout the genome. These positions could represent integration site locations, binding site locations, etc. This step is the same regardless of the fragmentation method used. First, define the number of sites that you want to generate. For good estimation of the mappable genome, this number should be rather high (e.g. $10^8-10^9$ sites). If you just want a representative random dataset, however, this number can be much smaller (e.g. $10^4-10^5$ sites). For example purposes, I’ll generate $10^5$ sites.
rand.sites <- random_sites( n.sites = 1E5,
genome.obj = BSgenome.Hsapiens.NCBI.T2T.CHM13v2.0 )
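Under uniform random integration, each chromosome should receive sites in proportion to its length. A base-R sketch of that sampling scheme (hypothetical chromosome lengths; not necessarily how random_sites() is implemented internally):

```r
set.seed(42)
# Hypothetical chromosome lengths (bp):
chr.lengths <- c(chr1 = 248387328, chr2 = 242696752, chr3 = 201105948)
n.sites <- 5

# Pick a chromosome for each site with probability proportional to length,
# then pick a uniform position within that chromosome:
chrs <- sample(names(chr.lengths), size = n.sites,
               replace = TRUE, prob = chr.lengths / sum(chr.lengths))
pos  <- sapply(chrs, function(ch) sample.int(chr.lengths[[ch]], size = 1))

data.frame(chromosome = chrs, position = pos)
```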
Generating genomic fragments
You now want to make genomic fragments that mimic the genomic fragments obtained in your experiment. For restriction digestion, you want to use the random sites and the restriction enzyme cut positions generated with the digest() function to generate the fragments. Be sure to set random = FALSE in make_fragments(). Fragment coordinates are defined from the random position-of-interest to the nearest downstream restriction enzyme cut position.
re.fragments <- make_fragments( insert.sites = rand.sites,
frag.sites = re.cuts,
random = FALSE,
genome.obj = BSgenome.Hsapiens.NCBI.T2T.CHM13v2.0 )
For random fragmentation, you want to set frag.sites = NULL and random = TRUE in make_fragments(). You also want to set the mean ± sd fragment lengths (default values of 500 bp and 250 bp, respectively). Fragment lengths are randomly sampled from a log-normal distribution with the defined parameters, and fragment end positions are calculated relative to each simulated random integration site.
rand.fragments <- make_fragments( insert.sites = rand.sites,
frag.sites = NULL,
random = TRUE,
mean = 500,
sd = 250,
genome.obj = BSgenome.Hsapiens.NCBI.T2T.CHM13v2.0 )
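One subtlety of log-normal sampling is that R’s rlnorm() is parameterized by the mean and sd of the distribution on the log scale, so a natural-scale mean of 500 bp and sd of 250 bp must be converted first. A base-R sketch of that conversion and the resulting length draws (my illustration of the idea, not necessarily the exact make_fragments() internals):

```r
set.seed(7)
m <- 500   # desired mean fragment length (bp)
s <- 250   # desired sd of fragment lengths (bp)

# Convert the natural-scale mean/sd to log-normal meanlog/sdlog:
sdlog   <- sqrt(log(1 + (s / m)^2))
meanlog <- log(m) - sdlog^2 / 2

lens <- round(rlnorm(1e5, meanlog = meanlog, sdlog = sdlog))
c(mean = mean(lens), sd = sd(lens))  # approximately 500 and 250
```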
Extracting and trimming fragment sequences
I rely on BSgenome::getSeq() to extract the sequences corresponding to the fragment positions. These sequences are intended to mimic the genomic fragments sequenced in an actual experiment.
frag.seqs <- getSeq( x = BSgenome.Hsapiens.NCBI.T2T.CHM13v2.0,
names = rand.fragments,
as.character = FALSE )
#> DNAStringSet object of length 100000:
#> width seq
#> [1] 512 AAGGAAGAGAGAGCCGGGGGAGGTGGCGGGC...CCGCCAGGCCTGGGCATCTCCTCTCCTGCAG
#> [2] 448 TCATTGGAATTTTTTCTGTTAGTTAAATAAA...TGCACACCAGATGTGGCCCAATTGTTAATAA
#> [3] 495 TGAGAGGCCCTAGTGTGTGTTGTTCCCCTCC...GAGGCCCTGGTGTGTGTTGTTCCCCTCCATG
#> [4] 434 GTGCAAGCCTCAAGCTTTGGCAGCTTCCATG...ACCATGGGAACCCACCACTTGCATGAGTATG
#> [5] 759 CCAGCCTGGGTGACAGAGTGAGACCCTGTCT...GCTCAAACTTCAGCATTCCGAGTAGCTGGGA
#> ... ... ...
#> [99996] 332 CAGTGTGACTATATTTGGAGACAAGGCCTTT...ATACCTGGTTTATATCATAGTCCCTTTCCCT
#> [99997] 326 TGTTCTTCTGATCTTCTGGTACCTGCCTTCC...TGGGAAGCAACCCAGTATCTTTGTCTATCTT
#> [99998] 281 GATTTTGGAAATTAAAACTGAAAGAGAGCCT...CATTCATAAAAACTAGAAACACAGTTAAAAG
#> [99999] 760 AGAAACGGGATTTCACCATGTTGCCCAGGGT...AGACTGTGTGTTCTGTTATATTTCTCTGGAG
#> [100000] 904 CCACTGACATGACTTTCCAAAAAACACATAA...CAAACCCCCTGAAGCTTCACCGGCGCAGTCA
To mimic the sequencing output, however, the genomic fragments must be trimmed. The function trim_seqs() will trim the ends of each fragment to the maximum desired read length and return the paired reads as matched entries in one of two list elements: the first list element corresponds to forward reads and the second to reverse reads. The reverse reads are returned in 5’ to 3’ orientation. In addition to defining the maximum read length, trim_seqs() has options to define the minimum fragment width (default 14 bp) and the maximum inner distance between pairs (default 1000 bp). These filtering parameters will decrease the number of output read pairs, so don’t be alarmed if the lengths of the trim_seqs() list elements are less than the number of generated sites.
frag.trim <- trim_seqs( fragments = frag.seqs,
min.width = 14,
max.distance = 1000,
max.bp = 150 )
#> [[1]]
#> DNAStringSet object of length 98797:
#> width seq names
#> [1] 150 AAGGAAGAGAGAGCCGGGGGA...CCTGAGCCCACCCTTCGCGGC sequence_1
#> [2] 150 TCATTGGAATTTTTTCTGTTA...GCAAACAAGGATCTTCTATCC sequence_2
#> [3] 150 TGAGAGGCCCTAGTGTGTGTT...ATGGTCTCCTACCCCCTGTCC sequence_3
#> [4] 150 GTGCAAGCCTCAAGCTTTGGC...AGAAGTTTGCTGCAGGGGTGA sequence_4
#> [5] 150 CCAGCCTGGGTGACAGAGTGA...CTTACCAACTCTGTCTTCAGT sequence_5
#> ... ... ...
#> [98793] 150 CAGTGTGACTATATTTGGAGA...CCACCTGAGAACACAGCAAGA sequence_98793
#> [98794] 150 TGTTCTTCTGATCTTCTGGTA...CTGGATTCACGCTGATATACA sequence_98794
#> [98795] 150 GATTTTGGAAATTAAAACTGA...ATGGATAAAACTGGAATTTGA sequence_98795
#> [98796] 150 AGAAACGGGATTTCACCATGT...TTAAGATGCACTTTTTGCTTA sequence_98796
#> [98797] 150 CCACTGACATGACTTTCCAAA...TTTTCCTCCGACCCCCTAACA sequence_98797
#>
#> [[2]]
#> DNAStringSet object of length 98797:
#> width seq names
#> [1] 150 CTGCAGGAGAGGAGATGCCCA...GGCGGCGCTGCAGGAGAGGAG sequence_1
#> [2] 150 TTATTAACAATTGGGCCACAT...AGAGGATGCCTATCAGACTAA sequence_2
#> [3] 150 CATGGAGGGGAACAACACACA...GACCATCAAGACAAACACGTG sequence_3
#> [4] 150 CATACTCATGCAAGTGGTGGG...CAAGCTGTCGGTGATCTACCA sequence_4
#> [5] 150 TCCCAGCTACTCGGAATGCTG...ATTTACAGACAGAAAAGAAAT sequence_5
#> ... ... ...
#> [98793] 150 AGGGAAAGGGACTATGATATA...CGCTAATTCAGTTCTTGATGA sequence_98793
#> [98794] 150 AAGATAGACAAAGATACTGGG...CTTCTCAAGTATAATAAACAG sequence_98794
#> [98795] 150 CTTTTAACTGTGTTTCTAGTT...TTTCAAATTCCAGTTTTATCC sequence_98795
#> [98796] 150 CTCCAGAGAAATATAACAGAA...TCAATGAAGATTGACTTCCAG sequence_98796
#> [98797] 150 TGACTGCGCCGGTGAAGCTTC...TAATTATGCCTCATAGGGATA sequence_98797
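The trimming logic itself can be pictured as taking a read’s worth of sequence from each end of the fragment: R1 from the 5’ end as-is, and R2 as the reverse complement of the 3’ end. A toy base-R sketch using plain character strings (trim_seqs() operates on DNAStringSet objects and also applies the width/distance filters):

```r
# Reverse complement for a plain character string:
revcomp <- function(x) {
  comp <- chartr("ACGT", "TGCA", x)
  paste(rev(strsplit(comp, "")[[1]]), collapse = "")
}

frag   <- paste(rep("ACGT", 100), collapse = "")  # a 400 bp toy fragment
max.bp <- 150

r1 <- substr(frag, 1, max.bp)                        # forward read: 5' end
r2 <- revcomp(substr(frag, nchar(frag) - max.bp + 1, # reverse read: 3' end,
                     nchar(frag)))                   # in 5'-to-3' orientation
```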
Generating fasta files
Finally, now that your pairs have been generated, you can save the pairs as R1 and R2 fasta files for genome alignment. The compress option in save_fasta() will compress the fasta files to save space. Given the input parameters below, the files will be saved as path/to/directory/prefix_R1.fa(.gz) and path/to/directory/prefix_R2.fa(.gz), respectively.
save_fasta( reads = frag.trim,
directory.path = "path/to/directory",
file.prefix = "prefix",
compress = TRUE )
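For reference, fasta output with gzip compression is straightforward even in base R. A sketch of what writing one read file by hand could look like (toy read names and sequences; save_fasta() handles the R1/R2 naming and pairing for you):

```r
# Toy reads; in practice these would come from the trim_seqs() output.
reads <- c(sequence_1 = "ACGTACGT", sequence_2 = "GGCCTTAA")

# Interleave ">name" headers with sequences and write through a gzip connection:
fa  <- paste0(">", names(reads), "\n", reads)
out <- file.path(tempdir(), "prefix_R1.fa.gz")
con <- gzfile(out, open = "w")
writeLines(fa, con)
close(con)
```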
Conclusions
On my 2 GHz quad-core laptop with 16 GB RAM, this entire process ran in just over 2 minutes. Obviously, as you increase the number of random sites generated, this run time will increase. For generating extremely large datasets, I would recommend moving this to an HPC cluster. Regardless of where you run this, I hope I’ve demonstrated how easy it can be to generate random genome fragments in a way that should integrate seamlessly with your other Bioconductor workflows!
Addendum: 08/01/2023
A wrapper function, simulate_random_data(), has been included in the package that automatically calls all of these functions. There are some convenient additions to this function that are not part of the other functions, including the ability to simultaneously generate multiple random datasets. Concatenating multiple smaller random datasets together is a convenient and reasonably quick way to generate extremely large random datasets. The general syntax of this function is below. The source code with argument descriptions can be found on GitHub.
simulate_random_data( genome.obj,
re.sites = NULL,
cut.after = 1,
n.sites,
mean,
sd,
min.width = 14,
max.distance = 1000,
max.bp = 150,
iterations = 1,
n.cores = 1,
write.ranges = FALSE,
prefix = NULL,
directory.path = NULL,
compress = TRUE,
collapse = TRUE )
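For example, a call reproducing the random-fragmentation workflow above in one step might look like the following (argument values mirror those used earlier; the iterations and n.cores values are illustrative, and the GitHub documentation is the authoritative reference for defaults):

```r
simulate_random_data( genome.obj = BSgenome.Hsapiens.NCBI.T2T.CHM13v2.0,
                      re.sites = NULL,       # NULL for random fragmentation
                      n.sites = 1E5,
                      mean = 500,
                      sd = 250,
                      max.bp = 150,
                      iterations = 10,       # generate 10 datasets to concatenate
                      n.cores = 4,
                      prefix = "random",
                      directory.path = "path/to/directory",
                      compress = TRUE )
```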