What I want to do here is find out if Latent Dirichlet Allocation and similar topic models are suitable for metagenomics data. I want to test if the probability of a k-mer (or genomic ‘word’), or it’s frequency in a genome can be approximated by the frequency of that k-mer in a collection of short reads drawn from that genome. My intuition is that it can, but the quality of the approximation will depend of the read length and the chromosome length of the genome. This is because overlapping regions of the reads could mess up the approximation, if they are not distibuted uniformly throughout the genome. I believe they should be for the most part, but this could break down near the ends of chromosomes (e.g. reads that are too close to the ends will be tossed because they will be too short, and so the reads just to the other side will have less overlaps, meaning the kmers here may be underrepresented). Since the length of genome that are a read’s length from the ends of chromosome are probably rare compared to the full length of the genome, I suspect this won’t be a big issue.
If we can approximate the k-mer frequencies of a source genome from its reads, then we can model a mixture of reads from many genomes as a mixture model where the frequency of a k-mer is equal to the sum of the frequencies of that k-mer in each genome times the frequency of each genome in that particular mixture. This is exactly what is modelled with a Latent Dirichlet Allocation model, only it is usually used to model documents as a mixture of topics (e.g. probability of a word in a document is equal to probability in each topic, times the probability of each topic being in that document). Documents are sites, topics are genomes, and words are k-mers.
Load the necessary project package:
if (!require("LDAmetagenomics", character.only = TRUE)) devtools::install_github("rdinnager/LDA-metagenomics")
library(LDAmetagenomics)
set.seed(20140728)
First I am going to simulate a set of genomes using the software ALF (through my package Ralf
):
library(magrittr)
if (!require("LDAmetagenomics", character.only = TRUE)) devtools::install_github("rdinnager/Ralf")
library("Ralf")
## generate 50 genomes of 50 genes each
if (!file.exists("temp/genome_sim")) {
genome_sim <- run_ALF("genome_sim", 50, 50, 400, 50, 0.06, 0.025, 1e-04,
"temp", "/home/din02g/ALF_standalone/bin", TRUE) %>% load_ALF
} else {
genome_sim <- "temp/genome_sim" %>% load_ALF
}
Let’s take a look at the simulated tree used to simulate those genomes, to see if everything looks okay.
plot(genome_sim$tree, cex = 0.65, no.margin = TRUE)
The next step it to concatenate the 50 independently simulated genes into one long genomic sequence. We will assume this constitutes a single genome unit, such as a chromosome.
genome_chrom <- ALF_cat(genome_sim)
## convert dna to DNAStringSet for analysis with Biostrings
genome_dna <- DNAStringSet(sapply(genome_chrom$dna, function(x) unlist(x)))
genome_dna
A DNAStringSet instance of length 50
width seq names
[1] 67500 GAGGAGACAAGACACAAGAC...TTCCGCGATATACTGCGCAC SE001
[2] 67200 GAGCCTACAAGCCACAAGAA...TTCTGCGATATACTGCGCAC SE002
[3] 67701 GAGGCTTCACAACATAAGAC...TTCTGCGATATACTGCGCAC SE003
[4] 67281 GAGGCAACCAATCACAAGAA...TTCTGCGATATACTGCGCAC SE004
[5] 67311 GAGTATACATCTCACAAGAA...TTCTGCGATATACTGCGCAC SE005
... ... ...
[46] 67461 GAGCCAACAAGCCACAAGAC...TTCTGCGATATACTGCGCAC SE046
[47] 67200 GAGGAGACACGACACAAGAC...TTCTGCGATATACTGCGCAC SE047
[48] 66723 GAGGAGACAAGGCACAAGGC...TTCTGCGATATACTGCGCAC SE048
[49] 67443 GATCCAACAAGCCACAAGAC...TTCTGCGATATACTGCGCAC SE049
[50] 67446 GATCCAACAAGCCACAAGAC...TTCTGCGATATACTGCGCAC SE050
Okay, now that we have about 50 genomes to play with, we are going to start with just one. We will simulate drawing reads from this genome. For the moment, we will assume that the reads have no errors in them, and they are all 90 base-pairs long (approximately the read-length of Illumina NGS). This uses the gen_segs()
function from the project package.
working_seq <- genome_dna %>% sample(1)
reads <- gen_segs(working_seq, len = 90, num = 1000)
reads
A DNAStringSet instance of length 1000
width seq
[1] 90 TTCTTCTGCGATAAATTACTGTGTGATGAA...CTATACCACTGATGCCCCCAGCAAAATTC
[2] 90 GATTCGAAGACAAAGGCGTGGGCTGCCAGT...CAAGAATTGTGATGCCTGGGGGAGAATGG
[3] 90 GGCACCTTACCTGAACTCATGCCGCAAAAA...CAGAAAAGGGTTTGTAAGCCACCTCTGGA
[4] 90 TGCTATCTATCAGACCGCTACGCTCGGTCC...CACTACGAGACAGTGCAGAAAAAGTGTTT
[5] 90 TTCCTGAGAGCAACCGAGTCTTCATTGGAC...GGTCAATCGAAGGTGATCATTCCGAGAAA
... ... ...
[996] 90 CAATCGACCAGGCTAGAATTTATGGGCAGG...ACCCTCTGAAAACTACCCTGTGGTGGGAC
[997] 90 CTTATCTGGTGAAAAATATTCTCCGGCTCG...GGTACCAAATAGTGTTTACAGCAGAAGGA
[998] 90 ACGGACCCTGCTCTCGCCGATGAGTTTCCA...TGGATGAGCCACTCGTTCGGCAGTTTTCA
[999] 90 ACGCAGAATCCTAAAGGAAGTGCCGAATTC...AGTCCATCTACTTTGAGCATGAGGGACTT
[1000] 90 TTTCAGGACAATAGCAATGAAGGATCCGGC...TAGGCGAAGAACCGCTTTCTTATGAAACT
We can count the frequency of different genomic words of a given length \(K\), using the oligonucleotideFrequency()
function in Biostrings
. To keep memory usage down, we will only look at small \(K\) to begin with. To look at bigger \(K\) we will have to come up with something more memory efficient than storing the frequency of all possible words!
genome_kmer_counts <- oligonucleotideFrequency(working_seq, width = 5, as.prob = TRUE)
genome_kmer_counts[, 50:60]
AATAC AATAG AATAT AATCA AATCC AATCG AATCT
0.0009094 0.0004771 0.0007454 0.0012374 0.0009840 0.0005963 0.0010585
AATGA AATGC AATGG AATGT
0.0011778 0.0010138 0.0011480 0.0009392
## the frequencies in the short reads
reads_kmer_counts <- oligonucleotideFrequency(reads, width = 5, as.prob = TRUE)
head(reads_kmer_counts[, 50:60])
AATAC AATAG AATAT AATCA AATCC AATCG AATCT AATGA AATGC AATGG
[1,] 0 0.01163 0 0.00000 0.01163 0.00000 0 0 0 0.00000
[2,] 0 0.00000 0 0.00000 0.00000 0.00000 0 0 0 0.01163
[3,] 0 0.00000 0 0.00000 0.00000 0.00000 0 0 0 0.00000
[4,] 0 0.00000 0 0.00000 0.00000 0.00000 0 0 0 0.00000
[5,] 0 0.00000 0 0.00000 0.00000 0.01163 0 0 0 0.00000
[6,] 0 0.00000 0 0.01163 0.00000 0.01163 0 0 0 0.00000
AATGT
[1,] 0
[2,] 0
[3,] 0
[4,] 0
[5,] 0
[6,] 0
The expected frequency of a k-mer in the genome should be approximated by the mean frequency of the k-mer in the short reads drawn from the genome. Let’s see if this holds up.
reads_kmers <- apply(reads_kmer_counts, 2, mean)
library(ggplot2)
kmer_dat <- data.frame(Genome = as.vector(genome_kmer_counts), Reads = reads_kmers) %>%
tbl_df
kmer_dat
Source: local data frame [1,024 x 2]
Genome Reads
AAAAA 0.0023704 0.002581
AAAAC 0.0015356 0.001395
AAAAG 0.0020872 0.002174
AAAAT 0.0014014 0.001500
AAACA 0.0017443 0.001605
AAACC 0.0013716 0.001360
AAACG 0.0007901 0.000686
AAACT 0.0011330 0.001186
AAAGA 0.0022810 0.002209
AAAGC 0.0016996 0.001733
.. ... ...
p <- ggplot(kmer_dat, aes(Genome, Reads)) + geom_point() + stat_function(fun = function(x) x,
colour = "white")
p
The white line is the 1:1 line, so it looks like the mean frequency of a k-mer in the reads is not a bad (unbiased) estimator of its frequency in the source genome. There is some error though. Let’s see if we can reduce the error by increasing the number of reads (in other words, increase the read “coverage”).
reads10000 <- gen_segs(working_seq, len = 90, num = 10000)
reads_kmer_counts10000 <- oligonucleotideFrequency(reads10000, width = 5, as.prob = TRUE)
reads_kmers10000 <- apply(reads_kmer_counts10000, 2, mean)
kmer_dat10000 <- data.frame(Genome = as.vector(genome_kmer_counts), Reads = reads_kmers10000) %>%
tbl_df
p <- ggplot(kmer_dat10000, aes(Genome, Reads)) + geom_point() + stat_function(fun = function(x) x,
colour = "white")
p
Indeed, this looks much better. There is probably a way to analytically determine the error distribution as a function of number of reads and read length, but that is beyond my meager math skills at the moment. This would be very useful down the line in this project, so I will do some thinking on it. If anyone happens to be reading this, and know of any mathematical work on this type of question, please let me know!
The next step will be to simulate “communities” of genomes by shuffling genomes together into groups. Then for each group or community, simulate drawing short reads from the genome, where the chance of a read being drawn from a genome is proportional to its relative abundance in the sample. Then we can see if the frequency of k-mers in the reads effectively approximates the frequency we would expect based on the abundances of the source genomes in each community. First, let’s create a random community from the 50 genomes.
library(MCMCpack)
read_sample <- 10000 ## number of reads total
gen_props <- rdirichlet(1, rep(0.2, 50)) %>% as.vector ## genome proportions
gen_props
[1] 2.570e-05 2.050e-03 2.386e-06 1.168e-02 3.481e-02 6.482e-02 1.280e-06
[8] 2.747e-02 6.997e-02 3.941e-03 5.443e-03 5.141e-03 2.651e-02 1.069e-01
[15] 3.670e-05 5.031e-02 8.158e-02 4.573e-03 2.354e-03 3.898e-04 8.697e-03
[22] 6.001e-03 5.143e-09 3.195e-03 6.775e-05 8.880e-09 2.104e-03 2.006e-05
[29] 4.826e-05 2.824e-03 1.211e-01 1.704e-02 3.368e-03 6.660e-08 6.069e-02
[36] 2.582e-07 4.993e-03 1.092e-04 1.327e-01 4.679e-04 5.409e-10 5.856e-03
[43] 1.227e-11 6.664e-02 1.936e-05 4.828e-08 8.069e-03 2.706e-05 5.796e-02
[50] 7.126e-06
gen_reads <- gen_props %>% rmultinom(1, read_sample, .) %>% as.vector ## number of reads to draw from each genome (assuming probability of sampling a read from a genome is equal to that genomes frequency
gen_reads
[1] 0 17 0 102 365 632 0 254 689 29 48 45 267 1095
[15] 0 512 826 45 28 5 81 68 0 27 0 0 15 0
[29] 0 21 1177 169 28 0 613 0 41 2 1347 5 0 52
[43] 0 700 0 0 83 0 612 0
So, given that the genomes are present in proportion gen_props
, the expected k-mer frequencies should just be the mean of the frequencies of the k-mer in each genome, weighted by the proportion of the genomes in the sample.
e_freq <- (oligonucleotideFrequency(genome_dna, width = 5) * gen_props) %>%
apply(2, mean) ## this takes advantage of R's column-wise multiplication
e_freq <- e_freq/sum(e_freq)
e_freq[1:50]
AAAAA AAAAC AAAAG AAAAT AAACA AAACC AAACG
0.0021957 0.0014241 0.0019685 0.0013932 0.0015321 0.0012461 0.0008937
AAACT AAAGA AAAGC AAAGG AAAGT AAATA AAATC
0.0013215 0.0021163 0.0016355 0.0014999 0.0012978 0.0008946 0.0011653
AAATG AAATT AACAA AACAC AACAG AACAT AACCA
0.0012016 0.0012103 0.0012362 0.0010899 0.0017053 0.0010574 0.0012485
AACCC AACCG AACCT AACGA AACGC AACGG AACGT
0.0011092 0.0007295 0.0010600 0.0010966 0.0007687 0.0009192 0.0006986
AACTA AACTC AACTG AACTT AAGAA AAGAC AAGAG
0.0006071 0.0010928 0.0015891 0.0009377 0.0020713 0.0014248 0.0018206
AAGAT AAGCA AAGCC AAGCG AAGCT AAGGA AAGGC
0.0014193 0.0014978 0.0013178 0.0009096 0.0015677 0.0019941 0.0012936
AAGGG AAGGT AAGTA AAGTC AAGTG AAGTT AATAA
0.0011695 0.0009993 0.0006491 0.0009721 0.0014772 0.0010669 0.0007675
AATAC
0.0009110
Now let’s see what we get from analyzing the k-mer frequencies from reads drawn from these communities.
comm_reads <- mapply(gen_segs, as.character(genome_dna), 90, gen_reads) %>%
DNAStringSetList %>% unlist
comm_read_kmers <- oligonucleotideFrequency(comm_reads, width = 5, as.prob = TRUE)
comm_read_mean <- apply(comm_read_kmers, 2, mean)
kmer_dat_comm <- data.frame(Genome = as.vector(e_freq), Reads = comm_read_mean) %>%
tbl_df
p <- ggplot(kmer_dat_comm, aes(Genome, Reads)) + geom_point() + stat_function(fun = function(x) x,
colour = "white")
p
The trouble with this is, if the genomes are all very similar, then it won’t matter which genomes are prevailent in the mixture, the k-mer count will be pretty similar no matter what/ Let’s try comparing the expected k-mer frequencies with read k-mer frequencies generated from a random different community.
gen_props_new <- rdirichlet(1, rep(0.2, 50)) %>% as.vector
gen_props_new
[1] 5.051e-05 2.293e-05 1.187e-01 1.027e-03 2.247e-03 2.599e-02 7.563e-02
[8] 4.638e-05 1.748e-07 1.629e-02 1.201e-03 4.478e-04 5.542e-02 2.472e-06
[15] 8.654e-09 2.061e-03 1.808e-02 2.114e-02 2.983e-03 1.595e-05 1.277e-03
[22] 7.426e-06 1.355e-01 1.023e-02 4.307e-21 3.451e-04 3.104e-02 3.552e-05
[29] 5.083e-05 1.930e-04 1.204e-02 2.857e-03 7.444e-09 2.298e-15 1.695e-02
[36] 1.203e-08 1.493e-06 5.071e-02 4.768e-02 2.538e-03 2.171e-02 4.261e-04
[43] 3.592e-02 3.180e-06 1.808e-01 6.537e-02 1.217e-03 7.009e-08 3.983e-03
[50] 3.775e-02
gen_reads_new <- gen_props_new %>% rmultinom(1, read_sample, .) %>% as.vector
comm_reads_new <- mapply(gen_segs, as.character(genome_dna), 90, gen_reads_new) %>%
DNAStringSetList %>% unlist
comm_read_kmers_new <- oligonucleotideFrequency(comm_reads_new, width = 5, as.prob = TRUE)
comm_read_mean_new <- apply(comm_read_kmers_new, 2, mean)
kmer_dat_comm_new <- data.frame(Genome = as.vector(e_freq), Reads = comm_read_mean_new) %>%
tbl_df
p <- ggplot(kmer_dat_comm_new, aes(Genome, Reads)) + geom_point() + stat_function(fun = function(x) x,
colour = "white")
p
So it looks like the assumptions that in my mind would be necessary for Latent Dirichlet Allocation models to be suitable for metagenomics read data are satisfied more or less. A model of the form
should be applicable. The reads are at least an unbiased estimator of the quantity being modelled, although the model, if left unmodified will not characterize the uncertainty correctly. This is a problem to be dealt with later. In any event, we are ready to try LDA on some simulated data, and see what we get. There will be a number of challenges to solve. In particular, there will be a great amount of redundancy in metagenomics data, which might affect the ability of LDA to settle on a reasonable model. Additionally, we have to choose a word-length \(K\) to convert genomes into a set of (overlapping) words. Out of curiousity, I wonder how the above analyses are effected: