What I want to do here is find out if Latent Dirichlet Allocation and similar topic models are suitable for metagenomics data. I want to test if the probability of a k-mer (or genomic ‘word’), or it’s frequency in a genome can be approximated by the frequency of that k-mer in a collection of short reads drawn from that genome. My intuition is that it can, but the quality of the approximation will depend of the read length and the chromosome length of the genome. This is because overlapping regions of the reads could mess up the approximation, if they are not distibuted uniformly throughout the genome. I believe they should be for the most part, but this could break down near the ends of chromosomes (e.g. reads that are too close to the ends will be tossed because they will be too short, and so the reads just to the other side will have less overlaps, meaning the kmers here may be underrepresented). Since the length of genome that are a read’s length from the ends of chromosome are probably rare compared to the full length of the genome, I suspect this won’t be a big issue.

If we can approximate the k-mer frequencies of a source genome from its reads, then we can model a mixture of reads from many genomes as a mixture model where the frequency of a k-mer is equal to the sum of the frequencies of that k-mer in each genome times the frequency of each genome in that particular mixture. This is exactly what is modelled with a Latent Dirichlet Allocation model, only it is usually used to model documents as a mixture of topics (e.g. probability of a word in a document is equal to probability in each topic, times the probability of each topic being in that document). Documents are sites, topics are genomes, and words are k-mers.

Load the necessary project package:

if (!require("LDAmetagenomics", character.only = TRUE)) devtools::install_github("rdinnager/LDA-metagenomics")
library(LDAmetagenomics)
set.seed(20140728)

First I am going to simulate a set of genomes using the software ALF (through my package Ralf):

library(magrittr)
if (!require("LDAmetagenomics", character.only = TRUE)) devtools::install_github("rdinnager/Ralf")
library("Ralf")
## generate 50 genomes of 50 genes each
if (!file.exists("temp/genome_sim")) {
    genome_sim <- run_ALF("genome_sim", 50, 50, 400, 50, 0.06, 0.025, 1e-04, 
        "temp", "/home/din02g/ALF_standalone/bin", TRUE) %>% load_ALF
} else {
    genome_sim <- "temp/genome_sim" %>% load_ALF
}

Let’s take a look at the simulated tree used to simulate those genomes, to see if everything looks okay.

plot(genome_sim$tree, cex = 0.65, no.margin = TRUE)

plot of chunk tree_plot

The next step it to concatenate the 50 independently simulated genes into one long genomic sequence. We will assume this constitutes a single genome unit, such as a chromosome.

genome_chrom <- ALF_cat(genome_sim)
## convert dna to DNAStringSet for analysis with Biostrings
genome_dna <- DNAStringSet(sapply(genome_chrom$dna, function(x) unlist(x)))
genome_dna
  A DNAStringSet instance of length 50
     width seq                                         names               
 [1] 67500 GAGGAGACAAGACACAAGAC...TTCCGCGATATACTGCGCAC SE001
 [2] 67200 GAGCCTACAAGCCACAAGAA...TTCTGCGATATACTGCGCAC SE002
 [3] 67701 GAGGCTTCACAACATAAGAC...TTCTGCGATATACTGCGCAC SE003
 [4] 67281 GAGGCAACCAATCACAAGAA...TTCTGCGATATACTGCGCAC SE004
 [5] 67311 GAGTATACATCTCACAAGAA...TTCTGCGATATACTGCGCAC SE005
 ...   ... ...
[46] 67461 GAGCCAACAAGCCACAAGAC...TTCTGCGATATACTGCGCAC SE046
[47] 67200 GAGGAGACACGACACAAGAC...TTCTGCGATATACTGCGCAC SE047
[48] 66723 GAGGAGACAAGGCACAAGGC...TTCTGCGATATACTGCGCAC SE048
[49] 67443 GATCCAACAAGCCACAAGAC...TTCTGCGATATACTGCGCAC SE049
[50] 67446 GATCCAACAAGCCACAAGAC...TTCTGCGATATACTGCGCAC SE050

Okay, now that we have about 50 genomes to play with, we are going to start with just one. We will simulate drawing reads from this genome. For the moment, we will assume that the reads have no errors in them, and they are all 90 base-pairs long (approximately the read-length of Illumina NGS). This uses the gen_segs() function from the project package.

working_seq <- genome_dna %>% sample(1)
reads <- gen_segs(working_seq, len = 90, num = 1000)
reads
  A DNAStringSet instance of length 1000
       width seq
   [1]    90 TTCTTCTGCGATAAATTACTGTGTGATGAA...CTATACCACTGATGCCCCCAGCAAAATTC
   [2]    90 GATTCGAAGACAAAGGCGTGGGCTGCCAGT...CAAGAATTGTGATGCCTGGGGGAGAATGG
   [3]    90 GGCACCTTACCTGAACTCATGCCGCAAAAA...CAGAAAAGGGTTTGTAAGCCACCTCTGGA
   [4]    90 TGCTATCTATCAGACCGCTACGCTCGGTCC...CACTACGAGACAGTGCAGAAAAAGTGTTT
   [5]    90 TTCCTGAGAGCAACCGAGTCTTCATTGGAC...GGTCAATCGAAGGTGATCATTCCGAGAAA
   ...   ... ...
 [996]    90 CAATCGACCAGGCTAGAATTTATGGGCAGG...ACCCTCTGAAAACTACCCTGTGGTGGGAC
 [997]    90 CTTATCTGGTGAAAAATATTCTCCGGCTCG...GGTACCAAATAGTGTTTACAGCAGAAGGA
 [998]    90 ACGGACCCTGCTCTCGCCGATGAGTTTCCA...TGGATGAGCCACTCGTTCGGCAGTTTTCA
 [999]    90 ACGCAGAATCCTAAAGGAAGTGCCGAATTC...AGTCCATCTACTTTGAGCATGAGGGACTT
[1000]    90 TTTCAGGACAATAGCAATGAAGGATCCGGC...TAGGCGAAGAACCGCTTTCTTATGAAACT

We can count the frequency of different genomic words of a given length \(K\), using the oligonucleotideFrequency() function in Biostrings. To keep memory usage down, we will only look at small \(K\) to begin with. To look at bigger \(K\) we will have to come up with something more memory efficient than storing the frequency of all possible words!

genome_kmer_counts <- oligonucleotideFrequency(working_seq, width = 5, as.prob = TRUE)
genome_kmer_counts[, 50:60]
    AATAC     AATAG     AATAT     AATCA     AATCC     AATCG     AATCT 
0.0009094 0.0004771 0.0007454 0.0012374 0.0009840 0.0005963 0.0010585 
    AATGA     AATGC     AATGG     AATGT 
0.0011778 0.0010138 0.0011480 0.0009392 
## the frequencies in the short reads
reads_kmer_counts <- oligonucleotideFrequency(reads, width = 5, as.prob = TRUE)
head(reads_kmer_counts[, 50:60])
     AATAC   AATAG AATAT   AATCA   AATCC   AATCG AATCT AATGA AATGC   AATGG
[1,]     0 0.01163     0 0.00000 0.01163 0.00000     0     0     0 0.00000
[2,]     0 0.00000     0 0.00000 0.00000 0.00000     0     0     0 0.01163
[3,]     0 0.00000     0 0.00000 0.00000 0.00000     0     0     0 0.00000
[4,]     0 0.00000     0 0.00000 0.00000 0.00000     0     0     0 0.00000
[5,]     0 0.00000     0 0.00000 0.00000 0.01163     0     0     0 0.00000
[6,]     0 0.00000     0 0.01163 0.00000 0.01163     0     0     0 0.00000
     AATGT
[1,]     0
[2,]     0
[3,]     0
[4,]     0
[5,]     0
[6,]     0

The expected frequency of a k-mer in the genome should be approximated by the mean frequency of the k-mer in the short reads drawn from the genome. Let’s see if this holds up.

reads_kmers <- apply(reads_kmer_counts, 2, mean)
library(ggplot2)
kmer_dat <- data.frame(Genome = as.vector(genome_kmer_counts), Reads = reads_kmers) %>% 
    tbl_df
kmer_dat
Source: local data frame [1,024 x 2]

         Genome    Reads
AAAAA 0.0023704 0.002581
AAAAC 0.0015356 0.001395
AAAAG 0.0020872 0.002174
AAAAT 0.0014014 0.001500
AAACA 0.0017443 0.001605
AAACC 0.0013716 0.001360
AAACG 0.0007901 0.000686
AAACT 0.0011330 0.001186
AAAGA 0.0022810 0.002209
AAAGC 0.0016996 0.001733
..          ...      ...
p <- ggplot(kmer_dat, aes(Genome, Reads)) + geom_point() + stat_function(fun = function(x) x, 
    colour = "white")
p

plot of chunk test_kmers

The white line is the 1:1 line, so it looks like the mean frequency of a k-mer in the reads is not a bad (unbiased) estimator of its frequency in the source genome. There is some error though. Let’s see if we can reduce the error by increasing the number of reads (in other words, increase the read “coverage”).

reads10000 <- gen_segs(working_seq, len = 90, num = 10000)
reads_kmer_counts10000 <- oligonucleotideFrequency(reads10000, width = 5, as.prob = TRUE)
reads_kmers10000 <- apply(reads_kmer_counts10000, 2, mean)
kmer_dat10000 <- data.frame(Genome = as.vector(genome_kmer_counts), Reads = reads_kmers10000) %>% 
    tbl_df
p <- ggplot(kmer_dat10000, aes(Genome, Reads)) + geom_point() + stat_function(fun = function(x) x, 
    colour = "white")
p

plot of chunk test_more_reads

Indeed, this looks much better. There is probably a way to analytically determine the error distribution as a function of number of reads and read length, but that is beyond my meager math skills at the moment. This would be very useful down the line in this project, so I will do some thinking on it. If anyone happens to be reading this, and know of any mathematical work on this type of question, please let me know!

The next step will be to simulate “communities” of genomes by shuffling genomes together into groups. Then for each group or community, simulate drawing short reads from the genome, where the chance of a read being drawn from a genome is proportional to its relative abundance in the sample. Then we can see if the frequency of k-mers in the reads effectively approximates the frequency we would expect based on the abundances of the source genomes in each community. First, let’s create a random community from the 50 genomes.

library(MCMCpack)
read_sample <- 10000  ## number of reads total
gen_props <- rdirichlet(1, rep(0.2, 50)) %>% as.vector  ## genome proportions 
gen_props
 [1] 2.570e-05 2.050e-03 2.386e-06 1.168e-02 3.481e-02 6.482e-02 1.280e-06
 [8] 2.747e-02 6.997e-02 3.941e-03 5.443e-03 5.141e-03 2.651e-02 1.069e-01
[15] 3.670e-05 5.031e-02 8.158e-02 4.573e-03 2.354e-03 3.898e-04 8.697e-03
[22] 6.001e-03 5.143e-09 3.195e-03 6.775e-05 8.880e-09 2.104e-03 2.006e-05
[29] 4.826e-05 2.824e-03 1.211e-01 1.704e-02 3.368e-03 6.660e-08 6.069e-02
[36] 2.582e-07 4.993e-03 1.092e-04 1.327e-01 4.679e-04 5.409e-10 5.856e-03
[43] 1.227e-11 6.664e-02 1.936e-05 4.828e-08 8.069e-03 2.706e-05 5.796e-02
[50] 7.126e-06
gen_reads <- gen_props %>% rmultinom(1, read_sample, .) %>% as.vector  ## number of reads to draw from each genome (assuming probability of sampling a read from a genome is equal to that genomes frequency 
gen_reads
 [1]    0   17    0  102  365  632    0  254  689   29   48   45  267 1095
[15]    0  512  826   45   28    5   81   68    0   27    0    0   15    0
[29]    0   21 1177  169   28    0  613    0   41    2 1347    5    0   52
[43]    0  700    0    0   83    0  612    0

So, given that the genomes are present in proportion gen_props, the expected k-mer frequencies should just be the mean of the frequencies of the k-mer in each genome, weighted by the proportion of the genomes in the sample.

e_freq <- (oligonucleotideFrequency(genome_dna, width = 5) * gen_props) %>% 
    apply(2, mean)  ## this takes advantage of R's column-wise multiplication
e_freq <- e_freq/sum(e_freq)
e_freq[1:50]
    AAAAA     AAAAC     AAAAG     AAAAT     AAACA     AAACC     AAACG 
0.0021957 0.0014241 0.0019685 0.0013932 0.0015321 0.0012461 0.0008937 
    AAACT     AAAGA     AAAGC     AAAGG     AAAGT     AAATA     AAATC 
0.0013215 0.0021163 0.0016355 0.0014999 0.0012978 0.0008946 0.0011653 
    AAATG     AAATT     AACAA     AACAC     AACAG     AACAT     AACCA 
0.0012016 0.0012103 0.0012362 0.0010899 0.0017053 0.0010574 0.0012485 
    AACCC     AACCG     AACCT     AACGA     AACGC     AACGG     AACGT 
0.0011092 0.0007295 0.0010600 0.0010966 0.0007687 0.0009192 0.0006986 
    AACTA     AACTC     AACTG     AACTT     AAGAA     AAGAC     AAGAG 
0.0006071 0.0010928 0.0015891 0.0009377 0.0020713 0.0014248 0.0018206 
    AAGAT     AAGCA     AAGCC     AAGCG     AAGCT     AAGGA     AAGGC 
0.0014193 0.0014978 0.0013178 0.0009096 0.0015677 0.0019941 0.0012936 
    AAGGG     AAGGT     AAGTA     AAGTC     AAGTG     AAGTT     AATAA 
0.0011695 0.0009993 0.0006491 0.0009721 0.0014772 0.0010669 0.0007675 
    AATAC 
0.0009110 

Now let’s see what we get from analyzing the k-mer frequencies from reads drawn from these communities.

comm_reads <- mapply(gen_segs, as.character(genome_dna), 90, gen_reads) %>% 
    DNAStringSetList %>% unlist
comm_read_kmers <- oligonucleotideFrequency(comm_reads, width = 5, as.prob = TRUE)
comm_read_mean <- apply(comm_read_kmers, 2, mean)

kmer_dat_comm <- data.frame(Genome = as.vector(e_freq), Reads = comm_read_mean) %>% 
    tbl_df
p <- ggplot(kmer_dat_comm, aes(Genome, Reads)) + geom_point() + stat_function(fun = function(x) x, 
    colour = "white")
p

plot of chunk comm_read_kmers

The trouble with this is, if the genomes are all very similar, then it won’t matter which genomes are prevailent in the mixture, the k-mer count will be pretty similar no matter what/ Let’s try comparing the expected k-mer frequencies with read k-mer frequencies generated from a random different community.

gen_props_new <- rdirichlet(1, rep(0.2, 50)) %>% as.vector
gen_props_new
 [1] 5.051e-05 2.293e-05 1.187e-01 1.027e-03 2.247e-03 2.599e-02 7.563e-02
 [8] 4.638e-05 1.748e-07 1.629e-02 1.201e-03 4.478e-04 5.542e-02 2.472e-06
[15] 8.654e-09 2.061e-03 1.808e-02 2.114e-02 2.983e-03 1.595e-05 1.277e-03
[22] 7.426e-06 1.355e-01 1.023e-02 4.307e-21 3.451e-04 3.104e-02 3.552e-05
[29] 5.083e-05 1.930e-04 1.204e-02 2.857e-03 7.444e-09 2.298e-15 1.695e-02
[36] 1.203e-08 1.493e-06 5.071e-02 4.768e-02 2.538e-03 2.171e-02 4.261e-04
[43] 3.592e-02 3.180e-06 1.808e-01 6.537e-02 1.217e-03 7.009e-08 3.983e-03
[50] 3.775e-02
gen_reads_new <- gen_props_new %>% rmultinom(1, read_sample, .) %>% as.vector
comm_reads_new <- mapply(gen_segs, as.character(genome_dna), 90, gen_reads_new) %>% 
    DNAStringSetList %>% unlist
comm_read_kmers_new <- oligonucleotideFrequency(comm_reads_new, width = 5, as.prob = TRUE)
comm_read_mean_new <- apply(comm_read_kmers_new, 2, mean)

kmer_dat_comm_new <- data.frame(Genome = as.vector(e_freq), Reads = comm_read_mean_new) %>% 
    tbl_df
p <- ggplot(kmer_dat_comm_new, aes(Genome, Reads)) + geom_point() + stat_function(fun = function(x) x, 
    colour = "white")
p

plot of chunk compare_diff

So it looks like the assumptions that in my mind would be necessary for Latent Dirichlet Allocation models to be suitable for metagenomics read data are satisfied more or less. A model of the form

should be applicable. The reads are at least an unbiased estimator of the quantity being modelled, although the model, if left unmodified will not characterize the uncertainty correctly. This is a problem to be dealt with later. In any event, we are ready to try LDA on some simulated data, and see what we get. There will be a number of challenges to solve. In particular, there will be a great amount of redundancy in metagenomics data, which might affect the ability of LDA to settle on a reasonable model. Additionally, we have to choose a word-length \(K\) to convert genomes into a set of (overlapping) words. Out of curiousity, I wonder how the above analyses are effected: