Generated an .rda file from the huge (about 8 GB) CSV of Return Time Distribution (RTD) data (for all k-mers up to k = 6) for all International Barcode of Life COI insect sequences, and put it into the data directory of the WBHC project package. #data #WBHC
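For my own reference, a minimal sketch of the CSV-to-.rda conversion; the file names and object name below are placeholders rather than the actual WBHC files, and data.table::fread is just one reasonable choice for a file this size:

```r
## Sketch only: file names and object name are placeholders.
library(data.table)

## fread() copes with multi-GB CSVs far better than read.csv()
rtd_insect_coi <- fread("rtd_insect_coi_k6.csv")

## xz compression keeps the .rda small (at the cost of a slow save)
save(rtd_insect_coi, file = "data/rtd_insect_coi_k6.rda", compress = "xz")
```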
Using a fork of @cboettig's project package template, started setting up the project package for the Word-Based Hierarchical Classification (WBHC) project. Briefly, this is a project to identify COI sequences to different taxonomic levels using machine learning on features that are statistical summaries of word occurrences in the sequences. The main challenge is dealing with the massive sample imbalance between the taxa represented in the training data, which is notoriously difficult for machine learning. My solution is to recursively break the training data into binary groups that are as close as possible in sample size, train separate models on each subset, and then combine the predictions of all of them in a useful way (sketched below). #open-science #WBHC
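To make the splitting idea concrete, here is a toy sketch of one way the recursive balanced partition could work; it is a greedy approximation, and the function and taxon names are made up for illustration, not the WBHC code:

```r
## Toy sketch: greedily split taxa into two groups with roughly equal
## total sample sizes, then recurse within each group. The nested list
## returned mirrors the hierarchy of binary classifiers to be trained.
split_tree <- function(counts) {
  if (length(counts) <= 1) return(names(counts))   # leaf: a single taxon
  grp_size <- c(0, 0)
  grp <- integer(length(counts))
  for (i in order(counts, decreasing = TRUE)) {    # largest taxa first
    g <- which.min(grp_size)                       # add to the lighter group
    grp_size[g] <- grp_size[g] + counts[i]
    grp[i] <- g
  }
  list(split_tree(counts[grp == 1]), split_tree(counts[grp == 2]))
}

## Example with deliberately imbalanced (made-up) taxon counts
counts <- c(Lepidoptera = 50000, Diptera = 30000, Coleoptera = 8000,
            Hymenoptera = 6000, Hemiptera = 500)
str(split_tree(counts))
```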
Afternoon
Had to cancel my meeting to take my car to the mechanic. This will likely eat up my afternoon :(
Problems
I am trying to figure out the best way to incorporate open data into the project package when the data is really large. I have a data file which is about 1 GB compressed. If I include it in the package and GitHub repo, every user will have to download 1 GB of data. This seems crazy if, for example, someone just wants to use some of the functions in the package for their own project. What I am thinking of doing is hosting the file outside the GitHub repo, perhaps on figshare or Dryad, and then having a dataset.R file in the data directory which uses R tools to download the data. The only trouble with this is that it would rerun the download every time you ran data(dataset). Perhaps I should instead have an R function in the package which downloads the file into the data directory if the user chooses, and simply loads the data if it is already there (sketched below)? #issues
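Something like the following is what I have in mind. This is a rough sketch only; the URL, file path, and function name are placeholders, not anything settled:

```r
## Sketch only: URL, file path and function name are placeholders.
get_rtd_data <- function(dest = file.path("data", "rtd_insect_coi_k6.rda"),
                         url  = "https://example.org/rtd_insect_coi_k6.rda") {
  if (!file.exists(dest)) {
    message("Data not found locally; downloading (~1 GB), this may take a while...")
    download.file(url, destfile = dest, mode = "wb")
  }
  env <- new.env()
  load(dest, envir = env)      # load() restores the saved object(s)
  env$rtd_insect_coi           # return the RTD data
}
```

The point of this design is that the download only happens the first time the function is called; after that, every call just loads the local copy from the data directory.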