We are often asked to provide advice or assistance building plant DNA reference libraries for use in dietary metabarcoding projects. To begin centralizing info on our methods and sharing some important lessons learned from experience, I have created a section on the lab's wiki for building plant barcode libraries. I will treat the Google Docs that you can link to from there as living documents. All of the details provided are nested within two main goals. The first goal is to collect plant voucher specimens and plant DNA barcode samples that match in ways that can be clearly documented through their respective metadata sheets. This is critical for the long-term value of the data. The second goal is to ensure that the work done by field biologists and molecular biologists is mutually informative -- the best reference libraries are developed through the meaningful engagement of expert botanists who are knowledgeable in a local flora and the researchers who will be analyzing the laboratory data.
We love to archive relevant vouchers in the Brown University Herbarium. Please keep in mind that the herbarium is staffed by expert botanists. Properly collected specimens can be mounted, archived, and digitized by professional staff -- this greatly reduces the cost and complexity of fieldwork.
Following the recent publication of our plant DNA barcode library from Mpala Research Centre, Kenya, led by Brian Gill, we are happy to provide a set of files to serve as our local trnL-P6 reference library (version 2.0). These files were carefully prepared by Courtney Reed, to whom we are most grateful.
Bianca Brown began the hard work of collating the scripts we use to process fastq data from the lab's diverse Illumina amplicon projects. These strategies, and a draft explanation of why we use different "flavors" of these approaches for different projects, are provided here.
Modules included in the tutorial include "cutadapt," "dada2," and "R," with some references to "Obitools" and Brown University's supercomputing cluster "Oscar."
Many of the steps and principles of these workflows are identical -- we want to thoughtfully prepare our data for analysis and remove errors -- but a few of the nuts and bolts differ. Most often, these differences arise from whether a project used single-end sequence data (formerly common) or paired-end sequence data (now standard in the lab). The approaches also differ depending on whether the amplicons are typically invariable in length (e.g., 16S-V4 rRNA or COI markers) or vary considerably in length (e.g., trnL-P6 markers).
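To make the fixed-length versus variable-length distinction concrete, here is a minimal Python sketch of how one might profile the lengths in a set of amplicon sequences before choosing a workflow "flavor." It is not taken from the lab's scripts: the function name `length_profile`, the 5% tolerance, and the 95% cutoff are all hypothetical choices made for illustration, and sensible values would depend on the marker.

```python
from statistics import median


def length_profile(seqs, tolerance=0.05, cutoff=0.95):
    """Summarize amplicon lengths and suggest a workflow flavor.

    `tolerance` and `cutoff` are illustrative assumptions, not values
    from the lab's pipeline: a marker is called "fixed-length" when at
    least `cutoff` of the sequences fall within `tolerance` (as a
    fraction of the median length) of the median.
    """
    if not seqs:
        raise ValueError("no sequences supplied")
    lengths = [len(s) for s in seqs]
    med = median(lengths)
    near = sum(1 for n in lengths if abs(n - med) <= tolerance * med)
    fraction = near / len(lengths)
    return {
        "min": min(lengths),
        "median": med,
        "max": max(lengths),
        "fraction_near_median": fraction,
        "flavor": "fixed-length" if fraction >= cutoff else "variable-length",
    }


# Toy inputs with made-up lengths, purely for illustration.
uniform = ["A" * 253] * 19 + ["A" * 252]
mixed = ["A" * n for n in (10, 40, 80, 120, 143)]
print(length_profile(uniform)["flavor"])
print(length_profile(mixed)["flavor"])
```

A profile like this only hints at which approach may fit; steps such as truncating reads to a single length tend to suit markers with little length variation and can discard real sequences when lengths vary widely.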
For members of Brown University seeking to run parts of these modules on Oscar, Bianca has very kindly provided some blank bash scripts that can get you started here.
NB: This compilation of scripts is a work in progress. We are aware of necessary updates and improvements, and we intend to push them soon. We'll add posts describing any substantial updates in the future, and we welcome feedback.
We also wish to express our appreciation to all of the authors of the software that we use and cite in our work.
Computational resources kindly contributed and explained by members of our community.