By Scott Edmunds
There’s gold in those old databases. Analyses of genomic data often miss a large amount of information, but genome scientists at UC Davis have now created an automated analysis pipeline to dig out this hidden information.
In a new study published in the journal GigaScience, the researchers mine the huge dataset from the Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP) to find new results.
Previous work on the MMETSP sequenced 678 transcriptomes (a transcriptome is the set of all RNA sequences expressed in an organism) and assembled genes spanning 396 different strains of marine eukaryotes. This dataset has been an invaluable resource for ocean science, greatly expanding the accessible genetic information base of marine protistan life.
But in the five years since the original analysis, tools, techniques and databases have all improved. Reanalysis of previously generated data with new tools is not commonplace, and it is unclear what the best practice would be. Running analyses again produces different results, and the effects of using different workflows, or “pipelines,” are poorly understood, making it difficult to judge the usefulness of new results relative to previous findings.
C. Titus Brown, associate professor in the Department of Population Health and Reproduction, UC Davis School of Veterinary Medicine, graduate student Lisa Johnson and postdoctoral researcher Harriet Alexander went back to the original raw data from MMETSP and created an automated pipeline to assemble and annotate it. The resulting new transcriptome assemblies were then automatically evaluated and compared against previously generated results from the original assembly pipeline developed by the National Center for Genome Resources. As there is no one-size-fits-all protocol for transcriptome assembly, and as software tools are constantly improving, Brown and colleagues’ pipeline enabled improvements to be tested and quantified.
New genes in old data
The new assemblies they generated contained the majority of the previous data as well as new content. On average, 7.8 percent of the annotated sequence in the new assemblies had novel gene names not found in the historical assemblies, demonstrating that new findings can be gleaned from old data. An accompanying commentary in GigaScience summarizes the important lessons learned from this work.
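To make the "novel gene names" figure concrete, here is a minimal Python sketch of how such a share could be computed from two lists of annotated gene names. It is an illustration only, not the authors' pipeline: the function name `novel_name_fraction` and the example gene names are hypothetical, and the study's own metric may be defined differently (for example, per annotated sequence rather than per unique name).

```python
def novel_name_fraction(new_names, old_names):
    """Return the fraction of unique gene names in the new annotation
    that do not appear in the old annotation."""
    new_set = set(new_names)
    if not new_set:
        # Nothing annotated in the new assembly, so nothing can be novel.
        return 0.0
    novel = new_set - set(old_names)
    return len(novel) / len(new_set)


# Hypothetical gene names, for illustration only.
old = ["actin", "tubulin", "rbcL"]
new = ["actin", "tubulin", "rbcL", "hsp70"]

print(f"{novel_name_fraction(new, old):.1%}")  # prints 25.0%
```

In this toy case, one of the four names in the new annotation is absent from the old one, so the novel share is 25 percent. The study reports an average of 7.8 percent across its reassembled transcriptomes.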
Raw sequencing data is commonly shared and well cared for by government-funded public archives, but the resulting assembled genomes, annotations and results generally are not. The authors have placed their results in the public Zenodo repository hosted by CERN, and snapshots from the study are archived in the GigaScience GigaDB database.
The work demonstrates that researchers need to make these products “forward discoverable,” automatically notifying users when a dataset is updated or changed. For researchers low on resources, this would make it possible to improve downstream work without significant additional funding, experimentation or sequencing.
This research was the winner of the second GigaScience ICG prize track last month at the International Conference on Genomics in Shenzhen. Johnson presented her work at the meeting and received the $1,000 prize and trophy.
Scott Edmunds is executive editor of GigaScience. Adapted from a blog post published by the journal.