White Lab

Peter White, PhD, is a principal investigator in the Center for Microbial Pathogenesis at The Research Institute at Nationwide Children’s Hospital and an Assistant Professor of Pediatrics at The Ohio State University. He is also Director of Molecular Bioinformatics, serving on the research computing executive governance committee, and Director of the Biomedical Genomics Core, a nationally recognized microarray and next-generation sequencing facility that helps numerous investigators design, perform and analyze genomics research. His research program focuses on molecular bioinformatics and high performance computing solutions for "big data", including the discovery of disease-associated human genetic variation and understanding the molecular mechanisms of transcriptional regulation in both eukaryotes and prokaryotes.

Informatics Services

The BGC Bioinformatics Unit is composed of a dynamic team of computational biologists with the substantial technical and bioinformatics expertise required to oversee the multiple platforms that acquire, store and analyze the very large and complex data sets generated by the BGC Microarray and Sequencing Units. The unit provides advanced bioinformatics analysis on a collaborative basis and serves as an interface between the research investigator and the multiple domains required to handle the size and complexity of genomic data, including Research Information Services (RIS) and the Ohio Supercomputing Center (OSC). With our high performance compute cluster and over 100 TB of clustered high performance disk space, we are able to support the analysis of both large- and small-scale sequencing projects.

As part of the research group of Dr. Peter White, a major research focus for our team is the development of analytical pipelines for "big data", including:

  • Human genome resequencing and the identification of disease causing genetic variants
  • Bacterial genomics, including RNA-Seq, ChIP-Seq and Tn-Seq
  • Re-sequencing and assembly of bacterial genome sequencing data
  • Analysis pipeline development, automation and optimization
[Informatics workflow diagram]

Current BGC Bioinformatics Unit Services:

Primary Data Analysis

Primary analysis typically describes the process by which instrument-specific sequencing measurements are converted into files containing the short read sequence data, including generation of sequencing run quality control metrics. For example, for the Illumina HiSeq 2000 platform, primary analysis encompasses image analysis and base calling to generate base call (BCL) files, the conversion of these files into actual reads with base-quality scores (BCL conversion), and the generation of numerous metrics that allow assessment of the quality of the run.
Primary data analysis is provided for all projects utilizing Next Generation Sequencing data generated by the BGC Sequencing Unit.
The community has adopted a human-readable primary analysis output format called Sanger FASTQ, in which each record contains a read identifier, the sequence of bases, and a PHRED-like quality score Q for each base, encoded as a single ASCII character to reduce the output file size. A typical lane of output from our sequencer will contain upwards of 180 million reads.
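To make the Sanger FASTQ encoding concrete, the sketch below parses one four-line record and decodes its quality string; in the Sanger convention Q is the ASCII code of the character minus 33, so 'I' (ASCII 73) represents Q40. This is purely illustrative (the read identifier and sequence are made up), not BGC pipeline code.

```python
def parse_fastq_record(lines):
    """Parse one 4-line FASTQ record into (identifier, sequence, qualities)."""
    identifier = lines[0].lstrip("@").strip()
    sequence = lines[1].strip()
    # lines[2] is the '+' separator line; lines[3] is the quality string.
    # Sanger encoding: Q = ASCII code - 33
    qualities = [ord(c) - 33 for c in lines[3].strip()]
    return identifier, sequence, qualities

record = [
    "@READ_001",
    "GATTACA",
    "+",
    "IIIIIII",  # seven bases, all Q40 (error probability 10^-4 each)
]
name, seq, quals = parse_fastq_record(record)
```

A Q score of 40 corresponds to a base-call error probability of 10^(-40/10) = 0.0001, which is why a single printable character per base is sufficient.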

Secondary Data Analysis

Secondary Data Analysis is currently ONLY routinely provided as a core service for whole human genome resequencing analysis and exome capture and sequencing. For these data sets, secondary analysis encompasses both alignment of these sequence reads to the human reference genome and utilization of this alignment to detect differences between the patient sample and the reference. Analysis is performed using the state of the art Churchill approach (Patent Pending) developed by the White lab, which performs the complex and computationally intensive processes of:

  • alignment of raw reads to the human reference genome
  • post-alignment processing and duplicate read removal
  • local realignment around variant positions to improve both SNP and INDEL calling
  • base quality score recalibration to improve genotyping accuracy
  • single sample variant calling and genotyping to create a final variant call file (VCF) for each sample
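The five steps above can be sketched as a staged pipeline in which each stage consumes the previous stage's output and the final artifact is the per-sample VCF. The stage names and file suffixes here are illustrative assumptions for this sketch; Churchill's actual implementation is not shown.

```python
# Each stage is a stand-in that maps an input file name to its output file
# name; a real pipeline would invoke the aligner, duplicate remover, etc.
def align(fastq):          return fastq.replace(".fastq", ".bam")
def mark_duplicates(bam):  return bam.replace(".bam", ".dedup.bam")
def realign_indels(bam):   return bam.replace(".bam", ".realn.bam")
def recalibrate(bam):      return bam.replace(".bam", ".recal.bam")
def call_variants(bam):    return bam.replace(".bam", ".vcf")

def secondary_analysis(fastq):
    """Run the five stages in order, returning every intermediate file name."""
    artifacts = [fastq]
    for stage in (align, mark_duplicates, realign_indels,
                  recalibrate, call_variants):
        artifacts.append(stage(artifacts[-1]))
    return artifacts

files = secondary_analysis("sample.fastq")
```

The fixed stage ordering matters: duplicate removal and realignment must precede recalibration, and recalibration must precede genotyping, since each stage's corrections feed the next.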

The final results of secondary analysis for human genetic variant discovery include:

  • Recalibrated BAM files which are viewable in the Integrative Genomics Viewer separated into:
    • Reads mapping to unique locations
    • Reads mapping to multiple locations (MAPQ of 0)
    • Unmapped reads
  • Raw and recalibrated variant call files (VCF)
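The three-way split of the BAM output can be illustrated with the standard SAM conventions: the 0x4 flag bit marks an unmapped read, and (for aligners that follow this convention) a MAPQ of 0 marks a read that aligned equally well to multiple locations. This is a hypothetical sketch, not the BGC's actual code.

```python
UNMAPPED_FLAG = 0x4  # SAM format: "segment unmapped" flag bit

def classify_read(flag, mapq):
    """Return which of the three output BAM files a read belongs to."""
    if flag & UNMAPPED_FLAG:
        return "unmapped"
    if mapq == 0:
        return "multi_mapped"  # aligner could not pick a unique location
    return "unique"

# e.g. a confidently placed read, an ambiguously placed one, and an
# unmapped one
labels = [classify_read(0, 60), classify_read(0, 0), classify_read(4, 0)]
```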

Tertiary Data Analysis

Tertiary data analysis diverges into the broad spectrum of study-specific downstream investigation, and encompasses the advanced, in-depth processes of making biological sense of the secondary data analysis output.

TERTIARY DATA ANALYSIS IS NOT PROVIDED AS A CORE FACILITY SERVICE

Collaboration with the White Lab (within which the BGC is located) allows our bioinformatics team members to get involved in the tertiary data analysis process. This provides an avenue by which the needs of the investigators involved in the study can be met, along with the need of members of our group to remain intellectually challenged by and connected to a given project. This level of commitment is STRICTLY collaborative, with the expectation of co-authorship on resulting publications (see our policy for further details). Depending upon the size of the project, such involvement typically requires salary support; investigators are therefore strongly encouraged to include such support as part of their grant submissions.
Some examples of projects the White lab is currently working on in collaboration with investigators here at The Research Institute, OSU and Battelle include:

  • Utilize cutting edge next-generation sequencing technologies, innovative bioinformatics and statistical approaches, and advanced molecular biological techniques to identify novel genetic etiologies for congenital heart defects in humans (NIH R01-HL109758: White/McBride/Garg)
  • Comprehensive Clinical Phenotyping and Genetic Mapping for the Discovery of Autism Susceptibility Genes (Air Force Medical Service: Herman)
  • Characterization of host and Salmonella polymicrobial interactions using high-throughput parallel sequencing for fitness and genetic interaction studies (Tn-Seq) and Transposon Site Hybridization (TraSH) (NIH R01-AI073971: Ahmer/White)
  • Transcriptomic analysis of host-pathogen interactions (NCHRI: White / King)
  • Evaluation of standard reference sample types for Next Generation Sequence-based genotyping and forensic analysis (Battelle: Faith / White)


Churchill

Download Churchill

Next generation sequencing (NGS) has revolutionized genetic research, empowering dramatic increases in the discovery of new functional variants in both syndromic and common diseases. The technology has been widely adopted by the research community and is now seeing rapid clinical adoption, driven by recognition of NGS’s diagnostic utility and enhancements in the quality and speed of data acquisition. Compounded by declining sequencing costs, this exponential growth in data generation has created a computational bottleneck: current analysis approaches can take weeks to complete, resulting in bioinformatics overheads that exceed raw sequencing costs and represent a significant limitation for those utilizing the technology.

Churchill is a computational approach that overcomes these challenges. Through implementation of novel parallelization techniques we have dramatically reduced the analysis time for deep whole human genome resequencing from weeks to hours, without the need for specialized analysis equipment or supercomputers. Our approach fully automates the analytical process required to take raw sequencing instrument output through the complex and computationally intensive process of alignment, post-alignment processing, local realignment, recalibration and genotyping. Furthermore, through optimized utilization of available compute resources, the pipeline scales in a near linear fashion, enabling complete human genome resequencing analysis in ten hours with a single server, three hours with our in-house cluster and under 100 minutes using a larger HPC cluster.
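The near-linear scaling described above rests on a general idea: divide the genome into regions and process them concurrently, so total analysis time shrinks roughly in proportion to the number of workers. The toy sketch below shows region splitting and concurrent per-region work; the region sizes, the worker body, and the use of threads (a real implementation would distribute work across processes and nodes) are illustrative assumptions, not Churchill's actual algorithm.

```python
from concurrent.futures import ThreadPoolExecutor

def split_regions(chrom_length, n_chunks):
    """Split one chromosome into n_chunks half-open (start, end) intervals."""
    step = -(-chrom_length // n_chunks)  # ceiling division
    return [(start, min(start + step, chrom_length))
            for start in range(0, chrom_length, step)]

def process_region(region):
    """Stand-in for per-region alignment/variant-calling work."""
    start, end = region
    return end - start  # e.g. number of bases handled in this region

# Split a hypothetical 1 Mb chromosome into 8 regions and process them
# concurrently; every base is handled exactly once.
regions = split_regions(chrom_length=1_000_000, n_chunks=8)
with ThreadPoolExecutor(max_workers=8) as pool:
    handled = sum(pool.map(process_region, regions))
```

Because the regions are disjoint and cover the whole chromosome, the per-region results can be merged into a single final output, which is what lets the approach scale from one server to a large HPC cluster.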