Uniform genomic data analysis in the NCI Genomic Data Commons.

Zhang, Zhenyu; Hernandez, Kyle; Savage, Jeremiah; Li, Shenglai; Miller, Dan; Agrawal, Stuti; Ortuno, Francisco; Staudt, Louis M; Heath, Allison; Grossman, Robert L

dc.contributor.author	Zhang, Zhenyu
dc.contributor.author	Hernandez, Kyle
dc.contributor.author	Savage, Jeremiah
dc.contributor.author	Li, Shenglai
dc.contributor.author	Miller, Dan
dc.contributor.author	Agrawal, Stuti
dc.contributor.author	Ortuno, Francisco
dc.contributor.author	Staudt, Louis M
dc.contributor.author	Heath, Allison
dc.contributor.author	Grossman, Robert L
dc.date.accessioned	2023-02-09T10:43:01Z
dc.date.available	2023-02-09T10:43:01Z
dc.date.issued	2021-02-22
dc.description.abstract	The goal of the National Cancer Institute's (NCI's) Genomic Data Commons (GDC) is to provide the cancer research community with a data repository of uniformly processed genomic and associated clinical data that enables data sharing and collaborative analysis in support of precision medicine. The initial GDC dataset includes genomic, epigenomic, proteomic, clinical and other data from the NCI TCGA and TARGET programs. Data production for the GDC started in June 2015 using an OpenStack-based private cloud. By June 2016, the GDC had analyzed more than 50,000 raw sequencing data inputs, as well as multiple other data types. Using the latest human genome reference build, GRCh38, the GDC generated a variety of data types, from aligned reads to somatic mutations, gene expression, miRNA expression, DNA methylation status, and copy number variation. In this paper, we describe the pipelines and workflows used to process and harmonize the data in the GDC. The generated data, as well as the original input files from TCGA and TARGET, are available for download and exploratory analysis at the GDC Data Portal and Legacy Archive (https://gdc.cancer.gov/).
dc.identifier.doi	10.1038/s41467-021-21254-9
dc.identifier.essn	2041-1723
dc.identifier.pmc	PMC7900240
dc.identifier.pmid	33619257
dc.identifier.pubmedURL	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7900240/pdf
dc.identifier.unpaywallURL	https://www.nature.com/articles/s41467-021-21254-9.pdf
dc.identifier.uri	http://hdl.handle.net/10668/17217
dc.issue.number	1
dc.journal.title	Nature communications
dc.journal.titleabbreviation	Nat Commun
dc.language.iso	en
dc.organization	Fundación Pública Andaluz Progreso y Salud-FPS
dc.page.number	1226
dc.pubmedtype	Journal Article
dc.pubmedtype	Research Support, N.I.H., Extramural
dc.rights	Attribution 4.0 International
dc.rights.accessRights	open access
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/
dc.subject.mesh	Base Sequence
dc.subject.mesh	DNA Copy Number Variations
dc.subject.mesh	DNA Methylation
dc.subject.mesh	Data Analysis
dc.subject.mesh	Databases, Genetic
dc.subject.mesh	Gene Expression Regulation
dc.subject.mesh	Genome, Human
dc.subject.mesh	Genomics
dc.subject.mesh	Humans
dc.subject.mesh	MicroRNAs
dc.subject.mesh	Molecular Sequence Annotation
dc.subject.mesh	Mutation
dc.subject.mesh	National Cancer Institute (U.S.)
dc.subject.mesh	RNA-Seq
dc.subject.mesh	Reproducibility of Results
dc.subject.mesh	United States
dc.subject.mesh	Viruses
dc.title	Uniform genomic data analysis in the NCI Genomic Data Commons.
dc.type	research article
dc.type.hasVersion	VoR
dc.volume.number	12
dspace.entity.type	Publication