Publication:
Integration of RNA-Seq data with heterogeneous microarray data for breast cancer profiling.

dc.contributor.authorCastillo, Daniel
dc.contributor.authorGálvez, Juan Manuel
dc.contributor.authorHerrera, Luis Javier
dc.contributor.authorRomán, Belén San
dc.contributor.authorRojas, Fernando
dc.contributor.authorRojas, Ignacio
dc.date.accessioned2023-01-25T10:01:31Z
dc.date.available2023-01-25T10:01:31Z
dc.date.issued2017-11-21
dc.description.abstractMany public repositories containing large microarray gene expression datasets are now available. However, microarray technology is less powerful and less accurate than more recent Next Generation Sequencing technologies such as RNA-Seq. Even so, information from microarrays is reliable and robust, so it can be exploited by integrating microarray data with RNA-Seq data. Additionally, extracting information from and acquiring a large number of RNA-Seq samples still entails very high costs in terms of time and computational resources. This paper proposes a new model to find the gene signature of breast cancer cell lines through the integration of heterogeneous data from different breast cancer datasets, obtained from microarray and RNA-Seq technologies. Data integration is expected to lend more robust statistical significance to the results obtained. Finally, a classification method is proposed in order to test the robustness of the Differentially Expressed Genes when unseen data is presented for diagnosis. The proposed data integration allows the analysis of gene expression samples coming from different technologies. The most significant genes of the whole integrated data were obtained through the intersection of three gene sets, corresponding to the expressed genes identified within the microarray data alone, within the RNA-Seq data alone, and within the integrated data from both technologies. This intersection reveals 98 possible technology-independent biomarkers. Two different heterogeneous datasets were distinguished for the classification tasks: a training dataset for gene expression identification and classifier validation, and a test dataset with unseen data for testing the classifier. Both achieved high classification accuracy, supporting the validity of the obtained set of genes as possible biomarkers for breast cancer.
Through a feature selection process, a final small subset made up of six genes was considered for breast cancer diagnosis. This work proposes a novel data integration stage in the traditional gene expression analysis pipeline through the combination of heterogeneous data from microarray and RNA-Seq technologies. Available samples have been successfully classified using a subset of six genes obtained by a feature selection method. Consequently, a new classification and diagnosis tool was built and its performance was validated using previously unseen samples.
dc.identifier.doi10.1186/s12859-017-1925-0
dc.identifier.essn1471-2105
dc.identifier.pmcPMC5697344
dc.identifier.pmid29157215
dc.identifier.pubmedURLhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC5697344/pdf
dc.identifier.unpaywallURLhttps://bmcbioinformatics.biomedcentral.com/track/pdf/10.1186/s12859-017-1925-0
dc.identifier.urihttp://hdl.handle.net/10668/11821
dc.issue.number1
dc.journal.titleBMC bioinformatics
dc.journal.titleabbreviationBMC Bioinformatics
dc.language.isoen
dc.organizationIBS
dc.page.number506
dc.pubmedtypeJournal Article
dc.rightsAttribution 4.0 International
dc.rights.accessRightsopen access
dc.rights.urihttp://creativecommons.org/licenses/by/4.0/
dc.subjectBreast cancer
dc.subjectCancer
dc.subjectClassification
dc.subjectGene expression
dc.subjectIntegration
dc.subjectMicroarray
dc.subjectRNA-Seq
dc.subjectRandom Forest
dc.subjectSVM
dc.subjectk-NN
dc.subject.meshAlgorithms
dc.subject.meshBreast Neoplasms
dc.subject.meshCluster Analysis
dc.subject.meshDatabases, Genetic
dc.subject.meshFemale
dc.subject.meshGene Expression Profiling
dc.subject.meshGene Expression Regulation, Neoplastic
dc.subject.meshHumans
dc.subject.meshOligonucleotide Array Sequence Analysis
dc.subject.meshReproducibility of Results
dc.subject.meshSequence Analysis, RNA
dc.titleIntegration of RNA-Seq data with heterogeneous microarray data for breast cancer profiling.
dc.typeresearch article
dc.type.hasVersionVoR
dc.volume.number18
dspace.entity.typePublication
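
The abstract describes obtaining the candidate biomarkers as the intersection of three gene sets (microarray only, RNA-Seq only, and integrated data). A minimal sketch of that set operation is shown below. The function name and the gene identifiers are hypothetical placeholders chosen for illustration; they are not taken from the paper, and the sketch does not reproduce the authors' pipeline or their differential expression analysis.

```python
def technology_independent_genes(microarray_degs, rnaseq_degs, integrated_degs):
    """Return the genes present in all three differentially expressed gene sets."""
    return set(microarray_degs) & set(rnaseq_degs) & set(integrated_degs)


# Hypothetical gene identifiers, used only to illustrate the intersection.
microarray = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
rnaseq = {"GENE_B", "GENE_C", "GENE_D", "GENE_E"}
integrated = {"GENE_C", "GENE_D", "GENE_F"}

shared = technology_independent_genes(microarray, rnaseq, integrated)
print(sorted(shared))  # ['GENE_C', 'GENE_D']
```

Only genes that appear in every input set survive, which is why the result can be read as independent of the measuring technology.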

Files

Original bundle

Name:
PMC5697344.pdf
Size:
3.01 MB
Format:
Adobe Portable Document Format