Original Articles

Integrating background knowledge from internet databases into predictive toxicology models

Pages 21-35 | Received 06 Jul 2009, Accepted 18 Nov 2009, Published online: 06 Apr 2010
 

Abstract

While data integration for data analysis has been investigated extensively in biological applications, it has received far less attention in computational chemistry and quantitative structure–activity relationship (QSAR) research. With the growing number of chemical databases available on the web, such data integration efforts become an intriguing possibility (and, in fact, a necessity). In this paper, we take a first step towards the following scenario for predictive toxicology applications. Given a new structure whose activity is to be predicted, the first step is to gather (integrate) all relevant information from internet databases, both for the structure itself and for all structures with available information on the endpoint of interest. In a second step, the collected information is combined statistically into a prediction for the new structure. We simulate this scenario with three endpoints (data sets) from the DSSTox database and collect information from three public chemical databases: PubChem, ChemBank and Sigma-Aldrich. In the experiments, we investigate whether adding background knowledge from the three databases improves predictive performance over using chemical structure alone, and whether the improvement is statistically significant. For this purpose, we define groups of features from the three databases (features that belong together from an application point of view) and perform a variant of forward selection to include these feature groups in a prediction model. Our experiments show that the integration of background knowledge from internet databases can significantly improve prediction performance, especially for regression tasks.
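The group-wise forward selection mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the variant used, so everything below is an assumption made for illustration: the k-nearest-neighbour learner, the plain cross-validated squared-error criterion (the paper tests for statistical significance instead), the `min_gain` stopping threshold, and the synthetic data with hypothetical group names `assay` and `vendor`. Each step tries every remaining feature group and keeps the one that lowers the cross-validated error the most.

```python
import random
from statistics import mean


def knn_predict(train_X, train_y, x, k=3):
    """Average the targets of the k training rows closest to x."""
    pairs = sorted(
        ((sum((a - b) ** 2 for a, b in zip(row, x)), t)
         for row, t in zip(train_X, train_y)),
        key=lambda p: p[0],
    )
    return mean(t for _, t in pairs[:k])


def cv_error(X, y, columns, folds=5, k=3):
    """Cross-validated mean squared error using only the given columns."""
    sub = [[row[c] for c in columns] for row in X]
    errors = []
    for f in range(folds):
        train = [i for i in range(len(X)) if i % folds != f]
        test = [i for i in range(len(X)) if i % folds == f]
        tX = [sub[i] for i in train]
        ty = [y[i] for i in train]
        errors.extend((knn_predict(tX, ty, sub[i], k) - y[i]) ** 2 for i in test)
    return mean(errors)


def forward_select_groups(X, y, base_columns, groups, min_gain=0.0):
    """Greedily add whole feature groups while the CV error keeps dropping.

    Assumption: a fixed gain threshold stands in for the significance
    test described in the paper.
    """
    columns = list(base_columns)
    best = cv_error(X, y, columns)
    remaining = dict(groups)
    selected = []
    while remaining:
        scores = {name: cv_error(X, y, columns + cols)
                  for name, cols in remaining.items()}
        name = min(scores, key=scores.get)
        if best - scores[name] <= min_gain:
            break  # no remaining group improves the model
        best = scores[name]
        columns += remaining.pop(name)
        selected.append(name)
    return selected, best


# Synthetic illustration (not data from the paper):
# column 0 = structural feature, column 1 = informative background feature,
# column 2 = uninformative background feature.
random.seed(0)
X = [[float(random.randint(0, 1)), random.random(), random.uniform(0, 10)]
     for _ in range(60)]
y = [row[1] + 0.1 * row[0] for row in X]

baseline = cv_error(X, y, [0])
selected, best = forward_select_groups(X, y, [0], {"assay": [1], "vendor": [2]})
print("baseline error:", baseline)
print("selected groups:", selected, "error:", best)
```

Selecting whole groups rather than single features keeps together features that come from the same source, which mirrors the grouping "from an application point of view" described above.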

Notes

1. It is interesting to note that data integration has not received attention in predictive toxicology before, although it would have been technically feasible [4].

2. This is done for simplicity. Note that (a) FreeTreeMiner works in an unsupervised fashion, and (b) we are interested only in the relative performance of the substructures alone versus the substructures plus the background knowledge. Substructures alone are used as a starting point here: the better the substructures, the harder it will be for the background knowledge to improve upon them.
