Knowledge Discovery in Biology and Biotechnology Texts: A Review of Techniques, Evaluation Strategies, and Applications

J. Natarajan, University of Ulster, School of Biomedical Sciences, Bioinformatics Research Group, Coleraine, Northern Ireland

D. Berrar, University of Ulster, School of Biomedical Sciences, Bioinformatics Research Group, Coleraine, Northern Ireland

C. J. Hack, University of Ulster, School of Biomedical Sciences, Bioinformatics Research Group, Coleraine, Northern Ireland

W. Dubitzky, University of Ulster, School of Biomedical Sciences, Bioinformatics Research Group, Coleraine, Northern Ireland

Abstract

Arguably, the richest knowledge (as opposed to fact and data collections) about biology and biotechnology is captured in natural-language documents such as technical reports, conference proceedings and research articles. The automatic exploitation of this knowledge base for decision making, hypothesis management (generation and testing) and knowledge discovery constitutes a formidable challenge. Recently, a set of technologies collectively referred to as knowledge discovery in text (KDT) has been advocated as a promising approach to this challenge. KDT comprises three main tasks: information retrieval, information extraction and text mining. These tasks are the focus of much recent scientific research, and many algorithms have been developed and applied to documents and text in biology and biotechnology. This article introduces the basic concepts of KDT, provides an overview of some of these efforts in the field of bioscience and biotechnology, and presents a framework of commonly used techniques for evaluating KDT methods, tools and systems.

Notes

¹A confusion matrix (sometimes referred to as a contingency table) is used to record and analyze the relationship between two or more variables. Usually, the variables are categorical. In the IR case the variables involved are relevancy (‘Doc d_k Relevant?’) and retrieval (‘Doc d_k Retrieved?’), each associated with the value set {Yes, No}. The cells of the matrix record how often each combination of values co-occurs. When the matrix describes an individual document, as it does here, the frequency for a particular value combination can only be zero or one.
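The idea in this note can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `document_matrix` and `collection_matrix`, the document identifiers and the example sets are all hypothetical. It shows that the matrix for a single document holds exactly one 1, and that summing such matrices over a collection gives the usual frequency counts.

```python
from collections import Counter

VALUES = ("Yes", "No")  # value set shared by both variables


def document_matrix(is_relevant, is_retrieved):
    """2x2 matrix for one document, keyed by (relevant?, retrieved?).

    Exactly one cell is 1 and the other three are 0.
    """
    row = "Yes" if is_relevant else "No"
    col = "Yes" if is_retrieved else "No"
    return {(r, c): int((r, c) == (row, col)) for r in VALUES for c in VALUES}


def collection_matrix(doc_ids, relevant, retrieved):
    """Sum the per-document matrices over a whole collection."""
    total = Counter()
    for doc_id in doc_ids:
        total.update(document_matrix(doc_id in relevant, doc_id in retrieved))
    return dict(total)


# Hypothetical five-document collection (illustrative data only).
doc_ids = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d2", "d3"}
retrieved = {"d2", "d3", "d4"}

single = document_matrix(is_relevant=True, is_retrieved=False)
overall = collection_matrix(doc_ids, relevant, retrieved)
print(single)
print(overall)
```

Keying each cell by the pair (relevant?, retrieved?) keeps the correspondence with the two variables in the note explicit; a nested list or array would work equally well.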
