Research Article

Knowledge Discovery in Biology and Biotechnology Texts: A Review of Techniques, Evaluation Strategies, and Applications

Pages 31-52 | Published online: 10 Oct 2008

REFERENCES

  • Ram A., Moorman K. Understanding Language Understanding. MIT Press, Cambridge, MA 1999
  • Baeza-Yates R., Ribeiro-Neto B. Modern Information Retrieval. Addison-Wesley, Harlow, UK 1999
  • National Library of Medicine's Medline bibliographic database at http://www.ncbi.nlm.nih.gov
  • National Library of Medicine's PubMed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi
  • Bremer E. G., Natarajan J., Zhang Y., DeSesa C., Hack C. J., Dubitzky W. Text mining of full text articles and creation of a knowledge base for analysis of microarray data. Proc. Intl. Symp. Knowledge Exploration in Life Sciences Informatics. Milan, Italy 2004; 84–95
  • Shah P. K., Perez-Iratxeta C., Bork P., Andrade M. A. Information extraction from full text scientific articles: Where are the keywords? BMC Bioinformatics 2003; 4: 20
  • PubMedCentral at http://www.pubmedcentral.gov
  • BioMedCentral (BMC) at http://www.biomedcentral.com
  • Public Library of Science (PLoS) at http://www.publiclibraryofscience.org/
  • Hakenberg J., Schmeier S., Kowald A., Klipp E., Leser U. Finding Kinetic Parameters Using Text Mining. In Dubitzky W. (guest ed.), Special Issue on Data Mining meets Integrative Biology. OMICS: A Journal of Integrative Biology 2004; 8(2): 131–152
  • Cowie J., Lehnert W. Information extraction. Communications of the ACM 1996; 39: 80–91
  • Hearst M. A. Untangling text data mining. Proc. of ACL 1999; 37
  • Salton G., Buckley C. Term weighting approaches in automatic information retrieval. Inf. Proc. Man. 1988; 24: 513–523
  • Wilbur W. J., Yang Y. An analysis of statistical term strength and its use in the indexing and retrieval of molecular biology texts. Comput. Biol. Med. 1996; 26: 209
  • Wilbur W. J. A thematic analysis of the AIDS literature. Pacific Symposium on Biocomputing. 2002, 386–397
  • Perez-Iratxeta C., Bork P., Andrade M. A. XplorMed: A tool for exploring MEDLINE abstracts. Trends Biochem. Sci. 2001; 26: 573–575
  • Medical Subject Headings (MeSH) at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=mesh
  • Deerwester S., Dumais S. T., Furnas G. W., Landauer T. K., Harshman R. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science 1990; 41(6): 391–407
  • Golub G. H., Van Loan C. F. Matrix Computations. Johns Hopkins University Press. 1993
  • Message Understanding Conference. Proc. of the Seventh Message Understanding Conference (MUC-7). Morgan Kaufmann, 1998
  • Jackson P., Moulinier I. Natural Language Processing for Online Applications: Text Retrieval, Extraction, Categorization. John Benjamins Pub. Co. 2002
  • Jurafsky D., Martin J. H. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, Speech Recognition. Prentice Hall, New Jersey 2000
  • Manning C. D., Schuetze H. Foundations of Statistical Natural Language Processing. MIT Press. 1999
  • Allen J. Natural Language Understanding. The Benjamin/Cummings Publishing Company, Inc. New York 1995
  • Cohen K. B., Hunter L. Natural language processing and systems biology. Artificial Intelligence Methods and Tools for Systems Biology, W. Dubitzky, F. Azuaje. Kluwer Academic Publishers, Boston/Dordrecht/London 2004; 147–174
  • Fukuda K., Tsunoda T., Tamura A., Takagi T. Towards information extraction: identifying protein names from biological papers. Pacific Symposium on Biocomputing. 1998, 707–718
  • Eriksson G., Franzen K., Olsson F. Exploiting syntax when detecting protein names in text. Workshop on Natural Language Processing in Biomedical Applications. 2002, http://www.sics.se/humle/projects/prothalt/
  • Narayanaswamy M., Ravikumar K. E., Vijay-Shankar K. A biological named entity recognizer. Pacific Symposium on Biocomputing 2003; 8: 427–438
  • Ono T., Hishigaki H., Tanigami A., Takagi T. Automated extraction of information on protein-protein interactions from the biological literature. Bioinformatics 2001; 17: 155–161
  • Krauthammer M., Rzhetsky A., Morozov P., Friedman C. Using BLAST for identifying gene and protein names in journal articles. Gene 2000; 259: 245–252
  • Hanisch D., Fluck J., Mevissen D. T., Zimmer R. Playing biology's name game: Identifying protein names in scientific text. Pacific Symposium on Biocomputing 2003; 8: 403–414
  • Hatzivassiloglou V., Duboue P. A., Rzhetsky A. Disambiguating proteins, genes, and RNA in text: A machine learning approach. Proc. of the 9th International Conference on Intelligent Systems for Molecular Biology. 2001, 97–106
  • Wilbur W., Hazard G. F., Jr., Divita G., Mork J. G., Aronson A. R., Browne A. C. Analysis of biomedical text for biochemical names: A comparison of three methods. Proc. of AMIA Symposium. 1999, 176–180
  • Collier N., Nobata C., Tsujii J. Extraction of name of genes and gene products with a Hidden Markov Model. COLING Conference Proceedings. 2000, 201–207
  • Kazama J., Makino T., Ohta Y., Tsujii J. Tuning support vector machines for biomedical named entity recognition. Proc. of the Natural Language Processing in the Biomedical Domain. Philadelphia, PA, USA 2002
  • Zhou G., Zhang J., Su J., Shen D., Tan C. Recognizing names in biomedical texts: A machine learning approach. Bioinformatics 2004; 20: 1178–1190
  • McDonald R. T., Winters R. S., Mandel M., Jin Y., White P. S., Pereira F. An entity tagger for recognizing acquired genomic variations in cancer literature. Bioinformatics 2004; 20: 3249–3251
  • HUGO, the Human Genome Organization, at http://www.gene.ucl.ac.uk/hugo/
  • GENIA corpus at http://www-tsujii.is.s.u-tokyo.ac.jp/~genia/topics/Corpus/
  • BioCreative Task 1A corpus at http://www.mitre.org/public/biocreative/
  • Yapex corpus
  • BioCreative shared task at http://www.pdg.cnb.uam.es/BioLINK/workshop_BioCreative_04
  • Ng S-K., Wong M. Towards routine automatic pathway discovery from on-line scientific text abstracts. Proc. of the workshop on Genome Informatics 1999; 10: 104–112
  • Wong L. A protein interaction extraction system. Pacific Symposium on Biocomputing 2001; 6: 520–531
  • Park J. C., Kim H. S., Kim J. J. Bi-directional incremental parsing for automatic pathway identification with combinatory categorial grammar. Pacific Symposium on Biocomputing 2001; 6: 396–407
  • Yakushiji A., Tateisi Y., Miyao Y., Tsujii J. Event extraction from biomedical papers using a full parser. Pacific Symposium on Biocomputing 2001; 6: 408–419
  • Pustejovsky J., Castano J., Zhang J., Kotecki M., Cochran B. Robust relational parsing over biomedical literature: Extracting inhibits relations. Pacific Symposium on Biocomputing 2002; 7: 362–373
  • Leroy G., Chen H. Filling preposition-based templates to capture information from medical abstracts. Pacific Symposium on Biocomputing 2002; 7: 350–361
  • Thomas J., Milward D., Ouzounis C. Automatic extraction of protein interactions from scientific abstracts. Pacific Symposium on Biocomputing 2000; 5: 384–395
  • Friedman C., Kra P., Yu H., Krauthammer M., Rzhetsky A. GENIES: A natural language processing system for extraction of molecular pathways from journal articles. Bioinformatics Suppl. 2001; 1: 74–82
  • Huang M., Zhu X., Hao Y., Payan D. G., Qu K., Li M. Discovering patterns to extract protein-protein interactions from full texts. Bioinformatics 2004; 20(18): 3604–3612
  • Sekimizu T., Park H. S., Tsujii J. Identifying the interaction between genes and gene products based on frequently seen verbs in MEDLINE abstracts. Proc. of the workshop on Genome Informatics. 1998, 62–71
  • Humphreys K., Demetriou G., Gaizauskas R. Two applications of information extraction to biological science journal articles: Enzyme interactions and protein structure. Pacific Symposium on Biocomputing 2000; 5: 502–513
  • Ding J., Berleant D., Nettleton D., Wurtele E. Mining MEDLINE: Abstracts, sentences or phrases? Pacific Symposium on Biocomputing 2002; 7: 326–337
  • Berleant D., Ding J., Fulmer A. W. Corpus properties of protein interaction descriptions in Medline, at http://class.ee.iastate.edu/berleant/
  • Rosario B., Hearst M. A. Classifying semantic relations in bioscience texts, available at http://biotext.berkeley.edu/
  • Gaizauskas R., Demetriou G., Artymiuk P. J., Willett P. Protein structure and information extraction from biological texts: The PASTA system. Bioinformatics 2003; 19(1): 135–143
  • Brusic V., Zeleznikow J. Knowledge discovery and data mining in biological databases. The Knowledge Engineering Review 1999; 14(3): 257–277
  • Shearer C. The CRISP-DM Model: The new blueprint for data mining. Journal of Data Warehousing 2000; 5(4): 13–22
  • Sebastiani F. Machine learning in automated text categorization. ACM Computing Surveys 2002; 34(1): 1–47
  • Craven M., Kumlien J. Constructing biological knowledge bases by extracting information from text sources. Proc. of the 7th International Conference on Intelligent Systems for Molecular Biology. 1999; 77–86
  • Stapley B. J., Kelley L. A., Sternberg M. J. E. Predicting the sub-cellular location of proteins from text using support vector machines. Pacific Symposium on Biocomputing 2002; 7: 374–385
  • Raychaudhuri S., Chang J. T., Sutphin P. D., Altman R. B. Associating genes with gene ontology codes using maximum entropy analysis of biomedical literature. Genome Research 2002; 12: 203–214
  • KDD Cup. Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2002, http://www.biostat.wisc.edu/~craven/kddcup/index.html
  • Willett P. Recent trends in hierarchic document clustering: A critical review. Information Processing and Management 1988; 24(5): 577–597
  • Jain A. K., Murty M. N., Flynn P. J. Data clustering: A review. ACM Computing Surveys 1999; 31: 264–323
  • Zhao Y., Karypis G. Criterion functions for document clustering. University of Minnesota, Minnesota 2000, TR# 01-40
  • Vaithyanathan S., Dom B. Model selection in unsupervised learning with applications to document clustering. ICML-99 1999
  • Steinbach M., Karypis G., Kumar V. A comparison of document clustering techniques. Text Mining Workshop, KDD-2000. 2000
  • Iliopoulos I., Enright A. J., Ouzounis C. TEXTQUEST: Document clustering of MEDLINE abstracts for concept discovery in molecular biology. Pacific Symposium on Biocomputing 2001; 374–383
  • Andrade M. A., Valencia A. Automatic extraction of keywords from scientific text: Application to the knowledge domain of protein families. Bioinformatics 1998; 14: 600–607
  • Swanson D. R. Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspectives in Biology and Medicine 1986; 30: 7–18
  • Swanson D. R., Smalheiser N. R. An interactive system for finding complementary literatures: A stimulus to scientific discovery. Artificial Intelligence 1997; 91(1): 183–203
  • Tanabe L., Scherf U., Smith L. H. MedMiner: An Internet text-mining tool for biomedical information with application to gene expression profiling. Biotechniques 1999; 27: 1210–1217
  • Jenssen T. K., Laegreid A., Komorowski J., Hovig E. A literature network of human genes for high-throughput analysis of gene expression. Nature Genetics 2001; 28: 21–28
  • Stapley B. J., Benoit G. Biobibliometrics: Information retrieval and visualization from co-occurrences of gene names in medical abstracts. Pacific Symposium on Biocomputing 2000; 5: 529–540
  • Rzhetsky A., Iossifov I., Koike T., Krauthammer M., Kra P., Morris M., Yu H., Duboue P. A., Weng W., Wilbur W. J., Hatzivassiloglou V., Friedman C. GeneWays: A system for extracting, analyzing, visualizing, and integrating molecular pathway data. Journal of Biomedical Informatics 2004; 37: 43–53
  • Hahn U., Romacker M., Schulz S. Creating knowledge repositories from biomedical reports: The MEDSYNDIKATE text mining system. Pacific Symposium on Biocomputing 2002; 7: 338–349
  • Smalheiser N. R., Swanson D. R. Using ARROWSMITH: A computer assisted approach to formulating and assessing scientific hypotheses. Computer Methods and Programs in Biomedicine 1998; 57: 149–153
  • GeneCards, online database at http://bioinfo.weizmann.ac.il/cards
  • Wolpert D., Macready W. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation 1997; 1(1): 67–82
  • Dietterich T. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation 1998; 10(7): 1895–1924
  • Schena M., Shalon D., Davis R. W., Brown P. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 1995; 270: 467–470
  • DeRisi J., Iyer V., Brown P. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 1997; 278: 680–686
  • OMIM, Online Mendelian Inheritance in Man at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
  • Blaschke C., Valencia A. Automatic Ontology Construction from the Literature. Genome Informatics Series 2002; 13: 201–213
  • Ideker T., Galitski T., Hood L. A new approach to decoding life: Systems biology. Annu Rev Genomics Hum Genet 2001; 2: 343–372
  • Sabatti C. Statistical issues in Microarray Analysis. Current Genomics 2003
  • A Practical Approach to Microarray Data Analysis, D. Berrar, W. Dubitzky, M. Granzow. Kluwer Academic Publishers, Boston/Dordrecht/London 2002
  • Raychaudhuri S., Schuetze H., Altman R. B. Using text analysis to identify functionally coherent gene groups. Genome Research 2002; 12: 1582–1590
  • Relational Data Mining, S. Dzeroski, N. Lavrac. Springer, Berlin 2001
  • Feldman R. Link analysis: current state of the art. 2002, tutorial at the KDD-02
  • Humphreys B. L., Lindberg D. A., Schoolman H. M., Barnett G. O. The unified medical language system: An informatics research collaboration. J. Amer. Med. Inform. Assoc. 1998; 5: 1–11
  • Gene Ontology
  • Srinivasan P., Rindflesch T. Exploring Text Mining from Medline. Proc. Annual Conference of the American Medical Informatics Association (AMIA 2002). 2002; 722–726
  • Srinivasan P. MeSHmap: A text mining tool for Medline. Proc. of the Annual Conference of the American Medical Informatics Association (AMIA 2001). 2001
