Original Articles

Bayesian Induction of Verb Sub-categorization Frames in Imbalanced Heterogeneous Data

Pages 185-211 | Published online: 16 Feb 2007
 

Abstract

This paper addresses the severe class imbalance that arises in a binary classification task: deciding whether a syntactic construction (environment) co-occurring with a verb in a natural text corpus constitutes a subcategorization frame of that verb. Each environment is encoded as a vector of heterogeneous attributes, and positive examples are outnumbered by negative ones at a ratio of approximately 1:80. To cope with the abundance of negative examples, we propose a search tactic during training that uses Tomek links to eliminate unnecessary negative examples from the training set. For classification, we argue that Bayesian networks are well suited, and we propose a novel network structure that handles heterogeneous attributes efficiently without discretization and is more classification-oriented. In experimental comparison with other known machine learning algorithms, our methodology performs significantly better at detecting instances of the rare positive class.
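To illustrate the idea of pruning negatives with Tomek links, the following is a minimal sketch and not the paper's actual search tactic, which the abstract does not detail. A Tomek link is a pair of examples from opposite classes that are each other's nearest neighbour; here the negative member of each such pair is dropped. The function names, the label encoding (0 for negative, 1 for positive), and the use of Euclidean distance are all assumptions made for illustration; heterogeneous attributes would in practice call for a distance measure suited to mixed types.

```python
from math import dist  # Euclidean distance (Python 3.8+); an assumed metric


def nearest_neighbor(points, i):
    """Index of the point closest to points[i], excluding i itself."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: dist(points[i], points[j]),
    )


def tomek_link_negatives(points, labels, negative=0):
    """Indices of negative examples that form a Tomek link with a positive one.

    Two examples form a Tomek link when they belong to different classes
    and each is the other's nearest neighbour.
    """
    nn = [nearest_neighbor(points, i) for i in range(len(points))]
    return [
        i
        for i, j in enumerate(nn)
        if nn[j] == i and labels[i] != labels[j] and labels[i] == negative
    ]


def drop_tomek_negatives(points, labels, negative=0):
    """Return the training set with Tomek-linked negative examples removed."""
    drop = set(tomek_link_negatives(points, labels, negative))
    keep = [i for i in range(len(points)) if i not in drop]
    return [points[i] for i in keep], [labels[i] for i in keep]


# Hypothetical toy data: one positive example and four negatives.
toy_points = [(0.0,), (0.1,), (1.0,), (1.2,), (5.0,)]
toy_labels = [1, 0, 0, 0, 0]
pruned_points, pruned_labels = drop_tomek_negatives(toy_points, toy_labels)
```

In the toy data, only the negative example sitting right next to the positive one is removed; negatives whose nearest neighbour is another negative are left alone. The sketch makes a single pass and computes all pairwise distances, which is adequate for illustration but would need a neighbour index for a large corpus.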
