Abstract
The problem of high imbalance in data in the binary classification task of determining whether a syntactic construction (environment) co-occurring with a verb in a natural text corpus consists of a subcategorization frame of the verb or not is the central focus of the present paper. Each environment is encoded as a vector of heterogeneous attributes, where a very high imbalance between positive and negative examples is observed (an imbalance ratio of approximately 1:80). In order to cope with the plethora of negative examples, we propose a search tactic during training that employs Tomek links for eliminating unnecessary negative examples from the training set. As for a classification mechanism, we argue that Bayesian networks are well suited and we propose a novel network structure which efficiently handles heterogeneous attributes without discretization and is more classification-oriented. Comparing the experimental results with those of other known machine learning algorithms, our methodology performs significantly better in detecting instances of the rare positive class.