Research Article

A Word Embedding Model for Analyzing Patterns and Their Distributional Semantics

Pages 80-105 | Published online: 07 Jun 2020
 

ABSTRACT

Recent advances in natural language processing have catalysed active research in the machine learning and computational linguistics communities on algorithms that generate contextual vector representations of words, or word embeddings. Existing work pays little attention to patterns of words, which encode rich semantic information and impose semantic constraints on a word’s context. This paper explores the feasibility of combining word embeddings with pattern grammar, a grammar model that describes the syntactic environment of lexical items. Specifically, it develops a method to extract patterns with the semantic information of word embeddings and investigates the statistical regularities and distributional semantics of the extracted patterns. The major results are as follows. Experiments on the LCMC Chinese corpus reveal that pattern frequency follows Zipf’s hypothesis and that pattern frequency and pattern length are inversely related. The proposed method therefore enables the study of the distributional properties of patterns in large-scale corpora. Furthermore, the experiments illustrate that the extracted patterns impose semantic constraints on context, indicating that patterns encode rich semantic and contextual information. These findings shed light on the potential applications of pattern-based word embeddings in a wide range of natural language processing tasks.
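The two distributional claims in the abstract (a Zipf-like rank-frequency distribution, and an inverse relation between pattern frequency and pattern length) can be checked on any table of pattern counts. The following is a minimal sketch, not the authors' procedure: it assumes patterns are represented as tuples of POS tags, and the function names and toy counts are invented for illustration. A least-squares slope close to -1 on the log-log rank-frequency plot is the usual sign of Zipf-like behaviour.

```python
import math
from collections import Counter


def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    Needs at least two positive frequencies.
    """
    ranked = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(freq) for freq in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def mean_frequency_by_length(pattern_counts):
    """Average frequency of patterns, grouped by pattern length in tags."""
    totals, sizes = Counter(), Counter()
    for pattern, count in pattern_counts.items():
        totals[len(pattern)] += count
        sizes[len(pattern)] += 1
    return {length: totals[length] / sizes[length] for length in sorted(totals)}


# Hypothetical counts for illustration only; these are not results from the paper.
toy_counts = {
    ("v", "n"): 120,
    ("a", "n"): 61,
    ("d", "v"): 39,
    ("v", "u", "n"): 14,
    ("d", "v", "u", "n"): 5,
}

print("log-log slope:", zipf_slope(toy_counts.values()))
print("mean frequency by length:", mean_frequency_by_length(toy_counts))
```

On real data, the slope would normally be inspected alongside the rank-frequency plot itself, since a single fitted slope can hide deviations at the head and tail of the distribution.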

Disclosure statement

No potential conflict of interest was reported by the author(s).

Notes

1. Biber (2009, p. 279) pointed out, ‘However, the pattern grammar studies are corpus-based because the analyses are in part determined by pre-defined linguistic categories (including basic grammatical categories like “noun” and “verb”, phrase types, and even syntactic structures).’

2. This paper frequently deals with the part-of-speech (POS) tags of Chinese texts. It follows the ICTCLAS POS tagging notation proposed by the Chinese Academy of Sciences, which can be found at http://ictclas.nlpir.org/nlpir/html/readme.htm.

3. The statistics were retrieved by counting all item tags with words in the corpus, excluding punctuation, spaces, etc.

Additional information

Funding

Project supported by the National Social Science Fund of China (No. 17BYY002).
