
Transformer models as predication machines

ABSTRACT

Predication is the process by which the meanings of words are altered as a consequence of the contexts in which they appear. Kintsch provides an algorithm to capture this process. The model is based on Latent Semantic Analysis and has the advantage that it relies only on the statistical analysis of word occurrence, and hence provides a coherent and practical account of how representational content might come to be. Similarly, large language models, such as the Bidirectional Encoder Representations from Transformers (BERT) model, update initially context-independent representations of words based on their context of use in a data-driven way. In a sense, BERT models are a series of predication layers. In this paper, predication, BERT, and the relationship between them are explained.

Introduction

My tenure as a research scientist under Walter Kintsch from 2002 to 2004 offered me the privilege of working with him on a daily basis and seeing how he went about the process of science. Walter’s expertise in text comprehension and discourse processes was profound, matched only by his ability to translate complex theories into computational models. His gentle demeanor and mentorship greatly enriched my professional journey, leaving a lasting impact on my approach to research.

Walter would be extremely excited about the developments in large language models that have taken place recently. His interest in Latent Semantic Analysis (LSA; Deerwester et al., 1990; Landauer et al., 2007) was driven by a conviction that simple models applied to large language data can provide compelling accounts of cognitive mechanisms. Before LSA, models of semantic memory employed representations that were handcrafted. These include the network models of Collins and Quillian (1969) and Collins and Loftus (1975), the feature models of Rips et al. (1973) and Smith et al. (1974), and the spatial models of Osgood (1952, 1971); see Jones et al. (2015) for an overview. While these models allowed a great deal of progress to be made, they were deficient in several related ways. First, they provided no account of how the representations came to be: the concepts, links, features, and dimensions were selected by the theorist, with no account of how an individual might acquire them. Second, the fact that the representations were handcrafted meant that modelers had a great deal of freedom in the choice of the structure relevant to a particular experimental data set. Because performance was very sensitive to the representations chosen, the modelers’ intuitions about these structures often did more of the work in fitting the data than the models did, rendering falsification difficult (Hummel & Holyoak, 2003). Third and finally, these models were difficult to scale. Applying them to new domains involved a painstaking attempt to codify all of the implicit knowledge held by a person, an arduous and contentious process that requires constant revision as language evolves (Kintsch, 2001).

While LSA has the advantage of deriving meaning from text input, thus circumventing the handcrafting issues, Walter realized that LSA, in itself, was incomplete because it provides a single representation of a word that is not sensitive to the context of use. As a consequence, he developed the predication algorithm.

Predication

The meaning of the word “runs” in the sentence “the horse runs” is different from the meaning of “runs” in “the paint runs.” In Walter’s computational implementation of predication (Kintsch, 2001; see also Kintsch & Mangalath, 2011), he outlined a mechanism by which the context-independent representations derived by models like LSA could be modified to capture the constraints of the immediate environment.

Let us step through the predication algorithm using “the horse runs” as an example (a minimal code sketch follows the list):

  1. Initial Vector Representation: In LSA, words and documents are represented as vectors in a high-dimensional semantic space. The predication algorithm starts with an initial context-independent meaning of the predicate word (e.g., “runs”) given by its vector in this space.

  2. Nearest Neighbors Selection: The algorithm identifies the “nearest neighbors” in the semantic space—words that are semantically close to the predicate. These neighbors reflect potential meanings or contexts the predicate could be associated with. For instance, some nearest neighbors of “runs” would be “gallop” and “sprint”.

  3. Relevance Determination: Among the nearest neighbors, the algorithm selects those most relevant to the argument (e.g., “horse”) by constructing a network with links weighted by the semantic closeness (cosine similarity) between each neighbor and the argument. Activation is then spread through this network, and relevance is determined by the final activations. In this example, “gallop” would have a high activation and would be selected.

  4. Vector Modification: The initial vector of the predicate is modified by adding a weighted sum of the vectors representing the selected nearest neighbors. This modification emphasizes features of the predicate that are relevant in the given context, effectively creating a new “sense” for the predicate. In the example, we would modify the vector representing “runs” by adding a proportion of the vector representing “gallop”, thus aligning the representation of “runs” more closely with the context provided by “horse”.
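
To make the mechanics concrete, here is a minimal sketch of the four steps in Python. It assumes a dictionary, lsa_vectors, mapping words to NumPy arrays (an illustrative stand-in for a trained LSA space), and it collapses the spreading-activation step into a single pass in which each neighbor’s activation is simply its cosine similarity to the argument, whereas Kintsch’s full algorithm spreads activation iteratively through the network.

```python
# A minimal sketch of the predication algorithm (Kintsch, 2001).
# `lsa_vectors`, `m`, and `k` are illustrative assumptions, not part of
# the original specification.
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def predicate(predicate_word, argument_word, lsa_vectors, m=20, k=3):
    """Contextualize `predicate_word` by `argument_word`.

    m: number of nearest neighbors of the predicate to consider.
    k: number of argument-relevant neighbors blended into the result.
    """
    p = lsa_vectors[predicate_word]   # Step 1: context-independent vector
    a = lsa_vectors[argument_word]

    # Step 2: the m nearest neighbors of the predicate in the space.
    neighbors = sorted(
        (w for w in lsa_vectors if w != predicate_word),
        key=lambda w: cosine(p, lsa_vectors[w]),
        reverse=True,
    )[:m]

    # Step 3 (simplified): rank neighbors by their relevance to the argument;
    # here a neighbor's "activation" is just its similarity to the argument.
    relevant = sorted(neighbors, key=lambda w: cosine(a, lsa_vectors[w]),
                      reverse=True)[:k]

    # Step 4: add a weighted sum of the relevant neighbors to the predicate.
    context = sum(cosine(a, lsa_vectors[w]) * lsa_vectors[w] for w in relevant)
    return p + context
```

Calling predicate("runs", "horse", lsa_vectors) would pull the vector for “runs” toward neighbors such as “gallop”, mirroring the four steps above.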

As we will see, there are striking similarities between this process and how current large language models such as BERT operate.

The Bidirectional Encoder Representations from Transformers (BERT) model and predication

BERT models, as part of the broader category of large language models (LLMs), have profoundly impacted various areas, notably in discourse analysis and language-based applications. These models have been effectively utilized in a variety of NLP tasks such as sentiment analysis, named entity recognition, and question answering (Devlin et al., 2018; Liu et al., 2019). Their versatility extends to more specialized applications including summarization, translation, and even health care text analysis, demonstrating the breadth of their utility across fields (Alsentzer et al., 2019; Jawahar et al., 2019).

As in the predication model, the BERT model assumes that representations should be learned from textual data without recourse to handcrafting. The model incorporates the following components:

Embedding layer

The process begins with the embedding layer, where each word (Note 1) from the input sequence is transformed into a high-dimensional vector. Just as in the predication algorithm, these initial embeddings represent the words in their raw form, devoid of contextual influence.
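
A toy illustration of such a lookup table follows; the three-word vocabulary and eight dimensions are arbitrary choices for the example, whereas BERT-base itself uses 768-dimensional embeddings over a vocabulary of roughly 30,000 tokens.

```python
# A toy embedding layer: a lookup table from vocabulary items to vectors.
# The vocabulary, dimensionality, and random initialization are illustrative.
import numpy as np

rng = np.random.default_rng(0)
vocab = {"the": 0, "horse": 1, "runs": 2}
embedding_matrix = rng.normal(size=(len(vocab), 8))  # |V| x d

def embed(words):
    # Context-independent: "runs" gets the same row regardless of context.
    return embedding_matrix[[vocab[w] for w in words]]

x = embed(["the", "horse", "runs"])  # shape (3, 8)
```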

Attention mechanisms

Central to BERT’s effectiveness is its use of self-attention, which is pivotal in defining the inter-word dynamics. Self-attention allows each word to “see” and “interact” with every other word in the sequence, irrespective of their positional distances. For a given word, the model calculates attention scores reflecting its affinity or relevance to every other word, enabling a dynamic weighting of influence. This process is akin to each word querying every other word, with the responses weighted by these calculated attention scores to update the word’s representation. This bidirectional exchange ensures that the representation of a word is not just a byproduct of its immediate neighbors but an amalgamation of the broader context, capturing nuances that span across the entire input sequence. As in predication, the output vector representing a word is a weighted sum of vectors—but the sum is of the surrounding word vectors directly rather than being mediated through the nearest semantic neighbors.
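
The computation described above can be written compactly. The following is a minimal sketch of single-head scaled dot-product self-attention in NumPy; the projection matrices are random here for illustration, whereas in BERT they are learned parameters, and the full model wraps this core in multiple heads, residual connections, and layer normalization.

```python
# A minimal sketch of single-head scaled dot-product self-attention.
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (sequence_length, d) matrix of word vectors."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v        # queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])    # every word scores every word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the sequence
    return weights @ v  # each output is a weighted sum of all word vectors

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(3, d))  # e.g., embeddings for "the horse runs"
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
contextualized = self_attention(x, w_q, w_k, w_v)  # shape (3, 8)
```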

Multilayer structure and word representation evolution

BERT’s architecture is characterized by a series of transformer encoder layers stacked sequentially. Each layer serves to refine and recontextualize the word representations inherited from the layer below. As words progress through the layers, their representations evolve to encapsulate more abstract and complex semantic information, reflecting the words’ roles and meanings within the full context of the sentence or passage. This hierarchical processing ensures that with each layer, word representations are enriched and influenced by the increasingly sophisticated context, allowing the model to capture intricate and deep semantic relationships within the text. This layered evolution of word representations is fundamental to BERT’s ability to comprehend and interpret complex linguistic structures. Predication as outlined in Kintsch and Mangalath (2011) contains a single layer, but it is straightforward to extend to multiple layers: one would simply take the final predicate representation and apply the process again as if it were the initial representation, as sketched below.
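
A sketch of that extension, reusing the illustrative predicate function from the earlier sketch; the function name deep_predicate and the layer count are assumptions made for the example.

```python
# A sketch of multilayer predication: each pass treats the previous output
# as if it were the predicate's initial, context-independent representation.
def deep_predicate(predicate_word, argument_word, lsa_vectors, n_layers=3):
    vectors = dict(lsa_vectors)  # copy so the original space is untouched
    for _ in range(n_layers):
        vectors[predicate_word] = predicate(predicate_word, argument_word, vectors)
    return vectors[predicate_word]
```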

Conclusion

Over the past three decades, the language processing field has shifted markedly: intricate, manually constructed models of syntax and semantics have been largely replaced by data-driven methods. The evolution of models has been driven by several factors: (a) the availability of larger text datasets, (b) advances in computational power, and (c) enhancements in algorithms. Walter was a leader in this push.

And the similarities between the predication algorithm and large language models, in particular the BERT architecture, are striking. I am not aware of any citations to Walter’s work in the machine learning literature, but it is possible that the concepts percolated indirectly. It is perhaps more likely that the similarities arose through a process of convergent evolution. They were, after all, trying to solve the same task: how does context influence the meaning of a word? The convergence of the solutions might suggest that we are reaching a deeper understanding.

Disclosure statement

No potential conflict of interest was reported by the author.

Notes

1. BERT actually creates representations for tokens, which may be words but can also include subword fragments and punctuation. I have used the term word here to make the exposition more accessible.

References

  • Alsentzer, E., Murphy, J. R., Boag, W., Weng, W.-H., Jin, D., Naumann, T., & McDermott, M. (2019). Publicly available clinical BERT embeddings. arXiv. https://arxiv.org/abs/1904.03323
  • Collins, A. M., & Loftus, E. F. (1975). A spreading-activation theory of semantic processing. Psychological Review, 82(6), 407–428. https://doi.org/10.1037/0033-295X.82.6.407
  • Collins, A. M., & Quillian, M. R. (1969). Retrieval time from semantic memory. Journal of Verbal Learning & Verbal Behavior, 8(2), 240–247. https://doi.org/10.1016/S0022-5371(69)80069-1
  • Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407. https://doi.org/10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9
  • Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv. https://arxiv.org/abs/1810.04805
  • Hummel, J. E., & Holyoak, K. J. (2003). A symbolic-connectionist theory of relational inference and generalization. Psychological Review, 110(2), 220–264. https://doi.org/10.1037/0033-295X.110.2.220
  • Jawahar, G., Sagot, B., & Seddah, D. (2019). What does BERT learn about the structure of language? Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. https://www.aclweb.org/anthology/P19-1356
  • Jones, M. N., Willits, J., & Dennis, S. (2015). Models of semantic memory. In Oxford handbook of computational and mathematical psychology (pp. 232–254). Oxford University Press.
  • Kintsch, W. (2001). Predication. Cognitive Science, 25(2), 173–202. https://doi.org/10.1207/s15516709cog2502_1
  • Kintsch, W., & Mangalath, P. (2011). The construction of meaning. Topics in Cognitive Science, 3(2), 346–370. https://doi.org/10.1111/j.1756-8765.2010.01107.x
  • Landauer, T. K., McNamara, D. S., Dennis, S., & Kintsch, W. (Eds.). (2007). Handbook of latent semantic analysis. Erlbaum.
  • Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv. https://arxiv.org/abs/1907.11692
  • Osgood, C. E. (1952). The nature and measurement of meaning. Psychological Bulletin, 49(3), 197–237.
  • Osgood, C. E. (1971). Exploration in semantic space: A personal diary. Journal of Social Issues, 27(4), 5–62. https://doi.org/10.1111/j.1540-4560.1971.tb00678.x
  • Rips, L. J., Shoben, E. J., & Smith, E. E. (1973). Semantic distance and the verification of semantic relations. Journal of Verbal Learning & Verbal Behavior, 12(1), 1–20. https://doi.org/10.1016/S0022-5371(73)80056-8
  • Smith, E. E., Shoben, E. J., & Rips, L. J. (1974). Structure and process in semantic memory: A featural model for semantic decisions. Psychological Review, 81(3), 214–241. https://doi.org/10.1037/h0036351
  • Sun, C., Qiu, X., Xu, Y., & Huang, X. (2019). How to fine-tune BERT for text classification? arXiv. https://arxiv.org/abs/1905.05583