
Transformer models as predication machines

ABSTRACT

Predication is the process by which the meanings of words are altered as a consequence of the contexts in which they appear. Kintsch provides an algorithm to capture this process. The model is based on Latent Semantic Analysis and has the advantage that it relies only on the statistical analysis of word occurrence, and hence provides a coherent and practical account of how representational content might come to be. Similarly, large language models, such as the Bidirectional Encoder Representations from Transformers (BERT) model, update initially context-independent representations of words based on their context of use in a data-driven way. In a sense, BERT models are a series of predication layers. In this paper, predication, BERT, and the relationship between them are explained.

Introduction

My tenure as a research scientist under Walter Kintsch from 2002 to 2004 offered me the privilege of working with him on a daily basis and seeing how he went about the process of science. Walter’s expertise in text comprehension and discourse processes was profound, matched only by his ability to translate complex theories into computational models. His gentle demeanor and mentorship greatly enriched my professional journey, leaving a lasting impact on my approach to research.

Walter would be extremely excited about the developments in large language models that have taken place recently. His interest in Latent Semantic Analysis (LSA; Deerwester et al., 1990; Landauer et al., 2007) was driven by a conviction that simple models applied to large language data can provide compelling accounts of cognitive mechanisms. Before LSA, models of semantic memory employed representations that were handcrafted. These include the network models of Collins and Quillian (1969) and Collins and Loftus (1975), the feature models of Rips et al. (1973) and Smith et al. (1974), and the spatial models of Osgood (1952, 1971); see Jones et al. (2015) for an overview. While these models allowed a great deal of progress to be made, they were deficient in several related ways. First, they provided no account of how the representations came to be: the concepts, links, features, and dimensions were selected by the theorist, with no account of how an individual might acquire them. Second, the fact that the representations were handcrafted meant that modelers had a great deal of freedom in the choice of the structure relevant to a particular experimental data set. Because performance was very sensitive to the representations chosen, the modelers’ intuitions about these structures often did more of the work in fitting the data than the models did, rendering falsification difficult (Hummel & Holyoak, 2003). Third and finally, these models were difficult to scale. Applying them to new domains involved a painstaking attempt to codify all of the implicit knowledge held by a person, an arduous and contentious process that requires constant revision as language evolves (Kintsch, 2001).

While LSA has the advantage of deriving meaning from text input, thus circumventing the handcrafting issues, Walter realized that LSA, in itself, was incomplete because it provides a single representation of a word that is not sensitive to the context of use. As a consequence, he developed the predication algorithm.

Predication

The meaning of the word “runs” in the sentence “the horse runs” is different from the meaning of “runs” in “the paint runs.” In Walter’s computational implementation of predication (Kintsch, 2001; see also Kintsch & Mangalath, 2011), he outlined a mechanism by which the context-independent representations derived by models like LSA could be modified to capture the constraints of the immediate environment.

Let us step through the predication algorithm using “the horse runs” as an example (a minimal code sketch follows the list):

  1. Initial Vector Representation: In LSA, words and documents are represented as vectors in a high-dimensional semantic space. The predication algorithm starts with an initial context-independent meaning of the predicate word (e.g., “runs”) given by its vector in this space.

  2. Nearest Neighbors Selection: The algorithm identifies the “nearest neighbors” in the semantic space—words that are semantically close to the predicate. These neighbors reflect potential meanings or contexts the predicate could be associated with. For instance, some nearest neighbors of “runs” would be “gallop” and “sprint”.

  3. Relevance Determination: Among the nearest neighbors, the algorithm selects those most relevant to the argument (e.g., “horse”) by constructing a network with links weighted by the semantic closeness (cosine similarity) between each neighbor and the argument. Activation is then spread through this network, and relevance is determined by the final activations. In this example, “gallop” would have a high activation and would be selected.

  4. Vector Modification: The initial vector of the predicate is modified by adding a weighted sum of the vectors representing the selected nearest neighbors. This modification emphasizes features of the predicate that are relevant in the given context, effectively creating a new “sense” for the predicate. In the example, we would modify the vector representing “runs” by adding a proportion of the vector representing “gallop”, thus aligning the representation of “runs” more closely with the context provided by “horse”.
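
To make the mechanics concrete, here is a minimal sketch of the four steps in Python. It assumes a dictionary, lsa_vectors, mapping words to NumPy arrays (an illustrative stand-in for a trained LSA space), and it collapses the spreading-activation step into a single pass in which each neighbor’s activation is simply its cosine similarity to the argument, whereas Kintsch’s full algorithm spreads activation iteratively through the network.

```python
# A minimal sketch of the predication algorithm (Kintsch, 2001).
# `lsa_vectors`, `m`, and `k` are illustrative assumptions, not part of
# the original specification.
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def predicate(predicate_word, argument_word, lsa_vectors, m=20, k=3):
    """Contextualize `predicate_word` by `argument_word`.

    m: number of nearest neighbors of the predicate to consider.
    k: number of argument-relevant neighbors blended into the result.
    """
    p = lsa_vectors[predicate_word]   # Step 1: context-independent vector
    a = lsa_vectors[argument_word]

    # Step 2: the m nearest neighbors of the predicate in the space.
    neighbors = sorted(
        (w for w in lsa_vectors if w != predicate_word),
        key=lambda w: cosine(p, lsa_vectors[w]),
        reverse=True,
    )[:m]

    # Step 3 (simplified): rank neighbors by their relevance to the argument;
    # here a neighbor's "activation" is just its similarity to the argument.
    relevant = sorted(neighbors, key=lambda w: cosine(a, lsa_vectors[w]),
                      reverse=True)[:k]

    # Step 4: add a weighted sum of the relevant neighbors to the predicate.
    context = sum(cosine(a, lsa_vectors[w]) * lsa_vectors[w] for w in relevant)
    return p + context
```

Calling predicate("runs", "horse", lsa_vectors) would pull the vector for “runs” toward neighbors such as “gallop”, mirroring the four steps above.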

As we will see, there are striking similarities between this process and how current large language models such as BERT operate.

The Bidirectional Encoder Representations from Transformers (BERT) model and predication

BERT models, as part of the broader category of large language models (LLMs), have profoundly impacted various areas, notably in discourse analysis and language-based applications. These models have been effectively utilized in a variety of NLP tasks such as sentiment analysis, named entity recognition, and question answering (Devlin et al., 2018; Liu et al., 2019). Their versatility extends to more specialized applications including summarization, translation, and even health care text analysis, demonstrating the breadth of their utility across fields (Alsentzer et al., 2019; Jawahar et al., 2019).

As in the predication model, the BERT model assumes that representations should be learned from textual data without recourse to handcrafting. The model incorporates the following components:

Embedding layer

The process begins with the embedding layer, where each word (Note 1) from the input sequence is transformed into a high-dimensional vector. Just as in the predication algorithm, these initial embeddings represent the words in their raw form, devoid of contextual influence.
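
A toy illustration of such a lookup table follows; the three-word vocabulary and eight dimensions are arbitrary choices for the example, whereas BERT-base itself uses 768-dimensional embeddings over a vocabulary of roughly 30,000 tokens.

```python
# A toy embedding layer: a lookup table from vocabulary items to vectors.
# The vocabulary, dimensionality, and random initialization are illustrative.
import numpy as np

rng = np.random.default_rng(0)
vocab = {"the": 0, "horse": 1, "runs": 2}
embedding_matrix = rng.normal(size=(len(vocab), 8))  # |V| x d

def embed(words):
    # Context-independent: "runs" gets the same row regardless of context.
    return embedding_matrix[[vocab[w] for w in words]]

x = embed(["the", "horse", "runs"])  # shape (3, 8)
```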

Attention mechanisms

Central to BERT’s effectiveness is its use of self-attention, which is pivotal in defining the inter-word dynamics. Self-attention allows each word to “see” and “interact” with every other word in the sequence, irrespective of their positional distances. For a given word, the model calculates attention scores reflecting its affinity or relevance to every other word, enabling a dynamic weighting of influence. This process is akin to each word querying every other word, with the responses weighted by these calculated attention scores to update the word’s representation. This bidirectional exchange ensures that the representation of a word is not just a byproduct of its immediate neighbors but an amalgamation of the broader context, capturing nuances that span across the entire input sequence. As in predication, the output vector representing a word is a weighted sum of vectors—but the sum is of the surrounding word vectors directly rather than being mediated through the nearest semantic neighbors.
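
The computation described above can be written compactly. The following is a minimal sketch of single-head scaled dot-product self-attention in NumPy; the projection matrices are random here for illustration, whereas in BERT they are learned parameters, and the full model wraps this core in multiple heads, residual connections, and layer normalization.

```python
# A minimal sketch of single-head scaled dot-product self-attention.
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (sequence_length, d) matrix of word vectors."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v        # queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])    # every word scores every word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the sequence
    return weights @ v  # each output is a weighted sum of all word vectors

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(3, d))  # e.g., embeddings for "the horse runs"
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
contextualized = self_attention(x, w_q, w_k, w_v)  # shape (3, 8)
```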

Multilayer structure and word representation evolution

BERT’s architecture is characterized by a series of transformer encoder layers stacked sequentially. Each layer serves to refine and recontextualize the word representations inherited from the layer below. As words progress through the layers, their representations evolve to encapsulate more abstract and complex semantic information, reflecting the words’ roles and meanings within the full context of the sentence or passage. This hierarchical processing ensures that with each layer, word representations are enriched and influenced by the increasingly sophisticated context, allowing the model to capture intricate and deep semantic relationships within the text. This layered evolution of word representations is fundamental to BERT’s ability to comprehend and interpret complex linguistic structures. Predication as outlined in Kintsch and Mangalath (2011) contains a single layer, but it is straightforward to extend to multiple layers: one would simply take the final predicate representation and apply the process again as if it were the initial representation, as sketched below.
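
A sketch of that extension, reusing the illustrative predicate function from the earlier sketch; the function name deep_predicate and the layer count are assumptions made for the example.

```python
# A sketch of multilayer predication: each pass treats the previous output
# as if it were the predicate's initial, context-independent representation.
def deep_predicate(predicate_word, argument_word, lsa_vectors, n_layers=3):
    vectors = dict(lsa_vectors)  # copy so the original space is untouched
    for _ in range(n_layers):
        vectors[predicate_word] = predicate(predicate_word, argument_word, vectors)
    return vectors[predicate_word]
```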

Conclusion

Over the past three decades, the language processing field has shifted markedly: intricate, manually constructed models of syntax and semantics have been largely replaced by data-driven methods. The evolution of models has been driven by several factors: (a) the availability of larger text datasets, (b) advances in computational power, and (c) enhancements in algorithms. Walter was a leader in this push.

And the similarities between the predication algorithm and large language models, in particular the BERT architecture, are striking. I am not aware of any citations to Walter’s work in the machine learning literature, but it is possible that the concepts percolated indirectly. It is perhaps more likely that the similarities arose through a process of convergent evolution. They were, after all, trying to solve the same task: how does context influence the meaning of a word? The convergence of the solutions might suggest that we are reaching a deeper understanding.

Disclosure statement

No potential conflict of interest was reported by the author.

Notes

1. BERT actually creates representations for tokens, which may be words but can also include subword fragments and punctuation. I have used the term word here to make the exposition more accessible.

References

  • Alsentzer, E., Murphy, J. R., Boag, W., Weng, W.-H., Jin, D., Naumann, T., & McDermott, M. (2019). Publicly available clinical BERT embeddings. arXiv. https://arxiv.org/abs/1904.03323
  • Collins, A. M., & Loftus, E. F. (1975). A spreading-activation theory of semantic processing. Psychological Review, 82(6), 407–428. https://doi.org/10.1037/0033-295X.82.6.407
  • Collins, A. M., & Quillian, M. R. (1969). Retrieval time from semantic memory. Journal of Verbal Learning & Verbal Behavior, 8(2), 240–247. https://doi.org/10.1016/S0022-5371(69)80069-1
  • Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407. https://doi.org/10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9
  • Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv. https://arxiv.org/abs/1810.04805
  • Hummel, J. E., & Holyoak, K. J. (2003). A symbolic-connectionist theory of relational inference and generalization. Psychological Review, 110(2), 220–264. https://doi.org/10.1037/0033-295X.110.2.220
  • Jawahar, G., Sagot, B., & Seddah, D. (2019). What does BERT learn about the structure of language? Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. https://www.aclweb.org/anthology/P19-1356
  • Jones, M. N., Willits, J., & Dennis, S. (2015). Models of semantic memory. In Oxford handbook of computational and mathematical psychology (pp. 232–254). Oxford University Press.
  • Kintsch, W. (2001). Predication. Cognitive Science, 25(2), 173–202. https://doi.org/10.1207/s15516709cog2502_1
  • Kintsch, W., & Mangalath, P. (2011). The construction of meaning. Topics in Cognitive Science, 3(2), 346–370. https://doi.org/10.1111/j.1756-8765.2010.01107.x
  • Landauer, T. K., McNamara, D. S., Dennis, S., & Kintsch, W. (Eds.). (2007). Handbook of latent semantic analysis. Erlbaum.
  • Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv. https://arxiv.org/abs/1907.11692
  • Osgood, C. E. (1952). The nature and measurement of meaning. Psychological Bulletin, 49(3), 197–237.
  • Osgood, C. E. (1971). Exploration in semantic space: A personal diary. Journal of Social Issues, 27(4), 5–62. https://doi.org/10.1111/j.1540-4560.1971.tb00678.x
  • Rips, L. J., Shoben, E. J., & Smith, E. E. (1973). Semantic distance and the verification of semantic relations. Journal of Verbal Learning & Verbal Behavior, 12(1), 1–20. https://doi.org/10.1016/S0022-5371(73)80056-8
  • Smith, E. E., Shoben, E. J., & Rips, L. J. (1974). Structure and process in semantic memory: A featural model for semantic decisions. Psychological Review, 81(3), 214–241. https://doi.org/10.1037/h0036351
  • Sun, C., Qiu, X., Xu, Y., & Huang, X. (2019). How to fine-tune BERT for text classification? arXiv. https://arxiv.org/abs/1905.05583