Editorial

Truth and Regret: Large Language Models, the Quran, and Misinformation


The mass adoption of large language models (LLMs), spearheaded by OpenAI’s publicly available ChatGPT-3.5, continues to move at a dramatic pace. The ethical concerns arising from the vast array of opportunities offered by this technology are, of course, numerous. Amongst the most important of these are issues surrounding the scale, speed, and pervasiveness of misinformation infiltrating our knowledge economy. To minimise the occurrence of so-called LLM “confabulations,” and to ensure certain topics are entirely avoided by LLMs, many approaches have been proposed. These approaches include devoting greater attention to training and fine-tuning, often incorporating human feedback, as well as strategies for implementing additional guardrails. Researchers and engineers continue to improve upon such strategies to augment the performance of LLMs. Nevertheless, the tendency of LLMs to falsely present information as truth is an inherent and unavoidable feature of the technology itself. It is crucial for both developers and users of these technologies to account for this fact. Meanwhile, ethicists, religious or not, must carefully consider the potential societal impact that comes with an increasing reliance on such tools.

Religious traditions are deeply and fundamentally concerned with truth. This concern for truth obviously includes, but is not limited to, the truthful and accurate dissemination of a religion’s own teachings. It follows that the use of LLMs for seeking religious knowledge should be especially sensitive. In the case of the Quran, considered by Muslims to be the word of God, the literary integrity of the text is critical for Islamic theology and wider Muslim religiosity. For classical Muslim theologians the miracle of the Quran was widely seen in its literary inimitability.Footnote1 Muslims typically hold that the Quran, as found in the mass circulation of printed copies or as preserved within the memories of Muslims, is free from revision and alteration (tahrif).Footnote2 LLMs presenting Quranic-like (sic) material, and inaccurate Quranic citations, as truthful references to the Quran should thus be deeply problematic for Muslims. That such errors arise as a result of the very nature of LLMs means that our case has wider significance, highlighting the potential risk to truth in adopting an uncritical use of LLMs per se. A concern for truth is one that should not be limited to the religious instruction of any particular tradition. Truth is, or at least should be, a shared public concern. Here we report and comment on a small sample of regenerated queries to ChatGPT-3.5 and Google Bard regarding the Quran’s perspective on misinformation. We argue that the significant and consistent errors that emerge are related to the stochastic and probabilistic nature of LLMs. Quran 49:6 calls believers to adopt a critical approach to any information relayed by an immoral agent (fasiq): “O you who believe! If an immoral agent brings you some news, verify it … .” Although LLMs appear best described as tools, and thus not moral agents, we argue that they are not trustworthy and should not be employed without a critical disposition and due attention to truth.

Systematically studying the responses of LLMs can be a difficult undertaking, especially in the absence of reliable and objective quantitative assessments. Nearly all deployed LLMs operate stochastically; that is, repeated identical queries will yield nonidentical responses sampled from an underlying probability distribution. Studying an inherently stochastic system is like wading through a Heraclitean quagmire: since a single response from an LLM is never exactly reproducible, any article which draws conclusions about the nature of LLMs from a single correspondence is, at best, incomplete. Statisticians can draw conclusions from a variety of stochastic systems through sampling processes. By studying a set of regenerated responses from an LLM for a specific query, overall trends (relevant only to the LLM and the query) can be reproducibly observed and measured. We regenerated queries to the question “What does the Quran have to say about misinformation?” five times in Google Bard and five times in ChatGPT-3.5. We further sampled five responses from Google Bard with the additional prompt “please cite verses in Arabic” and five responses to the original query on ChatGPT-3.5 directly in Arabic.Footnote3 In this repeated small sampling of LLMs we found consistent trends of misbehaviour. We can assess this erroneous behaviour of LLMs both qualitatively and quantitatively.
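For readers who wish to reproduce this kind of repeated sampling programmatically, a sketch along the following lines could be used. Our own responses were gathered through the public chat interfaces of ChatGPT-3.5 and Google Bard; the client library, model name, and loop below are illustrative assumptions, not a record of our procedure.

```python
# Illustrative sketch only: repeated sampling of a single query against
# OpenAI's chat completions API (an assumption; our study used the web UI).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

QUERY = "What does the Quran have to say about misinformation?"
N_SAMPLES = 5  # number of regenerated responses per query

responses = []
for _ in range(N_SAMPLES):
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": QUERY}],
    )
    responses.append(completion.choices[0].message.content)

# Each item in `responses` is one stochastic sample from the model's
# underlying distribution; trends are then assessed across the whole set.
for i, text in enumerate(responses, start=1):
    print(f"--- response {i} ---\n{text}\n")
```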

Figure 1 shows a bar chart of the Quranic verses cited by LLMs across the various responses. These responses typically began with a general statement about the Quran’s perspective on misinformation, consistently recognising that the Quran does not directly mention the contemporary term “misinformation,” but that it does include teachings relevant to misinformation. Opening statements were then followed by two to five attempted references to the Quran regarding those teachings. As can be seen in Figure 1, the most commonly cited verse was the aforementioned 49:6. The second most cited verse was 2:42, “Do not mix the truth with falsehood, nor conceal the truth while you know.”
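As a hypothetical sketch of how a tally like the one behind Figure 1 could be assembled and plotted, consider the following; the verse counts in the snippet are placeholders, not our data.

```python
# Hypothetical tally of cited verses; the list below is a placeholder,
# not the data underlying Figure 1.
from collections import Counter
import matplotlib.pyplot as plt

cited_verses = ["49:6", "49:6", "2:42", "17:81", "2:42", "49:6"]  # example only
counts = Counter(cited_verses)

verses, freqs = zip(*counts.most_common())
plt.bar(verses, freqs)
plt.xlabel("Quranic verse cited")
plt.ylabel("Number of responses citing the verse")
plt.title("Attempted Quranic references across sampled responses")
plt.tight_layout()
plt.show()
```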

Figure 1. Frequency of LLM attempted references to different verses of the Quran in response to the query “What does the Quran have to say about misinformation?” Note that when ChatGPT-Arabic cites the Quran 68:11, it also includes the subsequent verse (68:12) in its return. This was not the case for any other LLM return.


Only two out of the twenty responses we gathered (one from ChatGPT-3.5 and one from ChatGPT-3.5 in Arabic) had no mistakes in attempted Quranic references. The errors were of different sorts, including: the misordering of words in a Quranic quotation; providing incorrect citations for a Quranic quotation; producing correctly quoted and cited, yet irrelevant, Quranic material; and producing quotations that appear to be an amalgamation of Quranic phrases, or extra-Quranic material entirely misattributed to the text of the Quran itself. The latter category of error, sheer confabulation, appeared in both the English and the Arabic responses. For the purpose of Quranic references, none of these errors are trivial.Footnote4

For example, consider the misordering of a single word that occurred in three attempted references by ChatGPT-3.5 Arabic to 17:81. The accurate, accepted Arabic text of 17:81 reads as follows:

وَقُلْ جَاءَ الْحَقُّ وَزَهَقَ الْبَاطِلُ ۚ إِنَّ الْبَاطِلَ كَانَ زَهُوقًا

This translates as: “Say: ‘The truth has come, and falsehood has vanished. Indeed falsehood is bound to vanish.’” In all three LLM attempts to reference 17:81, the third Arabic word in the phrase, al-haq, was erroneously placed before the verb ja’a and presented as follows:

وَقُلِ الْحَقُّ جَاءَ وَزَهَقَ الْبَاطِلُ ۚ إِنَّ الْبَاطِلَ كَانَ زَهُوقًا

An Arabic reader would likely spot this as an odd formulation, and many non-Arabic readers would recognise the error through prior familiarity with the verse itself. Nevertheless, it constitutes an unacceptable alteration of the text. Furthermore, the verse’s relevance to misinformation is not obvious, as Quranic exegetes widely understand it to be referring to the Quran itself.

Beyond such misordering of words and the near-pervasive miscitation of Quranic quotations, we found outright confabulations in each of ChatGPT-3.5 Arabic, Google Bard, and Google Bard when prompted to cite verses in Arabic. For example, ChatGPT-3.5 Arabic attributes the following text to 68:11-12:

إِنَّ الَّذِينَ يُلْقُونَ الْإِيمَانَ وَرَدُّوا بِأَدْبَارِهِمْ فَلَا يَؤْذَنَّ بِهِمْ ۘ وَهُمْ لَهَا صَادِرُونَ إِلَّا الَّذِينَ تَابُوا مِنْ بَعْدِهَا وَأَصْلَحُوا فَإِنَّ اللَّهَ غَفُورٌ رَحِيمٌ

Loosely, this might translate as follows: “Surely those who have been presented with belief, and turn their backs, will not be given permission whilst they are brought forth, except those who repent thereafter and reform; surely Allah is forgiving and merciful.” Quran 68:11-12, however, states nothing of the sort, nor is the generated text a verse that can be found elsewhere in the Quran. 68:11-12 actually reads as follows:

هَمَّازٍ مَّشَّاءٍ بِنَمِيمٍ - مَّنَّاعٍ لِّلْخَيْرِ مُعْتَدٍ أَثِيمٍ

The two verses, which are more meaningful when read from at least 68:10, translate as “scandal-mongerer, talebearer, hinderer of all good, sinful transgressor.” The fuller passage from 68:10 translates as “And do not obey any vile swearer, scandal-mongerer, talebearer, hinderer of all good, sinful transgressor.” It is important to note that the impact of these qualitative examples of LLM misbehaviour may be greater than the sum of their parts. The typically coherent and eloquent framing of attempted Quranic references, along with the presence of accurate and familiar Quranic material, means that inaccurate or confabulated text generated by LLMs is more likely to be mistakenly accepted as true.

We employ two quantitative methods to assess the degree to which ChatGPT-3.5 and Bard misquoted the Quran. First, we use a pre-trained multilingual semantic textual similarity (STS) model (available for download from HuggingFaceFootnote5) to compute embeddings for accurate Quranic verses and the LLM-quoted verses, and then compare the embeddings using a cosine distance metric.Footnote6 A cosine distance of zero occurs when the compared statements are identical; a cosine distance of one occurs when the compared statements have completely different meanings according to the pre-trained model. When comparing returns in English, we use the widely available and acclaimed English translation of the Quran by Arthur J. Arberry (AJA). Figure 2 shows the results of this analysis.
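As an illustration, the cosine-distance comparison can be reproduced in a few lines with the sentence-transformers library and the pre-trained model named in note 5. The verse pairing below is a hypothetical example rather than one of our sampled returns.

```python
# Minimal sketch of the STS comparison, assuming the sentence-transformers
# package and the pre-trained multilingual model cited in note 5.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/stsb-xlm-r-multilingual")

# Hypothetical pair: an accurate translation of 2:42 and an LLM's rendering.
reference = "Do not mix the truth with falsehood, nor conceal the truth while you know."
llm_quote = "And do not confuse truth with falsehood or hide the truth knowingly."

emb_ref, emb_llm = model.encode([reference, llm_quote])

# Cosine distance = 1 - cosine similarity; 0 means identical meaning,
# values near 1 indicate little semantic overlap under this model.
cosine_distance = 1.0 - util.cos_sim(emb_ref, emb_llm).item()
print(f"STS cosine distance: {cosine_distance:.3f}")
```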

Figure 2. Distribution of STS cosine distances for ChatGPT-3.5 and Bard English and Arabic returns when quoting from the Quran.


The STS embedding is designed to place a higher weight on semantic similarity, and appears to indicate that the Bard English references to the Quran are semantically closest, while ChatGPT-3.5 English references have few semantic commonalities. While ChatGPT-3.5 performs better on returns in English, Bard excels at returns in Arabic. Overall, the distribution of semantic distances across all LLM tests leaves a lot to be desired.

In addition to STS cosine distance, we also consider a BiLingual Evaluation Understudy (BLEU) score evaluation of passages.Footnote7 The BLEU metric is a gold standard when benchmarking the performance of models trained for language translation tasks. A BLEU score of one indicates that two passages are identical, in the sense of containing the same set of words independent of ordering, while a BLEU score of zero indicates that two passages have no words in common. Figure 3 shows the distribution of BLEU scores for Quranic references.
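For illustration, a unigram BLEU comparison of the kind described in note 7 can be computed with NLTK as follows. The verse pair is again hypothetical, and the unigram-only weighting is our assumption about the configuration closest to the description above.

```python
# Minimal sketch of a unigram (1-gram) BLEU comparison, assuming NLTK.
# The verse pair is hypothetical; it is not one of our sampled returns.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "Do not mix the truth with falsehood, nor conceal the truth while you know."
llm_quote = "And do not confuse truth with falsehood or hide the truth knowingly."

ref_tokens = [reference.lower().split()]  # BLEU expects a list of reference token lists
hyp_tokens = llm_quote.lower().split()

# weights=(1, 0, 0, 0) restricts the score to single-word precision;
# smoothing avoids degenerate zero scores on short passages.
score = sentence_bleu(
    ref_tokens,
    hyp_tokens,
    weights=(1, 0, 0, 0),
    smoothing_function=SmoothingFunction().method1,
)
print(f"Unigram BLEU score: {score:.3f}")
```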

Figure 3. Distribution of BLEU scores for ChatGPT-3.5 and Bard English and Arabic returns when quoting from the Quran.


The BLEU scores tell a similar story to the STS cosine distance, although instead of pointing out semantic differences, we see that LLM responses rarely include identical words. For LLM responses that do not deviate significantly from the meaning of the original text, synonym substitution may partially explain these low scores. ChatGPT with Arabic returns appears to out-perform the other experiments, which may suggest that ChatGPT has a greater capacity to regurgitate exact words and phrases from the Arabic Quran (which is assumed to be contained in the training set).

We have demonstrated, qualitatively and quantitatively, that LLMs are not trustworthy when quoting Quranic verses in response to our misinformation query. Several factors contribute to this deficiency. First, both LLMs employed in our small sample are reported to be less performant in Arabic than in English, which may be attributed to the greater availability of English text in training sets.Footnote8 Second, as previously discussed, LLMs such as ChatGPT-3.5 and Bard are stochastic. In stark contrast to search engines, LLMs are not designed to perfectly retrieve and recapitulate information. While search engine queries can be thought of as effectively deterministic (where identical searches for the most part lead to identical results), LLMs are inherently stochastic; that is, responses are iteratively generated by selecting the next word according to a weighted probability. The reliance on these weighted probabilities is the critical feature that separates today’s state of the art from bygone expert systems and chatbots. Confabulations can never be completely eradicated, and must be accepted as an unavoidable risk of LLMs used for text generation. The magnitude of this risk, and the severity of the consequences it poses, will vary by application and domain. However, as of this writing, we do not find LLMs fit to adequately return references to Quranic passages.
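The weighted-probability selection just described can be pictured with a toy sketch. The candidate words and probabilities below are invented purely for illustration and bear no relation to any real model’s vocabulary or weights.

```python
# Toy illustration of stochastic next-word selection; the candidates and
# probabilities are invented and do not come from any real model.
import random

def sample_next_word(candidates: dict) -> str:
    """Pick one word at random, weighted by its assigned probability."""
    words = list(candidates)
    weights = list(candidates.values())
    return random.choices(words, weights=weights, k=1)[0]

# Hypothetical distribution over the word following "The truth has ..."
next_word_probs = {"come": 0.62, "prevailed": 0.21, "appeared": 0.12, "vanished": 0.05}

# Regenerating the same prompt can yield a different continuation each time.
for _ in range(5):
    print(sample_next_word(next_word_probs))
```

Because every word is drawn in this way, two regenerations of the same prompt can diverge early and then compound their differences, which is why identical queries yield non-identical responses.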

The fact that the misbehaviour of LLMs in referencing the Quran results from a key feature of these technologies themselves should thus bring to attention the importance of a critical approach to using LLMs – not only for religious instruction, but also more generally. The coherent, and typically eloquent, framing of these error-ridden references means that they are in no way obvious, especially to the non-expert. The presence of erroneous references alongside accurate ones can mislead the user of LLMs into unwittingly accepting the entirety of any given response. The irony of such mixing of truth with falsehood in a query about misinformation should be obvious when recalling Quran 2:42, the second most regularly cited verse within our sample: “Do not mix the truth with falsehood, nor conceal the truth while you know.” Although many have speculated that the extent of confabulations and errors generated by LLMs may be better managed as the technology develops, there are also reasons to believe that performance will stagnate or even degrade over time. As LLMs contribute an ever larger fraction of written online content, subsequent model fine-tuning has been shown to result in eventual model collapse.Footnote9 Presentation of this case, and the brief analysis offered here, should serve to highlight that a critical approach to the use of LLMs, and due attention to truth, are imperative. Failing to assess or verify the responses given to any query produced by an LLM risks serious societal harms. Returning to the verse most regularly cited in our sample, and quoting 49:6 now in its entirety: “O you who believe! If an immoral agent brings you some news, verify it, lest you should visit [harm] on a people out of ignorance, and then become regretful for what you have done.” Amidst all the excitement and possibilities that LLMs bring, not taking our responsibilities to truth seriously may leave us looking back at the current AI moment with great remorse.

Notes

1 On Inimitability see Richard C. Martin, “Inimitability,” in Encyclopaedia of the Qurʾān, ed. Johanna Pink (Leiden: Brill). Consulted online on 22 August 2023, http://dx.doi.org/10.1163/1875-3922_q3_EQCOM_00093.

2 On tahrif see Shari Lowin, “Revision and Alteration,” in Encyclopaedia of the Qurʾān, ed. Johanna Pink (Leiden: Brill). Consulted online on 22 August 2023, http://dx.doi.org/10.1163/1875-3922_q3_EQSIM_00358.

3 When asking the standard English version of ChatGPT-3.5 to cite verses in Arabic we apparently hit a guardrail. “I apologize for any confusion caused. However, as an AI language model, I do not have direct access to the Quran in Arabic or the ability to cite specific verses. I can provide general information and interpretations based on my training, but for precise citations and detailed analysis of specific verses, I recommend consulting a qualified Islamic scholar or referring to trusted translations and commentaries of the Quran.”

4 It is worth noting that the misquotations arising from Bard are especially concerning, as Bard is given access to the Google API when composing responses and can cite its answers with specific website links. Instead of linking to reputable sources for Quranic translations or source text, the links provided by Bard were often broken or led to social media posts on Facebook or other platforms.

5 On STS models see Nils Reimers and Iryna Gurevych, “Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks,” arXiv 1908.10084 (2019). The pre-trained STS model is available for download on HuggingFace at the following URL: https://huggingface.co/sentence-transformers/stsb-xlm-r-multilingual

6 Cosine distance is a common method for comparing two vectors u and v in a variety of contexts. A vector can be thought of as a directed line segment that possesses both magnitude and direction. A cosine distance metric compares the difference in direction between two vectors, expressed in terms of the cosine of the angle required to rotate one vector into the direction of the other. For the variety of NLP models that rely on embeddings of words, passages, or entire documents, the cosine distance metric serves as a proxy for measuring the overall semantic similarity of these entities.
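In symbols (our notation, added for clarity, not part of the original note), the cosine distance between two embedding vectors u and v is

```latex
d_{\cos}(u, v) = 1 - \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert},
```

so that vectors pointing in the same direction have a distance of zero.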

7 A BiLingual Evaluation Understudy (BLEU) score is calculated by combining a series of precision scores with a brevity penalty. Precision can be calculated at the single-word (1-gram) level by taking the ratio of correctly matching words in the prediction to the total number of words in the prediction. The brevity penalty penalizes predictions that are shorter than the source. For further detail on BLEU see Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu, “Bleu: a Method for Automatic Evaluation of Machine Translation,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, ed. Pierre Isabelle, Eugene Charniak and Dekang Lin (Philadelphia: Association for Computational Linguistics, 2002), 311–18.
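In standard notation (added here for reference, following Papineni et al.), the score combines modified n-gram precisions p_n with weights w_n and a brevity penalty BP, where c and r are the candidate and reference lengths:

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log p_n \right),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r,\\
e^{\,1 - r/c} & \text{if } c \le r.
\end{cases}
```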

8 Assessing LLM performance across languages is commonly done using the “Measuring Massive Multitask Language Understanding” (MMLU) benchmark. See Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt, “Measuring Massive Multitask Language Understanding,” arXiv 2009.03300v3 (2021).

9 For more information on the problem of model collapse, see Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Anderson, “Model Dementia: Generated Data Makes Models Forget,” arXiv 2305.17493 (2023).
