
Abstract

Problem, research strategy, and findings

Planners are increasingly using online public engagement approaches to broaden their reach in communities. This results in substantial volumes of digital, text-based public feedback data that are difficult to analyze efficiently and to distill into meaningful insights. We explored the use of the novel large language model (LLM), ChatGPT, in analyzing a public feedback data set collected via online submissions in Hamilton City (New Zealand) in response to a proposed local plan change. Specifically, we initially employed zero-shot prompts with ChatGPT for tasks like summarizing, topic identification, and sentiment analysis and compared the results with those obtained by human planners and two standard natural language processing (NLP) techniques: latent Dirichlet allocation (LDA) topic modeling and lexicon-based sentiment analysis. The findings show that zero-shot prompting effectively identified political stances (accuracy: 81.7%), reasons (87.3%), decisions sought (85.8%), and associated sentiments (94.1%). Although subject to several limitations, ChatGPT demonstrates promise in automating the analysis of public feedback, offering substantial time and cost savings. In addition, few-shot prompting enhanced performance in more complex tasks, such as topic identification involving planning jargon. We also provide insights for urban planners to better harness the power of ChatGPT to analyze citizen feedback.

Takeaway for practice

ChatGPT presents a transformative opportunity for planners, particularly those dealing with growing volumes of public feedback data. However, it cannot be entirely relied upon. Planners must be mindful of ChatGPT’s limitations, including its sensitivity to prompt phrasing, inherent biases from training data, tendency to overgeneralize, and occasional omission of nuanced details. To enhance accuracy, planners should prescreen data for consistency, provide clear and iteratively tested prompts, use few-shot prompts for complex analysis, and explore various combinations of prompting strategies to develop an effective local approach. It is also crucial to ensure human review of the results.

Public engagement is at the heart of successful urban planning, enabling cities to evolve and function in a way that meets the diverse needs of their inhabitants (Arnstein, Citation1969; Davidoff, Citation1965). To this end, public engagement requires a robust, continuous mechanism to foster dialogue between planners and the communities they serve (Innes & Booher, Citation2004). Such engagement not only ensures a democratic planning process, allowing the public to voice their opinions, but also caters to the unique socioeconomic and environmental context of the locale, delivering fit-for-purpose, place-based planning outcomes (Fung & Wright, Citation2001; Healey, Citation1997).

Genuine public engagement involves more than just gathering public opinions or feedback on proposals: It is about understanding, synthesizing, and integrating these viewpoints into actionable policies and plans (Innes & Booher, Citation2004; Quick & Feldman, Citation2011). Planners have been increasingly encouraging citizen participation; for instance, by reaching more community members through social media and designing accessible, easy-to-use online platforms to collate public feedback (Ertiö & Bhagwatwar, Citation2017; Fredericks & Foth, Citation2013; Wilson et al., Citation2019). However, this process, especially in large cities and regions, poses substantial challenges due to the vast amount of feedback received, the time and resources required to process this feedback manually, and the frequency with which this process must be repeated. The need for informed and publicly engaged planning is only expected to intensify as cities and regions face increasingly complex and challenging issues.

In recent years, natural language processing (NLP), a subfield of artificial intelligence (AI), has emerged as a promising technology to assist planners in handling the rapidly growing volumes of textual data more effectively and efficiently. This includes data such as plans (Fu, Li, & Zhai, Citation2023), social media data (Huai et al., Citation2023), and public feedback (Kim et al., Citation2021). The release of ChatGPT in late 2022 significantly advanced mainstream NLP. ChatGPT particularly excels in understanding and analyzing textual data in greater depth and demonstrates substantial versatility in language processing tasks unmatched by its predecessors. In this research context, ChatGPT holds the potential to revolutionize the analysis of public feedback for urban planners, thereby catalyzing more comprehensive and well-informed public engagement initiatives on a larger scale. Despite the excitement surrounding ChatGPT, there are several key limitations hindering its reliability and widespread adoption. First, how ChatGPT can be used, and its performance compared with humans and other commonly used NLP techniques, is currently unknown. Second, because ChatGPT’s performance is highly sensitive to how prompts are crafted, determining effective methods to improve its performance across various tasks remains a significant research gap. Therefore, we aimed to explore the use of ChatGPT in analyzing public feedback data to a) benchmark its performance and b) derive best practices from a prompt engineering perspective. Our findings suggested that although planners were still better in analyzing the public feedback data, ChatGPT’s performance was close. Using ChatGPT and similar large language model (LLM) tools can significantly enhance the analysis of public input. These tools offer notable efficiency and the capability for continuous, real-time analysis, aiding urban planners in gaining quick insights and facilitating responsive policymaking. In addition, their scalability enables handling increasing volumes of data without the need for additional resources or personnel. Overall, benchmarking results and prompt engineering findings offer transferable lessons for cutting-edge planners to improve how they engage with community input, potentially promoting more informed, efficient, and fair urban planning.

We begin with a literature review on public engagement and the analysis of collated public feedback. This is followed by a description of our methodology, including the research context, how we used ChatGPT to analyze public feedback, and how we compared its results against those generated by professional planners and two other conventional NLP methods, namely, latent Dirichlet allocation (LDA) topic modeling and lexicon-based sentiment analysis. We then discuss the results and findings. We conclude with key takeaways for planning researchers and practitioners, offering insights into how to effectively leverage the capabilities of ChatGPT and highlighting directions for future research.

Literature Review

Analyzing Feedback from Public Participation

It is well documented that public participation is a crucial element of effective planning processes. Although the specifics of how such activities should be carried out and who should be involved remain contested (Clark, Citation2021; Laskey & Nicholls, Citation2019; Sanchez, Citation2023), their purposes are relatively clear. These generally include understanding the public’s preferences, incorporating citizens’ local knowledge, advancing fairness and justice, legitimizing planning decisions, fulfilling due planning processes as required by law, promoting civic ownership, and building an adaptive society capable of addressing changing circumstances (Innes & Booher, Citation2004). In this review, we emphasize the relevant literature concerning the conventional analysis of public feedback and the key challenges involved.

Public participation activities can take various forms, ranging from public hearings and surveys to online engagement platforms (Haklay et al., Citation2018; Shipley & Utz, Citation2012). Regardless of the format, the outcome of such activities is often a record of collated public feedback (primarily textual) that planners then analyze to identify key concerns and common themes to better inform local decision making (Shipley & Utz, Citation2012). Research on professional planners’ analyses of public participation data sets is scarce, likely due to the highly contextual and place-specific nature of these activities. The varying purposes and topics of these activities challenge the application of a one-size-fits-all analytical approach.

Nevertheless, some common themes can still be distilled from the literature. For example, a recent review on public participation in heritage planning by Foroughi et al. (Citation2023) found that most studies were at level 2 (consultation) on the International Association of Public Participation (IAP2) spectrum, which is like Arnstein’s ladder of citizen participation (Arnstein, Citation1969). This means that most engagements involved only consulting the public through a one-way interaction between citizens and the organizing institution (e.g., Aigwi et al., Citation2019; van der Hoeven, Citation2020; Yu et al., Citation2019). Fewer studies reached involvement (level 3) with two-way interaction or collaboration (level 4) for citizen-to-citizen interaction. This diversity in participation led to varied data sets, analyzed with different methods (content, spatial, statistical). For our research, which focused on NLP, we specifically reviewed text-based public feedback from hearings, surveys, and online platforms. Feedback analysis varied with the participation level, from basic summarization to more complex analyses like topic identification, sentiment analysis, and demographic study.

Existing textual public feedback data can be analyzed to varying extents or depths depending on the desired level of participation (Foroughi et al., Citation2023; Laskey & Nicholls, Citation2019; Shipley & Utz, Citation2012). In summary, public participation activities may simply allow citizens to express themselves, but the feedback collected often has little or no impact on decisions (termed placation in Arnstein’s ladder of citizen participation) (Arnstein, Citation1969). The primary purpose of these activities is to acknowledge public input, necessitating the summarization and publication of feedback as evidence of acknowledgment. As planners move up the ladder, they consult the public to glean useful information for improved decision making (Sanchez & Brenman, Citation2013). This process involves analyzing feedback to identify key topics, patterns, and trends; assessing public sentiment on various topics; and categorizing the data for better organization and accessibility. At higher levels, a more in-depth analysis, including the participant’s identity and demographic representation, is required to ensure feedback serves the public good and not individual interests (Shipley & Utz, Citation2012). Due to the scope of our research, we primarily focused on the first two levels of participation.

NLP Application in Public Participation

Research has increasingly emerged applying NLP techniques to analyze textual data relevant to urban planning (Cai, Citation2021; Fu, Citation2024). Existing research covers a broad range of topics, such as detecting elected officials’ planning priorities (Han & Laurian, Citation2023; Han et al., Citation2021), processing planning administrative data (Mleczko & Desmond, Citation2023), and evaluating planning documents (Brinkley & Stahmer, Citation2021; Fu, Li, & Zhai, Citation2023). A dominant pattern in existing NLP research within planning is the use of techniques, including topic modeling and sentiment analysis, to assess public perceptions and opinions on planning-related topics, such as green spaces, public transit, and disasters.

For example, Kong et al. (Citation2022) analyzed online park reviews to identify key environmental features influencing public perception of park quality. Others have analyzed Twitter data to gauge public perception of transport services (Lock & Pettit, Citation2020) or to measure how public emotions change over time and space during a disaster event to design better emergency responses (Zhai et al., Citation2020). Most existing studies focused on social media data, whereas few have analyzed actual public feedback collected from conventional citizen engagement approaches. Notably, one study developed a dynamic topic model based on 160,000 civic queries collected through Seoul’s (Korea) civic participation platform to process future public submissions and forecast citizens’ demands more efficiently (Kim et al., Citation2021).

Although interest in NLP applications is rapidly growing, we contend that there are two major gaps in this emerging field within the planning literature. The first gap is the lack of result validation. Although NLP techniques demonstrate significant potential for efficiently analyzing vast amounts of textual data (Huai et al., Citation2023), they are prone to generating inaccurate results due to their inherent methodological limitations and the prevalence of poor-quality data and ubiquitous noise in large data sets. For instance, elements such as sarcasm, metaphors, and colloquial language in social media data pose significant methodological challenges for existing NLP techniques like sentiment analysis, often leading to unreliable results (Khurana et al., Citation2023). Without rigorous validation, these novel tools, despite their considerable potential, may not be deemed reliable by practitioners, thus rendering them impractical in planning practice (Krizek et al., Citation2009; Ramage et al., Citation2009).

The second gap concerns the accessibility of NLP techniques for planners, attributed to a steep learning curve. The plethora of different NLP tools, each with its own variations, and the continuous development of new tools make it challenging for planners to keep up with the latest advancements. Furthermore, effectively applying these tools often requires coding skills and a solid understanding of data structures, which are typically beyond the expertise of an average planner. These technical barriers significantly impede the widespread adoption of NLP methods in the planning profession.

ChatGPT, a leading LLM, has great potential in addressing these gaps. It enables human-like conversational dialogue and demonstrates promise in producing insightful and useful information with deliberate prompts (Wu et al., Citation2023). It is also to date the most accessible and versatile tool intentionally crafted to execute a wide range of text-based analyses and tasks (Ray, Citation2023). Despite its huge potential, ChatGPT is a general-purpose language model, which means that it does not necessarily have domain-specific knowledge, nor can it perform consistently over various language processing tasks. This has led to rapidly growing research aimed at benchmarking ChatGPT’s performance across various domains and tasks, ranging from answering exam questions in ophthalmology (Antaki et al., Citation2023) to solving programming bugs in computer science (Surameery & Shakor, Citation2023). Though existing findings are inconclusive, there has been a consensus in the literature that ChatGPT generally performs well in most areas. However, its performance can still vary significantly across different domains and tasks, and it often struggles with questions requiring domain-specific knowledge (Bhayana et al., Citation2023; Duong & Solomon, Citation2023; Fu, Wang, & Li, Citation2023; Yeo et al., Citation2023).

Given the importance of analyzing public feedback in planning practice and the growing challenges faced by planners in processing such feedback—which is often digitally collected, voluminous, and frequent—it is critical to examine how ChatGPT can be leveraged for this purpose. To our knowledge, this is the first study of its kind in the planning field. Furthermore, public feedback analysis presents unique challenges, such as varied language use, abundant subjective expressions and opinions, the need for local contextual understanding, and the presence of data sparsity and noise. These factors introduce significant uncertainties about the extent to which ChatGPT can effectively analyze public feedback. Thus, our study fills this research gap by investigating ChatGPT’s potential in analyzing real-world public feedback. This study is not only crucial in providing a framework for using ChatGPT to aid planners but also in empirically validating its effectiveness. It offers timely and practical insights for practitioners aiming to facilitate the responsible, broader adoption of ChatGPT in planning practice.

Methodology

Research Context

In this study, we used public feedback data collected through the Hamilton City Council’s (HCC) official website from August 19 to September 30, 2022, regarding Plan Change 12 (PC12)—Enabling Housing Supply (HCC, Citation2024). PC12 was proposed by HCC in response to two major pieces of state legislation: the National Policy Statement on Urban Development (NPS-UD) and the Resource Management (Enabling Housing Supply and Other Matters) Amendment Act 2021 (Enabling Housing Act). Both pieces of legislation were high-level strategic policies aimed at enabling greater housing density in major urban areas (Ministry for the Environment, Citation2021b, Citation2022). Without delving into New Zealand’s complex planning system (Ministry for the Environment, Citation2021a), we provide a brief background of New Zealand’s regulatory framework here to contextualize this study.

HCC, the local governmental body for Hamilton (the fourth-largest city in New Zealand with a population of nearly 186,000), is responsible for planning issues specific to local communities, such as where and what types of developments are allowed. Local councils are empowered to develop their place-based district plans that set the rules for land use, development, and subdivision, but they must comply with state legislation. Therefore, because of the NPS-UD and the Enabling Housing Act, HCC had to amend its district plan to reflect the new national rules. PC12 was the local response. However, this response did not fully comply with the state legislation; as HCC acknowledged, “PC12 will change Hamilton’s District Plan to provide more housing and different types of housing within the city. However, it doesn’t go as far as the Government wants us to” (HCC, Citation2023, p. 1). Before PC12 becomes operative in late 2024, the process includes, in chronological order, public notification of the proposal in August 2022, an open request for public submissions on the proposal until September 2022, multiple hearings thereafter, and, finally, notification of the decision in early 2024.

The data set used in this study consisted of public submissions collated during the open request for submissions following the public notification of PC12. In total, our data set contained 1,996 data entries, including key public feedback information such as whether the respondents supported or opposed the plan change, reasons given by the submitters, explanations of decisions sought by the submitters, and respective summaries of reasons and decisions sought, manually generated by the planners at HCC. These summaries were required so that the council could acknowledge receipt of the public submissions. Hence, this data set readily allowed us to compare the summary results generated by ChatGPT with those existing summaries created by professional planners. After scanning the data set, we omitted 14 entries due to missing information, specifically references to attachments not included in the data set. A final data set of 1,982 data entries was used in this research.

Research Methodology

The objectives of this study were twofold: first, to benchmark ChatGPT against humans and traditional NLP tools on the major textual analysis tasks, including feedback summarization, topic identification, and sentiment analysis; and second, to derive good prompt-engineering strategies for generating higher-quality outputs from ChatGPT. The latest ChatGPT model (gpt-4-1106-preview) was used during the period from November 23 to 26, 2023, to process the public feedback data through OpenAI’s API platform in the Python programming environment. We enhanced output consistency and reproducibility in this version by setting the temperature parameter to zero, which reduces the model’s creativity (randomness), and by fixing the seed parameter for greater control (OpenAI, Citation2023). The updated model also excels in handling specific formats like JSON, allowing for multiple queries in one prompt and easy conversion of outputs into data frames or tables, with results stored in corresponding columns. In addition, our research incorporated sentiment analysis, topic modeling, and manual comparisons of ChatGPT outputs with human summaries and traditional NLP techniques. This methodology overview is designed for audiences without AI/NLP expertise; for technical details and code, see the Technical Appendix.
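To illustrate this configuration, the following minimal sketch shows how a single submission could be sent to the model through OpenAI’s Python client with the temperature set to zero, a fixed seed, and a JSON-formatted response. The prompt wording, seed value, and output field names are our own illustrative assumptions rather than the study’s exact code; the actual prompts and code are in the Technical Appendix.

```python
# Minimal sketch of the API setup described above. The prompt wording, seed value,
# and JSON field names are illustrative assumptions, not the study's exact code.
import json
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

def analyze_submission(text: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        temperature=0,                            # reduce randomness for consistent outputs
        seed=42,                                  # fix the seed for greater reproducibility
        response_format={"type": "json_object"},  # request structured JSON output
        messages=[
            {"role": "system",
             "content": "You are an urban planner analyzing public submissions on a proposed plan change."},
            {"role": "user",
             "content": ("Analyze the submission delimited by triple quotes and reply with a JSON object "
                         "containing the keys 'stance', 'reasons', 'decision_sought', and 'sentiment'.\n"
                         f'"""{text}"""')},
        ],
    )
    return json.loads(response.choices[0].message.content)
```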

We initially used zero-shot prompts with ChatGPT, which involved giving prompts without specific instructions or a training data set. This method is the most widely used. We tasked ChatGPT with extracting five key elements from public feedback: 1) political stance (support, opposition, or unspecified), 2) reasons from submitters, 3) decisions sought by submitters, 4) sentiment of the submission (positive, negative, or neutral), and 5) relevant planning topics. The first four outcomes were generated by simply asking ChatGPT to “identify the political stance associated with the public feedback,” “summarize the reasons given by the submitter,” “summarize the decision sought by the submitters,” and “analyze the submitter’s sentiment associated with the submission.” The outputs, typically in a consistent format, were compared with planner-generated results and sentiment analysis using VADER (Valence Aware Dictionary and sEntiment Reasoner) (Hutto & Gilbert, Citation2014).
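For the lexicon-based baseline, the short sketch below shows how VADER could assign a comparable positive/negative/neutral label; the ±0.05 compound-score thresholds are VADER’s commonly recommended defaults and are an assumption on our part rather than the study’s documented settings.

```python
# Lexicon-based sentiment baseline (sketch). The +/-0.05 compound-score thresholds are
# VADER's commonly recommended defaults, assumed here rather than taken from the study.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def vader_label(text: str) -> str:
    compound = analyzer.polarity_scores(text)["compound"]
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"
```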

To compare ChatGPT’s topic identification with topic modeling, we initially asked ChatGPT to pinpoint the most discussed planning topics from the data set, varying from 5 to 20 topics in increments of 5. This generated a list of 11 key planning topics, cross-checked with HCC’s five key planning changes and reviewed by HCC planners for accuracy (see Table 1). ChatGPT was then prompted to match submissions to these 11 topics. Concurrently, we performed LDA topic modeling, incrementally increasing the number of generated topics to align with our list. We selected the topic modeling results with 16 topics and aggregated their distribution to match our key topics list, facilitating comparison.

Table 1. Key planning topics.
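The LDA topic-modeling baseline described above could be implemented with a standard library; the sketch below uses gensim with 16 topics, which is an assumption on our part because the article does not name the specific implementation.

```python
# LDA topic-modeling baseline (sketch; gensim and the preprocessing choices are assumptions).
from gensim import corpora
from gensim.models import LdaModel
from gensim.parsing.preprocessing import preprocess_string

def fit_lda(submissions, num_topics=16):
    tokens = [preprocess_string(doc) for doc in submissions]  # lowercase, remove stopwords, stem
    dictionary = corpora.Dictionary(tokens)
    corpus = [dictionary.doc2bow(doc) for doc in tokens]
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=num_topics, passes=10, random_state=0)
    # Dominant topic per submission, used for comparison against ChatGPT's topic labels
    dominant = [max(lda.get_document_topics(bow), key=lambda t: t[1])[0] if bow else None
                for bow in corpus]
    return lda, dominant
```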

We also explored few-shot prompting to see whether it outperformed zero-shot prompting. Few-shot prompting involved providing detailed instructions and training data sets with specific inputs and desired outputs. We created a training set of 100 submissions on which ChatGPT had initially underperformed based on our manual evaluation, dividing the full data set into a training set (N = 100) and a test set (N = 1,882). To implement few-shot prompting, we instructed ChatGPT to first learn from the training data set by including it at the beginning of the prompt and then to answer the same questions as those asked in the zero-shot prompts.
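A few-shot prompt can be assembled by prepending worked examples from this training set to the question posed for a new submission. The sketch below illustrates one way to do so; the instruction wording and example format are our assumptions, and the study’s actual prompts are provided in the Technical Appendix.

```python
# Few-shot prompt construction (sketch; instruction wording and example format are assumptions).
def build_few_shot_prompt(examples, new_submission):
    """examples: (submission_text, desired_answer) pairs drawn from the 100-entry training set."""
    shots = "\n\n".join(
        f'Submission: """{text}"""\nAnswer: {answer}' for text, answer in examples
    )
    return (
        "Learn from the worked examples below, then analyze the new submission in the same way. "
        "Reply with the answer only.\n\n"
        f"{shots}\n\n"
        f'Submission: """{new_submission}"""\nAnswer:'
    )
```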

To validate and assess ChatGPT’s performance and reliability, two research assistants, trained by the lead researcher, compared ChatGPT’s findings and summaries with those of professional planners and evaluated sentiment levels and topical relevance. Once this was completed, the lead researcher independently reviewed and finalized these findings. We identified discrepancies between ChatGPT’s zero-shot results and human or conventional NLP evaluations. This comparison allowed us to benchmark ChatGPT’s effectiveness and develop prompting guidelines, providing insights for planners to effectively use this tool in analyzing public submissions.

Limitations

Key limitations of this study primarily stemmed from the methodological limitations of the ChatGPT model. Specifically, ChatGPT is highly sensitive to input phrasing. The model often evolves with irregular updates that can change its behavior and performance. As a result, there is a consensus that ChatGPT may not yield consistent results when reproduced on a different data set (with the same prompt but different public feedback data) or even the same data set in the future, following new model updates. Although the latest model enabled a higher degree of control over its outputs in a shorter time frame, ChatGPT, as a living AI model, will inevitably change its behavior over time with more training, akin to how human beings might answer the same questions differently at various stages of their lives. Though we have no control over its performance drifts over time, based on our experiences in the last year, the model did appear to become increasingly capable and perform better, aligning more closely with human expectations in its outputs. Future research could continue evaluating ChatGPT’s performance over time and compare it with other models, such as Google's Gemini and Meta’s LLaMA, in analyzing public feedback or assisting with other planning-related tasks.

Moreover, our study was a case study primarily focusing on the prompt engineering approach. This approach aimed to optimize instructions or prompts to effectively communicate with ChatGPT to analyze public feedback. Though this approach offers a feasible path for enhancing ChatGPT’s performance, it is important to recognize that certain inherent limitations linked to the model’s development may still exist. Thus, the study’s findings should be interpreted considering these inherent limitations, despite our rigorous efforts to compare and validate results with those produced by professional planners. Future studies could explore the development of domain-specific and locally informed LLMs to probe public feedback more effectively from various contexts and localities.

Last, we acknowledge that ChatGPT, like other AI techniques and human evaluations, possesses biases, some of which may be unintentional. Such biases are inherent to the ChatGPT model based on its training data, and the nature and degree of these biases remain unclear. It is likely that ChatGPT may prefer certain languages or dialects, pay more attention to specific demographic groups, and emphasize certain topics or viewpoints. Researchers and practitioners planning to use such tools should always be aware of these potential biases. Future research should explore the sources and types of these biases and develop strategies to mitigate them.

Benchmarking and Prompt Engineering

Overview

ChatGPT’s overall performance varied considerably across different tasks in this research, aligning with existing literature on ChatGPT’s performance in other application fields (Zhong et al., Citation2023). We provide an overview of the benchmarking results in Table 2 and delve into the nuanced differences for each task in subsequent sections.

Table 2. Overall benchmarking results for both zero-shot and few-shot prompting using ChatGPT.

In zero-shot prompting scenarios when benchmarked against planners, ChatGPT generally produced consistent results across various tasks. These included identifying political stance with a convergence rate of 71.2% and summarizing reasons and decisions sought at 84.6% and 79.2%, respectively. In a direct comparison of ChatGPT-generated and planner-generated results against the actual feedback, ChatGPT showed strong performance in producing desired outputs. It was reasonably effective in deriving political stances (81.7%), summarizing reasons (87.3%), and decisions sought (85.8%). Furthermore, using few-shot prompting did not yield significantly better results, showing only a marginal improvement compared with zero-shot prompting. This suggested that for simple reasoning and summarization tasks, zero-shot prompting was sufficiently effective.

When benchmarked against planners, ChatGPT demonstrated greater accuracy in identifying stances but was less precise in summarization tasks. However, considering time and costs, ChatGPT has shown substantial promise in efficiently generating initial baseline results for further review and revision by planners. Specifically, it took us a total of 4.2 hr and US$13 to produce zero-shot prompting results for comparison with the planners, including time for preparation and code execution. According to the planning manager at HCC, the manual process involved “3 to 4 weeks of entering the original submissions, summarizing them, and reviewing the summaries, with six planners entering and summarizing, and three reviewing.” By rough estimation, using ChatGPT could potentially save between 89% and 93% of the time spent by planners in processing this public feedback data set, including the research team’s review time. These significant time and cost savings become even more compelling as the data set size increases.

In benchmarking against the two commonly used NLP techniques (LDA topic modeling and lexicon-based sentiment analysis), ChatGPT’s performance once again varied greatly. However, when evaluating the accuracy of the results against the actual feedback, ChatGPT significantly outperformed the two conventional NLP techniques in accurately identifying associated sentiments and discussed topics.

For example, in the zero-shot prompting scenario for sentiment analysis, ChatGPT’s outputs coincided with the results generated by the corresponding NLP technique only 49.9% of the time, but ChatGPT accurately identified the associated sentiments 94.1% of the time, compared with the 55.8% accuracy of the conventional lexicon-based sentiment analysis approach. This empirically reinforced the recurring criticism in the big data literature about relying on the commonly used lexicon-based sentiment analysis to predict actual sentiments or emotions (Khoo & Johnkhan, Citation2018). When employing few-shot prompting for sentiment analysis, ChatGPT’s accuracy increased by 3.1 percentage points to 97.2%. We contend that this is because analyzing sentiment was quite similar to analyzing political stance, which also showed high accuracy rates using ChatGPT with simple zero-shot prompts.
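To clarify how these two figures differ, the sketch below computes a convergence rate (agreement between ChatGPT’s and VADER’s labels) and accuracy rates (agreement with the human-validated label) from a results table; the column names and values are hypothetical.

```python
# Convergence vs. accuracy (sketch; column names and values are hypothetical).
import pandas as pd

results = pd.DataFrame({
    "chatgpt_sentiment": ["positive", "negative", "neutral", "positive"],
    "vader_sentiment":   ["negative", "negative", "positive", "positive"],
    "human_sentiment":   ["positive", "negative", "neutral", "neutral"],
})

convergence = (results["chatgpt_sentiment"] == results["vader_sentiment"]).mean()
chatgpt_accuracy = (results["chatgpt_sentiment"] == results["human_sentiment"]).mean()
vader_accuracy = (results["vader_sentiment"] == results["human_sentiment"]).mean()
print(f"Convergence: {convergence:.1%}; ChatGPT accuracy: {chatgpt_accuracy:.1%}; "
      f"VADER accuracy: {vader_accuracy:.1%}")
```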

The benchmarking results for topic identification presented a stark contrast to earlier findings. Despite a moderate convergence rate of 56.0%, both ChatGPT (zero-shot; accuracy 64.2%) and the LDA topic modeling method (42.6%) struggled to accurately identify the associated topics for each of the individual public submissions. Remarkably, both methods provided completely incorrect answers approximately a quarter of the time (24.6%). Despite that, ChatGPT still significantly outperformed the LDA topic modeling method in identifying specific topics relevant to individual submission texts, by more than 20 percentage points. This indicated that although both ChatGPT and topic modeling could reasonably identify key topics from the entire data set, their accuracy diminished significantly at the individual submission level, especially for the LDA topic modeling. With few-shot prompting, ChatGPT’s performance in topic identification improved markedly to 87.3%, an increase of 23.1 percentage points. This improvement was primarily achieved by reducing the rate of incorrect answers from 24.6% to 10.3%. This substantial increase suggested that for more complex tasks like topic identification, particularly those involving specific planning terminologies, few-shot prompting was an effective strategy for significantly enhancing ChatGPT’s performance.

In the following sections, we delve deeper into these differences by examining each task individually. We aim to further uncover the types and sources of discrepancies and provide insights into how these errors can be mitigated in future research and practice.

Summarization

Based on our scrutiny of the outputs, ChatGPT’s errors in summarization can be mainly categorized into four types, as exemplified in Table 3. Summaries generated by ChatGPT can be overgeneralized and therefore miss nuanced details in the text provided. Specifically, in the representative example, the submitter provided details regarding the zoning ordinance and expressed the need to delete the relevant terrace housing requirement. The planner’s summary not only covered the specific point on terrace housing (townhouse) but also provided their professional rationale as derived from the feedback text. In contrast, ChatGPT’s summary was too generic, generalizing to the different housing types quoted from the zoning ordinance rather than reflecting the submitter’s strong emphasis on terrace housing (townhouse). Similarly, ChatGPT can also omit key information. In the chosen example, the submitter expressed support for the identification of the Industrial Zone on the existing planning maps, which was explicitly identified by the planner but was only referred to as the current zoning by ChatGPT.

Table 3. Representative examples of ChatGPT errors for summarization.

Furthermore, ChatGPT sometimes focused too much on the procedural aspects of the feedback rather than summarizing the feedback against a rationale as planners did. For example, planners were able to evaluate the submitter’s feedback against the overarching plan change context and then summarize that the submitter was expressing a lack of consideration for space in the existing densification policy framework, whereas ChatGPT merely summarized the surface meaning of the feedback text (Table 3).

Finally, for short feedback often lacking obvious reasons, ChatGPT would suggest that no reason was given. In such cases, planners usually repeated the request as the reason given. The same was true for many of the decision-sought summaries. The inconsistencies between ChatGPT and planners in many cases stemmed from the varying logic in summarization, given that this task was completed by numerous individual planners. For instance, for similar feedback without an obvious reason or decision sought, some planners pasted the request into the summary, whereas others left it blank or entered “null” as a response. In addition, as demonstrated by the above examples, we could hardly conclude that ChatGPT’s responses were completely wrong. In fact, in most cases, they were just not good enough or unable to meet our expectations with respect to the level of detail given, references to the planning context, or the presentation in a desirable format. Though few-shot prompting seemed to marginally enhance the outputs, this strategy appeared ineffective overall. We argue that this is because incorporating a training data set within the prompt has limited impact on the model’s behavior, in contrast to fine-tuning, which, in theory, could significantly alter the model and thereby its performance. Future research could investigate whether a fine-tuned model would show significant improvements in summarization tasks.

Reasoning

ChatGPT has also shown strong analytical and reasoning capabilities in identifying the relevant political stance as well as the sentiment reflected by the feedback text. When compared with planners, ChatGPT often wrongly identified the political stance by failing to interpret the feedback against the broader planning context, misunderstanding the meaning of the text, or simplifying multifaceted, complex feedback into a binary answer (Table 4).

Table 4. Representative examples of errors for reasoning.

Specifically, in the first two examples, both submitters were expressing their feedback in a negative sense, including words and phrases like removal, concerned, line the pockets, and not to care for our citizens. ChatGPT reported both submissions as opposing the plan change, but, in fact, both submissions were referring to the actual proposed changes in PC12. Thus, the planners who were familiar with the plan change could easily identify that these submissions were supporting the plan change. Such judgment was not obvious from examining the feedback alone. Similarly, without background knowledge, feedback written in a positive voice could also lead ChatGPT to the incorrect conclusion about the political stance. For example, in the first example of misinterpretation, the submitter expressed concerns about landscaping needs and the potential adverse impacts of planting trees near buildings, but they did so in a positive voice, thereby leading ChatGPT to identify this feedback as supporting the plan change.

However, there were instances where ChatGPT correctly analyzed the stance of feedback, even when planners were wrong. This often occurred in cases where planners labeled the stance as null while the actual feedback displayed a somewhat polarized sentiment that ChatGPT successfully identified. Furthermore, in a few instances, the feedback was complex and nonbinary. This made the identification of binary political stance challenging and less certain, such as the last example where the submitter expressed concerns about and support for density at the same time. In such cases, either supporting the densification concept or opposing the council’s proposed plan change could be considered correct.

When benchmarked against the commonly used lexicon-based sentiment analysis, ChatGPT significantly outperformed it in correctly identifying sentiment. As illustrated in Table 4, lexicon-based sentiment analysis often mistakenly labeled positively toned feedback as negative, whereas ChatGPT usually identified the correct sentiment. It did, however, occasionally mislabel neutral sentiments as positive. These findings suggest that ChatGPT is superior for precise sentiment analysis, particularly when analyzing text written in a professional and often positive tone. However, ChatGPT is both computationally and monetarily expensive. For smaller data sets, it remains feasible. For example, our data set of 1,982 individual submissions totaling around 319,000 words took ChatGPT 1.1 hr to process, whereas the lexicon-based sentiment analysis took just 32 s. Therefore, in its current state, ChatGPT is not ideal for analyzing large-scale data such as those found on social media. However, for analyzing public feedback data sets collected by governments, typically containing thousands to tens of thousands of entries, ChatGPT has proved to be a superior tool.

Topic Identification

For topic identification, both ChatGPT and the LDA topic model were prone to errors due to their respective methodological issues. For instance, LDA relies on probabilistic distributions, automatically assigning a probability of belonging to various topics to each document, which can result in inaccuracies. This problem is evident when LDA assigns equal probability across topics for simple feedback texts like “make sense.” In addition, topic modeling usually involves subjective human judgment in determining the number of topics and interpreting the resultant outputs, which may further contribute to potential errors.

For ChatGPT, misinterpretations of the questions asked were the main source of errors in topic identification, because it could sometimes produce incorrect or irrelevant outputs due to misunderstanding or hallucinated responses. As a language model, it predicted word sequences based on its training, without truly understanding the text. An example of this, shown in Table 5, included the incorrect topics infrastructure capacity and urban design. Such errors can be corrected by using few-shot prompting.

Table 5. Representative examples of errors for topic identification.

Few-shot prompting proved to be highly effective for topic identification. Providing detailed instructions, including planning terminologies and examples, enhanced ChatGPT’s comprehension, leading to more accurate responses. Although few-shot prompting had limited effectiveness in altering ChatGPT’s underlying behavior, it proved beneficial for clarifying specific queries, especially those involving context-dependent terminologies and jargon. When the accurate labeling of individual submissions by relevant topics is essential, ChatGPT with few-shot prompting strategies is recommended. Like sentiment analysis, using ChatGPT to identify topics requires significant time; it took 5.5 hr to process the entire data set. In contrast, LDA topic modeling was completed in less than 1 min. Though LDA is more practical for larger data sets, ChatGPT excels at deriving high-level topics and accurately labeling individual feedback with associated topics.

Takeaway for Planners

Large language models like ChatGPT hold the potential to revolutionize how planners analyze public feedback data, especially for large jurisdictions that need to process increasing volumes and frequencies of such data. Based on our analysis, we found that ChatGPT, with both zero-shot and few-shot prompting, could assist planners in automating the summarization and analysis of substantial volumes of public submissions. From this, we derived the following key takeaways for cutting-edge planners:

  1. ChatGPT can significantly save planners time in processing textual data, but machine-generated results are not perfect. ChatGPT shows impressive capabilities in summarizing public feedback, offering planners an efficient tool to process textual data in bulk. It can be leveraged for initial analysis of public feedback data, aiding in tasks such as identifying decisions sought, summarizing reasons, and evaluating political stances and sentiments. Though it may not replace human expertise, it can significantly reduce the time and cost associated with processing large volumes of public submissions. Planners should be aware of ChatGPT’s limitations, including sensitivity to prompt phrasing, inherent biases from training data, tendency to overgeneralize, and missing nuanced details, among others. Though ChatGPT offers quick and valuable insights, planners should always remain skeptical and interpret its results in the context of these limitations and supplement them with human review and expertise.

  2. Be specific in the prompt and test iteratively. Provide clear and specific instructions with relevant context in the initial prompt, such as specifying the role (acting as an urban planner), the event (analyzing public submissions regarding a plan change), the data (submitters supporting the plan change and their reasons), and the task (identifying reasons for submission). Before applying the prompt to the entire data set, iteratively test and refine it on a small sample. This testing helps ensure that the responses align with expectations. This process can also reduce—but not eliminate—errors. It is important to note that what works in initial testing might not hold true for the entire data set, underscoring the importance of human review for accuracy. Planners can refer to our example prompts in the Technical Appendix for guidance.

  3. Zero-shot prompts are sufficient for most simple tasks. For simpler tasks, such as summarization and sentiment or political analysis, zero-shot prompting is usually sufficient. This is due to the balance between the effort needed to generate a training data set and the marginal improvements observed. For more sophisticated tasks, it is recommended to use few-shot prompts by providing examples for ChatGPT as references to avoid misunderstanding and therefore generate the desired outcomes. This approach is particularly useful when dealing with planning-specific terminologies or jargon in the prompt. Using few-shot prompting, which involves defining terminologies and supplementing them with examples, can help ChatGPT better understand the context and therefore enhance its performance.

  4. Prompt engineering is highly contextual and not a one-size-fits-all strategy. Though useful, planners should always experiment with various prompt-engineering tactics to find the right combination that works well in a specific case. In addition to the above high-level tactics (i.e., be specific, provide context, test iteratively, and use few-shot prompts for complex tasks), other effective prompt-engineering strategies may include a) using delimiters (e.g., triple quotation marks) to indicate distinct parts of the input, such as the actual feedback text provided in our case; b) splitting multiple, complex tasks into a sequence of simpler subtasks, such as asking ChatGPT to analyze the feedback data and generate the answers for reasons, decision sought, and political stance in sequence; and c) clearly specifying the desired outputs, including their length and/or format (a short illustrative sketch follows this list). For example, for direct comparison with planners’ summaries, we deliberately asked ChatGPT to generate the outputs without any explanation and to be as concise as possible, but in other circumstances planners might need ChatGPT to explain and identify the relevant references in the given text to enable fact-checking of its answers. Finally, it should be noted that prompt engineering is not a panacea and has limits: It can often improve performance, but only to a certain degree, because prompt engineering only helps the model better understand our queries; it does not alter the model’s behavior.
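As a hedged illustration of several of these tactics combined (role, delimiters, sequenced subtasks, and an explicit output format), the sketch below shows one possible prompt; the wording is ours and should be iteratively tested rather than treated as the study’s exact prompt.

```python
# Illustrative prompt applying the tactics above (wording is an assumption, not the study's prompt).
feedback_text = "Example submission text goes here."  # hypothetical placeholder

prompt = f'''You are an urban planner analyzing public submissions on a proposed plan change.
The submission is delimited by triple quotes.

Step 1: Summarize the reasons given by the submitter (one sentence).
Step 2: Summarize the decision sought by the submitter (one sentence).
Step 3: State the political stance (support, oppose, or unspecified).

Reply as a JSON object with the keys "reasons", "decision_sought", and "stance".
Be as concise as possible and do not add explanations.

"""{feedback_text}"""'''
```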

Conclusion

We examined ChatGPT’s potential to assist urban planners in analyzing public feedback by benchmarking its performance and establishing best practices. Specifically, we derived practical examples and guidance for planners on effectively using this tool. We tested ChatGPT’s capabilities, highlighting its success in identifying political stances, summarizing reasons, capturing decisions, and analyzing sentiment with high accuracy, despite some limitations like overgeneralization and missing key information. Though zero-shot prompting was sufficient for most simple tasks, few-shot prompting could enhance accuracy for complex tasks like topic identification that involve planning terminologies. In the future, planners can leverage ChatGPT to automate the summarization of large volumes of public feedback, categorize submissions by topics and sentiments, and generate rapid insights from the data. This ChatGPT-aided approach has the potential to streamline the planning process and improve the information base for public engagement, ultimately contributing to a more informed and data-driven urban planning process. However, though ChatGPT is a powerful tool, planners should remain vigilant and fact-check results. It should always be noted that NLP techniques, including ChatGPT, have their limitations, particularly in handling nuanced language and context, and should be used as tools to assist human decision making rather than replace it, at least for now. When used inappropriately and without a planner’s review, NLP tools can result in biased analyses of public feedback. This may lead to the omission of certain voices, particularly those from underrepresented groups, thereby exacerbating urban inequality in the planning processes. Future research should continue exploring and refining AI applications in planning, maintaining a human-in-the-loop approach as technology evolves.

Supplemental material

Technical Appendix


ACKNOWLEDGEMENTS

We thank Teresa Thornton and Chao Li from the University of Waikato Environmental Planning Program for their assistance in this research. They assisted us in evaluating the quality of machine- and human-generated responses. We would also like to thank JAPA Editor Ann Forsyth and the three anonymous reviewers for their invaluable time and suggestions, which have significantly helped improve this article. All opinions expressed and any errors present are solely the responsibility of the authors.

Supplemental Material

Supplemental data for this article can be accessed online at https://doi.org/10.1080/01944363.2024.2309259.

Additional information

Notes on contributors

Xinyu Fu

XINYU FU, AICP ([email protected]) is a senior lecturer of environmental planning at the University of Waikato.

Thomas W. Sanchez

THOMAS W. SANCHEZ, AICP ([email protected]) is a professor of urban planning at Texas A&M University.

Chaosu Li

CHAOSU LI ([email protected]) is an assistant professor of urban planning at the Urban Governance and Design Thrust of the Hong Kong University of Science and Technology (Guangzhou) and an affiliate assistant professor at the Division of Public Policy of the Hong Kong University of Science and Technology.

Juliana Reu Junqueira

JULIANA REU JUNQUEIRA ([email protected]) is the urban and spatial planning team lead at Hamilton City Council.

References

  • Aigwi, I. E., Egbelakin, T., Ingham, J., Phipps, R., Rotimi, J., & Filippova, O. (2019). A performance-based framework to prioritise underutilised historical buildings for adaptive reuse interventions in New Zealand. Sustainable Cities and Society, 48, 101547. https://doi.org/10.1016/j.scs.2019.101547
  • Antaki, F., Touma, S., Milad, D., El-Khoury, J., & Duval, R. (2023). Evaluating the performance of ChatGPT in ophthalmology: An analysis of its successes and shortcomings. Ophthalmology Science, 3(4), 100324. https://doi.org/10.1016/j.xops.2023.100324
  • Arnstein, S. R. (1969). A ladder of citizen participation. Journal of the American Institute of Planners, 35(4), 216–224. https://doi.org/10.1080/01944366908977225
  • Bhayana, R., Krishna, S., & Bleakney, R. R. (2023). Performance of ChatGPT on a radiology board-style examination: Insights into current strengths and limitations. Radiology, 307(5), e230582. https://doi.org/10.1148/radiol.230582
  • Brinkley, C., & Stahmer, C. (2021). What is in a plan? Using natural language processing to read 461 California city general plans. Journal of Planning Education and Research. Advance online publication. https://doi.org/10.1177/0739456X21995890
  • Cai, M. (2021). Natural language processing for urban research: A systematic review. Heliyon, 7(3), e06322. https://doi.org/10.1016/j.heliyon.2021.e06322
  • Clark, J. K. (2021). Public values and public participation: A case of collaborative governance of a planning process. The American Review of Public Administration, 51(3), 199–212. https://doi.org/10.1177/0275074020956397
  • Davidoff, P. (1965). Advocacy and pluralism in planning. Journal of the American Institute of Planners, 31(4), 331–338. https://doi.org/10.1080/01944366508978187
  • Duong, D., & Solomon, B. D. (2023). Analysis of large-language model versus human performance for genetics questions. European Journal of Human Genetics. Advance online publication. https://doi.org/10.1038/s41431-023-01396-8
  • Ertiö, T. P., & Bhagwatwar, A. (2017). Citizens as planners: Harnessing information and values from the bottom-up. International Journal of Information Management, 37(3), 111–113. https://doi.org/10.1016/j.ijinfomgt.2017.01.001
  • Foroughi, M., de Andrade, B., Roders, A. P., & Wang, T. (2023). Public participation and consensus-building in urban planning from the lens of heritage planning: A systematic literature review. Cities, 135, 104235. https://doi.org/10.1016/j.cities.2023.104235
  • Fredericks, J., & Foth, M. (2013). Augmenting public participation: Enhancing planning outcomes through the use of social media and web 2.0. Australian Planner, 50(3), 244–256. https://doi.org/10.1080/07293682.2012.748083
  • Fu, X. (2024). Natural language processing in urban planning: A research agenda. Journal of Planning Literature. Advance online publication. https://doi.org/10.1177/088541222412295
  • Fu, X., Li, C., & Zhai, W. (2023). Using natural language processing to read plans: A study of 78 resilience plans from the 100 resilient cities network. Journal of the American Planning Association, 89(1), 107–119. https://doi.org/10.1080/01944363.2022.2038659
  • Fu, X., Wang, R., & Li, C. (2023). Can ChatGPT evaluate plans? Journal of the American Planning Association. Advance online publication. https://doi.org/10.1080/01944363.2023.2271893
  • Fung, A., & Wright, E. O. (2001). Deepening democracy: Innovations in empowered participatory governance. Politics & Society, 29(1), 5–41. https://doi.org/10.1177/0032329201029001002
  • Haklay, M., Jankowski, P., & Zwoliński, Z. (2018). Selected modern methods and tools for public participation in urban planning–a review. Quaestiones Geographicae, 37(3), 127–149. https://doi.org/10.2478/quageo-2018-0030
  • Hamilton City Council. (2023). Plan change 12–Enabling housing supply. https://haveyoursay.hamilton.govt.nz/city-planning/planchange12/
  • Hamilton City Council. (2024). Plan change 12–Enabling housing supply. https://hamilton.govt.nz/property-rates-and-building/district-plan/plan-changes/plan-change-12/
  • Han, A. T., & Laurian, L. (2023). Tracking plan implementation using elected officials’ social media communications and votes. Environment and Planning B: Urban Analytics and City Science, 50(2), 416–433. https://doi.org/10.1177/23998083221118003
  • Han, A. T., Laurian, L., & Dewald, J. (2021). Plans versus political priorities: Lessons from municipal election candidates’ social media communications. Journal of the American Planning Association, 87(2), 211–227. https://doi.org/10.1080/01944363.2020.1831401
  • Healey, P. (1997). Collaborative planning: Shaping places in fragmented societies. UBC Press.
  • Huai, S., Liu, S., Zheng, T., & Van de Voorde, T. (2023). Are social media data and survey data consistent in measuring park visitation, park satisfaction, and their influencing factors? A case study in Shanghai. Urban Forestry & Urban Greening, 81, 127869. https://doi.org/10.1016/j.ufug.2023.127869
  • Hutto, C., & Gilbert, E. (2014). Vader: A parsimonious rule-based model for sentiment analysis of social media text. Proceedings of the International AAAI Conference on Web and Social Media, 8(1), 216–225. https://doi.org/10.1609/icwsm.v8i1.14550
  • Innes, J. E., & Booher, D. E. (2004). Reframing public participation: Strategies for the 21st century. Planning Theory & Practice, 5(4), 419–436. https://doi.org/10.1080/1464935042000293170
  • Khoo, C. S., & Johnkhan, S. B. (2018). Lexicon-based sentiment analysis: Comparative evaluation of six sentiment lexicons. Journal of Information Science, 44(4), 491–511. https://doi.org/10.1177/0165551517703514
  • Khurana, D., Koli, A., Khatter, K., & Singh, S. (2023). Natural language processing: State of the art, current trends and challenges. Multimedia Tools and Applications, 82(3), 3713–3744. https://doi.org/10.1007/s11042-022-13428-4
  • Kim, B., Yoo, M., Park, K. C., Lee, K. R., & Kim, J. H. (2021). A value of civic voices for smart city: A big data analysis of civic queries posed by Seoul citizens. Cities, 108, 102941. https://doi.org/10.1016/j.cities.2020.102941
  • Kong, L., Liu, Z., Pan, X., Wang, Y., Guo, X., & Wu, J. (2022). How do different types and landscape attributes of urban parks affect visitors’ positive emotions? Landscape and Urban Planning, 226, 104482. https://doi.org/10.1016/j.landurbplan.2022.104482
  • Krizek, K., Forysth, A., & Slotterback, C. S. (2009). Is there a role for evidence-based practice in urban planning and policy? Planning Theory & Practice, 10(4), 459–478. https://doi.org/10.1080/14649350903417241
  • Laskey, A. B., & Nicholls, W. (2019). Jumping off the ladder: Participation and insurgency in Detroit’s urban planning. Journal of the American Planning Association, 85(3), 348–362. https://doi.org/10.1080/01944363.2019.1618729
  • Lock, O., & Pettit, C. (2020). Social media as passive geo-participation in transportation planning–how effective are topic modeling & sentiment analysis in comparison with citizen surveys? Geo-Spatial Information Science, 23(4), 275–292. https://doi.org/10.1080/10095020.2020.1815596
  • Ministry for the Environment (MfE). (2021a). Building competitive cities: Reform of the urban and infrastructure planning system - A technical working paper. https://environment.govt.nz/publications/building-competitive-cities-reform-of-the-urban-and-infrastructure-planning-system-a-technical-working-paper/
  • Ministry for the Environment (MfE). (2021b). Resource management (enabling housing supply and other matters) amendment act 2021. https://www.legislation.govt.nz/act/public/2021/0059/latest/LMS566049.html
  • Ministry for the Environment (MfE). (2022). National policy statement on urban development 2020. https://environment.govt.nz/assets/publications/National-Policy-Statement-Urban-Development-2020-11May2022-v2.pdf
  • Mleczko, M., & Desmond, M. (2023). Using natural language processing to construct a National Zoning and Land Use Database. Urban Studies, 60(13), 2564–2584. https://doi.org/10.1177/00420980231156352
  • OpenAI. (2023). New models and developer products announced at DevDay. https://openai.com/blog/new-models-and-developer-products-announced-at-devday
  • Quick, K. S., & Feldman, M. S. (2011). Distinguishing participation and inclusion. Journal of Planning Education and Research, 31(3), 272–290. https://doi.org/10.1177/0739456X11410979
  • Ramage, D., Rosen, E., Chuang, J., Manning, C. D., & McFarland, D. A. (2009). Topic modeling for the social sciences. NIPS 2009 Workshop on Applications for Topic Models: Text and beyond, 5(27), 1–4.
  • Ray, P. P. (2023). ChatGPT: A comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope. Internet of Things and Cyber-Physical Systems, 3, 121–154. https://doi.org/10.1016/j.iotcps.2023.04.003
  • Sanchez, T. W. (2023). Planning with artificial intelligence. PAS Report 604. American Planning Association. https://www.planning.org/publications/report/9270237/
  • Sanchez, T. W., & Brenman, M. (2013). Public participation, social equity, and technology in urban governance. In C. N. Silva (Ed.), Citizen e-participation in urban governance: Crowdsourcing and collaborative creativity (pp. 35–48). IGI Global.
  • Shipley, R., & Utz, S. (2012). Making it count: A review of the value and techniques for public consultation. Journal of Planning Literature, 27(1), 22–42. https://doi.org/10.1177/0885412211413133
  • Surameery, N. M. S., & Shakor, M. Y. (2023). Use Chat GPT to solve programming bugs. International Journal of Information Technology and Computer Engineering, 3(31), 17–22. https://doi.org/10.55529/ijitc.31.17.22
  • van der Hoeven, A. (2020). Valuing urban heritage through participatory heritage websites: Citizen perceptions of historic urban landscapes. Space and Culture, 23(2), 129–148. https://doi.org/10.1177/1206331218797038
  • Wilson, A., Tewdwr-Jones, M., & Comber, R. (2019). Urban planning, public participation and digital technology: App development as a method of generating citizen involvement in local planning processes. Environment and Planning B: Urban Analytics and City Science, 46(2), 286–302. https://doi.org/10.1177/2399808317712515
  • Wu, T., He, S., Liu, J., Sun, S., Liu, K., Han, Q. L., & Tang, Y. (2023). A brief overview of ChatGPT: The history, status quo and potential future development. IEEE/CAA Journal of Automatica Sinica, 10(5), 1122–1136. https://doi.org/10.1109/JAS.2023.123618
  • Yeo, Y. H., Samaan, J. S., Ng, W. H., Ting, P. S., Trivedi, H., Vipani, A., Ayoub, W., Yang, J. D., Liran, O., Spiegel, B., & Kuo, A. (2023). Assessing the performance of ChatGPT in answering questions regarding cirrhosis and hepatocellular carcinoma. Clinical and Molecular Hepatology, 29, 721–732.
  • Yu, T., Liang, X., Shen, G. Q., Shi, Q., & Wang, G. (2019). An optimization model for managing stakeholder conflicts in urban redevelopment projects in China. Journal of Cleaner Production, 212, 537–547. https://doi.org/10.1016/j.jclepro.2018.12.071
  • Zhai, W., Peng, Z. R., & Yuan, F. (2020). Examine the effects of neighborhood equity on disaster situational awareness: Harness machine learning and geotagged Twitter data. International Journal of Disaster Risk Reduction, 48, 101611. https://doi.org/10.1016/j.ijdrr.2020.101611
  • Zhong, Q., Ding, L., Liu, J., Du, B., & Tao, D. (2023). Can ChatGPT understand too? A comparative study on ChatGPT and fine-tuned Bert. arXiv preprint arXiv:2302.10198.