
Bridging ethnography and AI: a reciprocal methodology for studying visual political action

Vasileios Maltezos, Eeva Luhtakallio & Taina Meriluoto
Received 14 Jun 2023, Accepted 24 Feb 2024, Published online: 25 Mar 2024

ABSTRACT

This article proposes a methodological approach to address the analytical challenge of meaningfully studying visual politics in the current abundance of online image data. It develops a novel methodological process of bridging ethnography and computational methods for analysing visual data in a manner that avoids mixing and blurring the boundaries of the two methods while establishing a continuous interchange between them. The methodological development enables investigating how youth participate through image creation and usage, both online and offline, by outlining repertoires of visual political action. We argue that combining ethnographic research with supervised deep learning-based AI and pre-trained neural networks allows for systematically analysing large image datasets while maintaining interpretative, analytical perspectives and contextual sensitivity. Our results show that the existence of common visual features in images, the re-evaluation of the image categories and the creation of subcategories constitute key aspects for an ethnographically informed image classification.

Introduction

The public sphere is increasingly dominated by visual content: with high-quality cameras on mobile phones, the practices of taking, posting, sharing, commenting, liking, and talking about images are an increasingly prevalent form of acting publicly.Footnote1 The visuality of today’s public sphere is most evident on social media platforms such as Facebook, Twitter, Instagram and Snapchat, which are all primarily visual media. As a result, social media use has become, as Hand (Citation2020, p. 311) notes, ‘a matter of visual communication’.

This visual communication turn has significant consequences for public action and democracy, especially for today’s youth (Hankey & Tuszynski, Citation2017; Kuntsman, Citation2017; Luhtakallio et al., Citation2024). Prior research has highlighted the political character of memes (Shifman, Citation2013) and selfies (Caldeira et al., Citation2021; Hardesty et al., Citation2019; Kuntsman, Citation2017; Meriluoto, Citation2023; Tiidenberg & Gómez Cruz, Citation2015), emphasising the need for scholarly engagement with visual political action. Such engagement has a strong methodological underpinning. As semiotic signs, images are multi-layered by their very essence and invoke multiple interpretations, and are thus harder to code and analyse in a systematic manner than words and text (e.g. Dunleavy, Citation2020; Harper, Citation2005; Pauwels, Citation2010). The challenges of their analysis are multiplied when facing the flood of images in today’s visual communication: not only is the systematic analysis of images labour intensive but the numbers of images potentially included in a dataset are near infinite (Hankey & Tuszynski, Citation2017). Thus, the questions of selecting visual data, on the one hand, and addressing big data, on the other hand, are of crucial importance.

At this juncture, the impact of computational methods on complex problem-solving becomes increasingly relevant. These methods, leveraging algorithmic processes and mathematical models, have dramatically enhanced our capacity to analyse and interpret large datasets, yielding insights far beyond the scope of traditional techniques (Chen & Guestrin, Citation2016, p. 785). A significant subset of these methods is deep learning, which has attracted interest for its ability to model high-level abstractions in data through intricate, multi-layered processing structures (Goodfellow et al., Citation2016, p. 78). Deep learning, as an advanced form of computational analysis, draws inspiration from the neural networks of the human brain. These networks excel in managing tasks such as image and speech recognition, outperforming many conventional computational methods (LeCun et al., Citation2015, p. 436). Their ability to discern complex patterns within vast amounts of visual data presents an opportunity to address the analytical challenges posed by the digital age’s visual data deluge.

This article proposes a methodological approach to address the analytical challenge of meaningfully studying visual politics in the current abundance of online image data. It presents the work of methods-bridging we have carried out in the analysis of visual data, which was conducted in an ongoing project investigating visual political action among European youth (Luhtakallio, Citation2018; Luhtakallio et al., Citation2024). The methodological combination of ethnography and supervised machine learning was set up to answer research questions pertaining to the visual political action of young Europeans: 1) How do the youth participate by taking/creating images? 2) How do the youth participate with and through images (online and offline)? Ultimately, the project’s aim is to outline repertoires of visual political action, and the goal of the methodological development is to enable responding to these questions.

The methodological tool we have developed is based on the bridging of ethnographic research and a supervised deep learning-based AI approach, enhanced by a supplementary classification by pre-trained neural networks. The latter two methods are individually well established in data analysis but rarely combined and, to the best of our knowledge, have never been connected to an ethnographic approach (Webb Williams et al., Citation2020, pp. 4, 19). This methodological combination, we argue, allows for analysing large datasets of images in a systematic manner while retaining an interpretative analytical gaze and the sensitivity to context vital for a meaningful analysis of visual data.

In this article, we present a model of bridging ethnography and computational methods for analysing visual data with the aim of valuing their difference and building a continuous back-and-forth between the different modes of analysis. Thus, bridging here means keeping the two distinctly apart while enabling a methodological cross-feeding; not mixing and blurring the boundaries of the two methods or glossing over their epistemological differences. In sum, keeping the best of both worlds and exposing them to one another to open a new analytical sphere. The resulting model is a product of ongoing pioneering work marked by trial and error, as there were few previous examples to follow. To enable as wide a pool of experiments as possible in the future, we provide a step-by-step description of the process and a detailed presentation of the challenges and misunderstandings we encountered and the compromises we made, as well as the successes and future promises of the methodological combination we have identified. Throughout this description, we provide an account of the key differences between ethnographic and computational approaches in analysing visual data and how these differences can be turned into a powerful joint tool in understanding visual political action. In the following section, we provide a detailed description of the process of the methodological development and the learning curve experienced.

Combining ethnography and big data – lessons from the past

The interplay between ethnography and big data in social sciences has grown into a significant methodological trend, aiming to combine the strengths of both approaches to yield a comprehensive ‘thick picture’ (Charles & Gherman, Citation2019; Evans, Citation2016; Laaksonen et al., Citation2017; Seo et al., Citation2022). This trend has manifested in various forms, such as virtual ethnography, online ethnography, or netnography (Hine, Citation2000; Kozinets, Citation2002, Citation2010; Gómez-Cruz & San Cornelio, 2018), which utilise participant observation-based research to explore computer-mediated communications and the visual materials produced by these communities.

In the digital realm, the semiotics of online images, especially those associated with activism, have gained prominence due to their transformative role in society. The internet reshapes both discourse and societal structures, affecting how activist images are produced, circulated, and received (Scollon & Scollon, Citation2004). Scollon and Scollon (Citation2007) also highlight the importance of focusing ethnography on action and its nexus analysis in understanding the dynamics in such visual discourses. Similarly, Noy (Citation2020) explores the authenticity of materials in digital spaces, emphasising how the medium itself plays a crucial role in shaping the reception and interpretation of visual materials.

Building on this need for a holistic approach, Laaksonen et al. (Citation2017) suggest a ‘big data – augmented online ethnography’ for understanding electoral candidates’ public action. Their main argument for a mixed-method approach on this topic is that ‘ethnographic observations can be used to contextualise the computational analysis of large data sets, while computational analysis can be applied to validate and generalise the findings made through ethnography’ (Laaksonen et al., Citation2017, p. 111).

Our research into the visual political action of youth utilises this integrative approach. By collecting extensive data, we seek to outline the broader characteristics of these actions and categorise them into discernible trends. Simultaneously, our ethnographic fieldwork aims to uncover the nuanced contexts and everyday practices that underlie these actions (Junnilainen & Luhtakallio, Citation2015; Luhtakallio & Meriluoto, Citation2022). We argue that virtual observations alone cannot capture the full context in which images acquire meaning; hence, a tangible presence in the field is essential.

This presence not only informs the algorithm’s training but also enhances the validity and reliability of the computational analysis, though with a critical eye on the biases, presuppositions, and selective interpretations intrinsic to ethnographic methods. The language of validity and reliability is used here with restraint, acknowledging the challenging crossover from nuanced human insights to quantifiable data.

Computational methods, in turn, offer valuable tools for classifying data based on ethnographic insights. However, instead of merely serving as a validation process for generalising ethnographic observations, we point to possibilities where computational analysis may also feed into guiding future ethnographic research by suggesting categories of data that are ‘bubbling under’ and potentially, phenomena that are about to emerge.

In sum, our methodological solution highlights the appreciation of the distinct value of both approaches and their epistemological commitments. From an ethnographic point of view, a computational analysis should not only add ‘a quantitative quality’ to the analysis but also contribute to the ethnographic work in a meaningful way. Inversely, for computational analysis, ethnography not only brings in contextual insights but also enhances the accuracy and validity of computational analysis through better justified data collection, coding and reading of the results of analyses. Nonetheless, any claims of increased validity must again be considered within the scope of potential biases and the selective nature of meaning-making inherent in qualitative research.

Bridging the methods: The winding path taken

In this section, we showcase the long and winding trial-and-error path that eventually resulted in a functioning methodological bridging tool. The section describes the process by separating four distinct stages of development: from the first stage of the initial ethnographic image data collection to the second, long stage of testing and labelling; to the third stage of building ethnographically informed labels; and, finally, to the stage of production and assessment of results acquired after the successful training of a neural network based on the labelling procedure performed at stage three. The aim of this presentation is to lay out the complications we met and the solutions we managed to find, and thus to help those building the method further to avoid the detours we took before arriving at a functioning, sustainable set of tools and practices.

Stage 1: Ethnography-informed initial image data collection

The foundation of the method developed is multi-sited ethnographic fieldwork. We conducted snap-along ethnography (Luhtakallio & Meriluoto, Citation2022) among different individual activists and activist groups across the four countries of comparison: Finland, Germany, France and Portugal. We followed activist groups with a broad range of topics from climate to housing justice, mental health, LGBTQ+ rights, etc., took part in the groups’ political activities, that is, the events they organised or took part in, and followed their ‘backstage’ life and discussion in their meetings and digital discussion platforms. Simultaneously, we followed the individuals’ and groups’ posts and their digital lives on social media. This combination was key in understanding the meanings assigned to specific images posted, the negotiations involved and the offline uses of images and their taking (Malafaia & Meriluoto, Citation2022). This thorough contextual understanding of the activists’ meaning-making was a crucial element in developing a successful machine-learning algorithm for image analysis.

The first step in the process of the large image data collection was building an ethnography-informed collection of hashtags. The first list concerning the Finnish context was complemented by means of crowdsourcing among colleaguesFootnote2 and by comparing it to previous research on youth political participation in Finland. The list was regularly updated, and new hashtags were added. Crucially, the list was compiled in tandem with the start of ethnographic fieldwork online and offline, using the ethnographers’ field insights on issues of political relevance among young people. When a new theme emerged in fieldwork – either online or offline, often simultaneously in both – the related hashtags were added and then followed to identify the accompanying hashtags they appeared with. In June 2019, the list consisted of 191 hashtags. The list was roughly organised into categories (climate change and environment, minority rights, feminism, immigration and racism, social stigma, etc.) and it then served as kindling for the compilation of hashtag lists for the other three countries. The same hashtag collection procedure was thus replicated in all the countries of comparison, in the respective languages but also in English, following context-specific topical themes.

Initiating the data collection procedure, we merged the lists from the four country contexts and used the merged list to obtain a large image dataset automatically. Concretely, we created a programme in the Python programming language using the ‘Instaloader’ library to download the images hashtagged on Instagram. After cleaning the dataset of duplicates and images in an undisplayable form, we arrived at a dataset of 132,939 images.
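For illustration, a minimal sketch of such a hashtag-based scraping routine with the Instaloader library might look as follows; the hashtags, target directories and the per-tag download cap are our illustrative assumptions here, not the project’s actual configuration.

```python
# Minimal sketch of a hashtag-based Instagram image scrape with Instaloader.
# Hashtags, target directories and the per-tag limit are illustrative assumptions.
import instaloader

loader = instaloader.Instaloader(
    download_videos=False,      # images only
    download_comments=False,
    save_metadata=False,
)

hashtags = ["ilmastolakko", "climatestrike"]  # hypothetical entries from the merged list

for tag in hashtags:
    hashtag = instaloader.Hashtag.from_name(loader.context, tag)
    for i, post in enumerate(hashtag.get_posts()):
        if i >= 1000:           # cap downloads per hashtag for this sketch
            break
        loader.download_post(post, target=f"images/{tag}")
```

Deduplication and the removal of undisplayable files would then be run over the downloaded directories as a separate cleaning pass.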

Stage 2: Testing, testing, labelling, labelling

Our overarching objective with AI was to teach it to recognise political action in images. To do this, we needed to break political action down into different forms and features that the AI could visually identify. These identifiable features can be developed into labels (descriptive names for image categories) that the AI will learn to recognise and will use to categorise images. Thus, identifying the labels was the essential step both ethnographically and computationally: the labels should both cover the main features of recognised political action and be visually distinct and identifiable for the AI. This process thus ultimately also meant a transition from an interpretative epistemological universe to a positivist logic: images that hold multiple meanings for humans needed to be looked at through visual elements that were concretely present in the image and thus identifiable for the AI, as we will see in what follows.

In the early testing stage, we used an initial list of hashtags to compile a small data sample for the team’s social scientists to go through and identify relevant frames that could then serve as labels for image classification. We looked at 931 images and wrote down observations, loosely informed by a Goffman (Citation1974) frame analysis approach: What seems to be going on in the image (see also Luhtakallio, Citation2013)? It quickly became evident that the ‘meanings’ and ‘framings’ the social scientists found interesting were too complex for the AI to grasp. Unsupervised machine learning algorithms proved insufficient for categorising a complex sociological phenomenon like visual political action. This eventually steered us away from an unsupervised approach towards supervised machine learning.

The next task was that of simplification: How could we break down a complex sociological research question – ‘How is politicisation done in and through images?’ – into a set of questions an AI can answer by extracting data from images? We first scaled the research question into a more descriptive approach: ‘What does societal participation look like?’ ‘What does a protest look like?’ Although the appearance of a protest was but one feature of the broader question, it was a start.

At this point, we set up a labelling programme that allowed our team to label each image into one prevalent category, based on predefined visual features, so that the algorithm could perform the same categorisation. The labelling programme was set up on a university server and developed using the Django framework in Python with a MySQL database (see the Technical Appendix for reference) to store the labelling-related information.
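To give a sense of the setup, below is a minimal sketch of the kind of Django model that could back such a labelling programme; the model and field names are hypothetical illustrations, not the project’s actual schema.

```python
# Hypothetical sketch of a Django model backing the labelling programme.
# Each row records which team member assigned which label to which image.
from django.db import models

class ImageAnnotation(models.Model):
    image_path = models.CharField(max_length=500)        # location of the scraped image
    label = models.CharField(max_length=100)             # the one prevalent category chosen
    annotator = models.CharField(max_length=100)         # team member who labelled the image
    created_at = models.DateTimeField(auto_now_add=True) # when the label was assigned

    class Meta:
        unique_together = ("image_path", "annotator")    # one label per image per annotator
```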

As the technical setup for image labelling was put in place, we had to come up with labels that were both meaningful and functional. The ethnographers’ longstanding problem throughout the process was overestimating the classification capabilities of AI. Our very first listing consisted of 49 different labels divided into five different categories. We were simultaneously looking at the technical properties of the image (screenshot, illustration, etc.), image genre (meme, advertisement, etc.), image content (landscape, etc.) and also more theoretically informed aspects such as image frames (Goffman, Citation1974) (campaigning, violence, traffic, etc.) and visual grammars of commonality (individual, public, etc.) (Boltanski & Thévenot, Citation2006/1991; Thévenot, Citation2007). Inspired by earlier work with textual big data teasing out the different justifications people employ in public debates (Ylä-Anttila et al., Citation2022), we attempted to build a codebook based on justification analysis (Ylä-Anttila & Luhtakallio, Citation2016). Each researcher labelled images in the following way: for every image, we selected one state – positive, neutral or negative – for each of the nine categories predefined according to the justification analysis theoretical framework. That is, we labelled the images using justification theory’s ‘worlds’ as labels, adding a + or – to indicate whether the image was a justification or a critique drawing on that specific value base (see Blokker & Brighenti, Citation2011; Boltanski & Thévenot, Citation1999; Luhtakallio & Ylä-Anttila, Citation2023). For example, an image of a mass protest was labelled civic + because it drew on popular support and people’s voices as an argument.

For social scientists, this kind of multi-layeredness of the image was self-evident. For the AI, it was pure chaos. We performed tentative training sessions of a supervised deep learning model using AlexNet in the MATLAB programming language to automate this categorisation. Every test on data different from the training dataset produced results that looked random and were incomprehensible. At this stage, we were already increasingly disillusioned with the interpretative capacities of AI and aspired towards a closer combination of computational analysis and human interpretation.

Since it had become clear that our initial list of labels – sociologically relevant but computationally incomprehensible, overlapping and visually undetectable – needed to be simplified, we again revisited our simultaneously ongoing ethnographic fieldwork. With the ethnographic insight into what images and their taking meant among the actors of various protest and civic groups, we started building conceptual frames that could then be used as labels. Again, at first, our frames were too closely tied to an interpretative epistemological stance. They were crafted by asking the following in a frame analysis-informed manner: ‘What does the image do?’ ‘What kind of meaning does it send out?’ Although frames such as ‘building self-esteem’, ‘showing outrage’ and ‘showing strength’ were sociologically sound and relevant, they made no sense at all to the computer: the results from each and every tentative training session of our supervised deep learning model seemed random. After labelling 2,272 images with 130 different labels, the percentage of agreement (among the human researchers alone) was far too low at 35.8%. The social scientists saw layers upon layers of meaning, while the algorithm saw nothing at all.

At this point, we started to realise that there was a missing step between our labelling and the algorithm’s ‘way of seeing’. For the labels to be applicable for the algorithm, they had to focus more strictly on what is visible in the image. An image that conveys a meaning of ‘showing strength’ or of justifying based on civic values can equally be of one person looking directly at the camera or of a mass of people taking to the streets. AI can only classify what is visible, while the social scientific eye fixes easily on what lies behind the immediately obvious. Meaningfully bridging these ‘ways of seeing’ – i.e. two epistemological worlds – was the key compromise necessary for crafting a functional set of labels for image classification. For the AI to perform, the labels had to be based on what is visible, and yet they also had to correspond to a broader frame of action based on the ethnographic work for the classification to have any meaning beyond the mere organisation of data.

Hence, we took a step back and changed to a bottom-up approach: instead of starting with a vast pool of images, we started to build a training set with images based directly on ethnographic fieldwork, described next. In addition, we compartmentalised the procedure of training the algorithm by later adding refinement processes of our training dataset to train the algorithm more efficiently and track our progress and potential mistakes during each process. This solution was key to building labels both visually detectable and sociologically meaningful.

Stage 3: Building ethnographically informed labels

The ethnography-based data collection began by uploading images from our ethnographic fields, that is, images that were taken and posted on Instagram by our research participants who had given their consent to be followed as part of this research. We also included images the participants had shared with us as part of the research, and a limited set of images we had taken ourselves as field notes. We gathered images from seven different fieldwork sites in the countries of comparison and started to identify categories of visual politicisation from the images.

This effort differed significantly from all our previous attempts to build labels: this time, the process was based on a grounded approach to what we see in the image. Furthermore, we did not even seek to build a comprehensive list of labels that could successfully classify all the images. Rather, we sought to identify the most meaningful and recurrent categories in each field.

In earlier phases, the categories based solely on what is visible seemed uninteresting and unfruitful for a sociological analysis of images scraped directly from Instagram. Now, it was the bottom-up approach from ethnographic fieldwork that ensured a connection between what is visible in the image and what is meaningful sociologically. Identifying frames and constructing subsequent labels based on the image data that we thoroughly knew because it originated in our own fieldwork made it possible to translate different meanings of political action into characteristics that are visible in an image.

The ethnography-led approach also enabled the transition between different epistemologies. We could assume a positivist gaze on our image data without losing the interpretative insight because the images were part of our ongoing fieldwork: we had their multiple meanings and the participants’ interpretations already recorded in our fieldnotes. This shielded the data from being reduced to mere visual characteristics. While we needed to identify discernible image features for the AI, we knew we could come back from these and expand their meaning with the ethnographic insight we had about the images.

We first identified frames from our own fields and then came together to identify the most prevalent ones across all our field sites. The negotiations about prevalence and the differences identified in the contexts of different field sites ultimately resulted in shared image frames, in the following called image categories, identified across all our field sites.Footnote3 These categories were protest selfies, groupies, crowds, performances, formal gatherings, protest material, artificial protest images and threat. The category titles already partly give away the first observation, which is that most of the categories comprised people conducting various political activities. Indeed, selfies, groupies, crowds, performances, formal gatherings, and for the most part also the category of threat, mostly comprised pictures of people in different protest settings and environments.

We crafted specific definitions for each category, outlining in detail what needs to be visible in the image for it to be classified into the respective category. For instance, the category of crowds was defined by the ethnographers as ‘A mass of people (typically 100+) assembled for collective action in public – a demonstration march on a street or a square, often with flags, banners or signs depicted in addition to people. In the image, some people may be more visible than others because of perspective, but the image focus is not primarily on anyone in particular.’ Each category had a similarly descriptive definition that guided the team of ethnographers in forming the first categorisation for the training image dataset.Footnote4 The initial training set consisted of 369 images, which is relatively small compared with similar research datasets (Hashemi & Hall, Citation2019; Won et al., Citation2017, p. 1). Thus, the initial analyses were mostly tentative.Footnote5

As we tested and trained the neural network, we also refined the categories. Some refinements were made with computational analysis in mind, to make the image classes more distinct and easily identifiable for the neural network. Others were made from an ethnographic perspective, to make the frame of the image more sociologically precise. The category initially labelled violence, for example, entailed ‘confrontations with the police, barricades, guns, stone throwing, tear gas, arrests, injuries, fires’. First, we took fire and smoke as one of the guiding visible cues for this category, teaching the neural network to classify all images with smoke into the category of violence. As a result, however, a broad selection of photos from demonstrations was misclassified as ‘violence’ because of the fires and torches that are often present in demonstrations. Moreover, the category’s name – violence – was not an accurate description of the frame from a sociological perspective. Violence was not, in fact, what was going on in the broad category of images of police encounters, but more precisely the threat thereof.

As a solution, we refined the category from two perspectives. To make the category more easily identifiable for the neural network, we defined police presence as the key visible cue in the image. Police uniforms and vehicles were quite effortlessly identifiable for the AI, thus resulting in a well-defined and distinct category of images with police presence. At the same time, we renamed and redefined the category from ‘violence’ to ‘threat’. What we concluded these images were actually about was ‘threat’, that is, the potential of a violent event taking place at any point.

From the computational side, a beneficial practice we decided to follow was to split the output of the algorithm into ‘High’ and ‘Low’ probability classification results, retaining the same category structure for each of the two new outputs. This allowed us to track if and how much progress we observed in the classification accuracy of the ‘High Probability’ output, which the algorithm classified confidently. Observing progress in that output would then lead us to inspect the ‘Low Probability’ part of our results more thoroughly and ensure that consistency could be observed there as well. Moreover, images classified with high probability are likely to belong strictly to the category, thus providing a robust and more reliable output for the social researcher to examine. Adding a neutral category to include the images that did not fit any category was a further step we took to filter out irrelevant images.Footnote6
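A minimal sketch of such a split, assuming a scikit-learn classifier trained with probability estimates enabled (the 0.7 boundary follows the Technical Appendix; the variable and function names are illustrative):

```python
# Sketch: split classification results into 'High' and 'Low' probability outputs.
# Assumes clf was trained with probability=True so predict_proba is available.
def split_by_confidence(clf, features, threshold=0.7):
    probs = clf.predict_proba(features)      # per-class probabilities, shape (n_images, n_classes)
    best = probs.argmax(axis=1)              # most likely category per image
    confidence = probs.max(axis=1)
    high = [(i, clf.classes_[c]) for i, c in enumerate(best) if confidence[i] >= threshold]
    low = [(i, clf.classes_[c]) for i, c in enumerate(best) if confidence[i] < threshold]
    return high, low                         # (image index, predicted category) pairs
```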

As an example of the complexities in assessing the success of the classification, we present the percentages of accurately classified images from one category – ‘Protest material’ – throughout six computational analysis procedures (see Figure 1). This category was chosen as an example because most images from the other categories depict natural (physical) persons in the jurisprudential meaning of the term, and the GDPR-based data ethics binding our research project restrict the publication of such material (Amram, Citation2020). We use this category as the example throughout the rest of the article for two reasons: firstly, this way, the evolution of the method can be best followed, and secondly, we can illustrate this category with actual pictures without the risk of violating the privacy of the people appearing in the scraped image data.Footnote7 While it was one of the few categories that did not mainly include pictures of people (see footnote 3 above), the category of protest material did not otherwise stand out from the rest – it was neither outstandingly ‘good’ nor ‘bad’ in terms of our model.

Figure 1. Statistically calculated percentage of accuracy for the ‘protest material’ category for six chronologically consecutive computational analyses.


The figure shows that the classification results were best at the sixth attempt, after the neutral category had been added. The application of ResNet50V2 was important for improving the accuracy percentage. The percentage of True Positives, which was utilised as the statistical measure of accuracy in this case, is defined as ‘the ratio of correctly identified positive examples to the total number of positive examples in the dataset, expressed as a percentage’ (Fawcett, Citation2006, p. 862). We selected this evaluation method to benefit from a longitudinal view of our process and thus to assess the actual progress as carefully as possible. In the first stages of development (using MATLAB), we had only recorded the accurately predicted cases, i.e. the True Positives. Thus, to assess the progress, we applied the same statistical measure in the latest attempts to compare percentages on the same metric.
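For reference, this measure can be computed directly from paired annotations and predictions; a small sketch with illustrative label names:

```python
# Sketch: percentage of true positives for one category, following the
# definition quoted above (Fawcett, 2006). Label names are illustrative.
def true_positive_percentage(y_true, y_pred, category):
    positives = sum(1 for t in y_true if t == category)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == category and p == category)
    return 100.0 * hits / positives if positives else 0.0

# Example: 2 of 3 'protest_material' images correctly identified -> 66.7%
print(true_positive_percentage(
    ["protest_material", "crowds", "protest_material", "protest_material"],
    ["protest_material", "crowds", "neutral", "protest_material"],
    "protest_material"))
```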

However, we observed that the qualitatively assessed accuracy of the system after each of these attempts on global (non-previously annotated) data did not coincide with the improvement of the statistically calculated classification accuracy percentages.

Consequently, we concluded that the final evaluation of the computational system’s accuracy should be both qualitative and statistical to ensure the consistency of its ability to accurately classify images into the given categories.

After obtaining the first results with the neutral category, we decided to follow a refinement strategy to augment the training dataset. This was tentatively applied to reduce the workload and time required to augment the dataset properly. As a first step, we ran the programme on new images pulled from the final dataset of 132,939 images obtained in the last scraping process we had conducted. After the classification, we refined the results by removing the false positives, that is, images that were falsely categorised into each category. Subsequently, after a qualitative inspection, we added the accurately predicted results to the next training set. Through this feedback loop, we secured a gradual improvement in classification accuracy after each application of the algorithm to new data while simultaneously maintaining the consistency of the training dataset. Finally, we tested different machine learning classification methods after the extraction of features from the images using the convolutional neural network.Footnote8
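In outline, the refinement strategy can be summarised as the following loop, in which the inspection step stands in for the manual, qualitative review performed by the ethnographers; the function names are placeholders, not the project’s code.

```python
# Sketch of the refinement loop: retrain, classify a new batch, drop false
# positives via human review, and fold confirmed images into the training set.
def qualitative_inspection(batch, predicted):
    """Placeholder for the manual step: ethnographers remove false positives
    and return only the accurately predicted images with their labels."""
    raise NotImplementedError("human-in-the-loop review")

def refinement_loop(train_X, train_y, batches, clf):
    for batch in batches:
        clf.fit(train_X, train_y)                 # retrain on the current training set
        predicted = clf.predict(batch)
        kept_X, kept_y = qualitative_inspection(batch, predicted)
        train_X, train_y = train_X + kept_X, train_y + kept_y   # augment the training set
    return clf, (train_X, train_y)
```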

The refinement process ultimately proved to be not only important for the success of the computational analysis but also crucially fruitful for the ethnographic analysis. As we were tentatively refining the categories, we observed images that ‘did not fit’ the categories created on the basis of images from our fieldwork but that emerged so frequently that they required categories of their own. This was the case, for example, for images that served as ‘advertisements’ for political causes, and for maps, which emerged as a distinct category of their own. Moreover, the ‘neutral’ class contained images that could not be brushed off as politically insignificant. For example, images of bicycles and solar panels served as symbols for the changes required to fight the climate emergency.

From this procedure, it became evident that there was a need to create subcategories for reasons related both to computational efficiency and enrichment of ethnographic analysis. First, doing this would allow the algorithm to distinguish more accurately and with a higher confidence percentage the images depicting very specific scenes (such as an object held by a hand, a bicycle without a person, an animal with artificially added text, etc.). Second, it would provide subcategories of potential interest within the ‘neutral’ main category that could then be explored in more detail qualitatively. Their creation, however, demanded a new refinement process to determine what the most prevalent subcategories would be and how many of them would be required.

We started creating subcategories based on a distinguishable visible feature of the images added to each one, and named each subcategory after that feature. For example, a selfie that also contained artificially added text would be added to ‘ArtificialSelfies’, even though the parent category would be one of the main categories (in that case, ‘Protest Selfies’). Having created as many subcategories as we could, and having agreed upon their creation over two refinement processes – each following an application of our programme to new images from our dataset – we ended up with 91 subcategories, each belonging to a main category. We stopped at 91 because the programme already classified the input data efficiently and consistently, to an extent analogous to a social researcher performing the same task. As part of this process, we also created a new main category – ‘Ambiguous’ – in which we placed images that contained visual features from two or more of the main categories.

The last step added to the pipeline of computational procedures executed by our programme was a meta-analysis of the results by a state-of-the-art open-source object detection neural network architecture (Redmon et al., Citation2016), which has since been subject to constant improvement (Wang et al., Citation2023), in an effort to isolate suspected misclassifications without deleting or moving items arbitrarily to a different category. Based on the presence or absence of a specific ‘object’ – which in computational logic can also be a person – we programmed our computational system to group images after the classification from the main system and move them to a directory containing images suspected of not belonging to the category into which our main algorithm had classified them. In this way, a qualitative inspection of the results could take place afterwards by the ethnographers, potentially correcting the results of the first computational procedure and, thus, increasing the overall efficiency and accuracy of our computational analysis.

Overall, this backpropagation from the computational analysis back to the ethnographic is, we argue, a potentially fruitful avenue for future mixed methods research. By immersing oneself in the images that do not fit, the categories that seem ambiguous and new subcategories that emerge, we can discover new and emerging topics that are ‘bubbling under’ in social media. These observations can then guide future ethnographic work by helping identify potentially interesting topics and fields across countries and contexts.

Stage 4: Production and assessment of results

Our final dataset was divided into 13 batches, that is, smaller datasets of 10,227 images each, to perform the refinement process gradually while measuring, in parallel, the programme’s improvement.

Concretely, after running the programme on the first batch of images and refining the training set based on the output, we formed a training set containing nine main categories and 101 subcategories, not yet including the ‘Ambiguous’ class. With this training set, we reached an overall accuracy of 81%, using an F1-score as the statistical evaluation measure (Takahashi et al., Citation2022). Although the percentage was high enough to consider the classification successful, the results in some main categories, such as ‘Groupies’, and in specific subcategories (e.g. ‘outdoors’) looked random.
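For concreteness, a minimal sketch of such an F1-based evaluation with scikit-learn; the label lists here are dummies standing in for the project’s held-out annotations and the classifier’s predictions.

```python
# Sketch: overall and per-category F1 scores with scikit-learn.
from sklearn.metrics import f1_score

y_true = ["crowds", "protest_material", "groupies", "protest_material"]  # dummy annotations
y_pred = ["crowds", "protest_material", "crowds", "protest_material"]    # dummy predictions

print(f1_score(y_true, y_pred, average="weighted"))  # overall score across categories
print(f1_score(y_true, y_pred, average=None,
               labels=["crowds", "groupies", "protest_material"]))  # one score per category
```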

After running the programme on a second batch of 10,227 images and refining the training set again – this time to include nine main categories and 99 subcategories – we obtained the same overall accuracy (81%). The refinement process showed again that, in a few specific subcategories, the results seemed ambiguous. At this point, the training set consisted of 15,172 images.

Therefore, the final refinement included the deletion of the subcategories that were seemingly confusing our system and, subsequently, the creation of the ‘Ambiguous’ main category. The classification accuracy increased to 83%. Based on a qualitative evaluation of the results, the classification was now robust in every category, with obviously higher and almost impeccable classification accuracy for the ‘High Probability’ images. We decided to end the refining procedure at this point because the results were consistent. The final training set consisted of 18,348 images distributed in 101 folders, representing 10 main categories and 91 subcategories.

To illustrate how the evaluation of the classification evolved, we again look more closely at the ‘Protest Material’ category, whose classification accuracy seemed to improve statistically through the tentative classifications (see Figure 1) even though the researchers could not observe this improvement. The accuracy of the ‘Protest Material’ main category was 71.75% in the F1-score-based evaluation of the final training set, significantly lower than the 83% of the overall classification. Subsequently, we extracted a random sample of 50 images from the classification output after running the programme on the totality of the remaining batches (102,260 images), as depicted in Figure 2. Through their qualitative evaluation, it became evident that, even in this category, the programme was able to imitate the ethnographer’s categorisation rationale and robustly detect the ‘Protest Material’ related images.

Figure 2. Random sample of images classified by the final version of the programme as ‘Protest Material’.


Overall, the results were of similar quality in every category, with the slight exception of the main category ‘Performance’, where the results were also consistent but more prone to erroneous classification, given the vast potential variety of combinations of costumes and objects that can be used for a performance conveying political messages. This category notwithstanding, the trained algorithm’s capacity to detect the ethnographically defined categories of visual political action in any given set of images ‘out there’ reached an extremely successful level.

Conclusions

In this article, we have demonstrated how, through the development of an AI-based image classification system, ethnographic and computational approaches can be bridged, complementing each other and producing new insights and perspectives for both. Reflecting upon the methodological discourse presented in the beginning of this article, we have instantiated our theoretical assertions with empirical research. The development of an AI-based image classification system grounded in ethnographic insight reflects our commitment to a methodological synergy that not only serves the study of visual political action but also stands as evidence of the arguments made. We have done this by detailing the four stages our research group went through in search of a functioning way of training an algorithm based on ethnographic fieldwork. This process illustrates our assertion that ethnography and big data, when combined in a connective fashion, provide a nuanced perspective that enriches our understanding of political imagery and activists’ communication strategies.

We witnessed how machine learning algorithms often fail to predict outcomes accurately, even when statistical measures indicate the opposite. However, on the one hand, this seeming failure can provide rich cases for qualitative analysis, and on the other hand, it emphasises the necessity of collaboration between humanities scholars and computer scientists for developing a critical approach to machine learning. In addition, during the process we went through, it became evident how qualitative analysis of algorithmic failure can reveal hidden biases, assumptions and power structures in data and algorithms (Rettberg, Citation2022). From a computational perspective, we observed that the existence of common visual features in images, and consistency in the rationale for assigning images to specific categories, constitute key aspects for an ethnographically informed image classification. We also concluded that category re-evaluation – the refinement process in our case – proved to be crucial for forming categories that are both meaningful ethnographically and have distinctive visual characteristics for the computer to detect. Finally, we found that subcategories can be beneficial for a more accurate categorisation by the computer, but more importantly, they can allow for ambiguities and previously undetected differentiations in the data to emerge and be observed by the ethnographers. The critical re-evaluation of categories that emerged from our iterative process between ethnographic inquiry and computational analysis exemplifies the substantiation of our initial argument: that the intersection of these methods can yield insights that are greater than the sum of their parts.

Our results indicate that, through a continuous back and forth between the ethnographic and the computational, a computational system can be created that categorises visuals into ethnographically informed categories with high confidence and accuracy. In parallel, we suggest that, using our method, ethnographers can arrive at new theoretical perspectives on a given dataset after preparing it for the training of the computational system. The methods-bridging results, rather than in a methodological ‘bank of the river’ on which both parties stay intact, in a back-and-forth ‘flow of the river’ in which both research approaches learn new insights by being exposed to one another. This conclusion is compatible with previous research demonstrating that an AI system can be not just a tool for humans to use but rather a complex socio-technical system with its own agency, subjectivity and ‘social life’ (Munk et al., Citation2022). Our findings and the resulting discussion provide concrete examples of how to effectively bridge the perceived gap between ethnographic richness and computational efficiency. This underscores the value of bridging these methodologies, while also acknowledging the intricate balance between ethnographic depth and computational rigour. Bridging AI to ethnography and ethnography to AI provides a promising avenue for future research mapping online visual politics, as the methodological combination brings the best from both worlds: a thick picture drawing results from large datasets while making sense not only to the human eye but also for ambitious research questions about the uses and meanings of images in political activity.

Acknowledgments

The authors want to acknowledge the invaluable participation of Karine Clément, Juulia Heikkinen, Jenni Kettunen, Carla Malafaia and Jyrki Rasku in the development of the method described in this article.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work was supported by the European Research Council under Grant Agreement number 804024 (ImagiDem, ERC-2018-STG).

Notes on contributors

Vasileios Maltezos

Vasileios Maltezos is a PhD candidate in Computational Social Science at the University of Helsinki.

Eeva Luhtakallio

Eeva Luhtakallio is Professor of Sociology at the University of Helsinki.

Taina Meriluoto

Taina Meriluoto is Senior Researcher in Sociology at the University of Helsinki.

Notes

1. The authors want to acknowledge the invaluable participation of Karine Clément, Juulia Heikkinen, Jenni Kettunen, Carla Malafaia and Jyrki Rasku in the development of the method described in this article.

2. Crowdsourcing was conducted among social scientific colleagues who were a) studying topics related to activism and b) active on social media. Concretely, we opened a Google Docs document and shared it with a group of approximately 10 scholars, who all read through our initial list of hashtags and added suggestions of hashtags they had encountered in their respective fieldwork or their own political activities.

3. These negotiations also provided us with valuable data for comparative analyses, to be addressed in future publications.

4. In brief, protest selfies = one person depicted in protest and/or with visible protest material; groupies = like the former but comprising a small group of people; performances = theatrical protest events marked by different forms and degrees of staging, costumes, coordinated movements etc.; protest material = in-site photographed materials crafted for the protest, such as signs, flyers, temporary constructions, wall paintings; formal gatherings = depictions of meetings both indoors and on the streets, speaker-audience depictions, joint working efforts; threat = presence of riot police forces, confrontations with the police, barricades, guns, stone throwing, tear gas, arrests, injuries, fires; and artificial protest images = memes and otherwise computer-modified photos with captions, titles, quotes, emojis, animations etc. added.

5. They were tentative also because of the potential occurrence of overfitting, had we obtained high percentages of accuracy at this early stage. Overfitting is the phenomenon occurring during algorithmic training when the algorithm fits the noise in the training data, memorising various peculiarities rather than finding a general predictive rule (Dietterich, Citation1995).

6. We added 457 images to the training set and tested the algorithm by running the programme once again to get the results for the classification accuracy, now including the neutral category. We performed this action to obtain an additional indicator of whether the classification analysis was improving or not.

7. For analyses with images from the categories with identifiable people in them, see Luhtakallio & Meriluoto, Citation2022; Luhtakallio et al., Citation2024; Meriluoto, Citation2023.

8. Informed by prior literature, we chose to apply support vector machines (Cortes & Vapnik, Citation1995; Osuna et al., Citation1997) because this method has been proven more efficient than the other available classification algorithms (K-nearest neighbours and the random forest classifier) for computer vision-related classifications with large datasets in high-dimensional spaces (Thanh Noi & Kappas, Citation2018).

References

  • Amram, D. (2020). Building up the “Accountable Ulysses” model. The impact of GDPR and national implementations, ethics, and health-data research: Comparative remarks. Computer Law & Security Review, 37, 105413. https://doi.org/10.1016/j.clsr.2020.105413
  • Blokker, P., & Brighenti, A. (2011). Politics between justification and defiance. European Journal of Social Theory, 14(3), 283–300. https://doi.org/10.1177/1368431011412346
  • Boltanski, L., & Thévenot, L. (1999). The sociology of critical capacity. European Journal of Social Theory, 2(3), 359–377. https://doi.org/10.1177/136843199002003010
  • Boltanski, L., & Thévenot, L. (2006). On justification: Economies of worth (C. Porter, Trans.). Princeton University Press.
  • Caldeira, S. P., Van Bauwel, S., & De Ridder, S. (2021). ‘Everybody needs to post a selfie every once in a while’: Exploring the politics of Instagram curation in young women’s self-representational practices. Information, Communication & Society, 24(8), 1073–1090. https://doi.org/10.1080/1369118X.2020.1776371
  • Charles, V., & Gherman, T. (2019). Big data analytics and ethnography: Together for the greater good. In Big data for the greater good (pp. 19–33). Springer.
  • Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, United States (pp. 785–794).
  • Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297. https://doi.org/10.1007/BF00994018
  • Dietterich, T. (1995). Overfitting and undercomputing in machine learning. ACM Computing Surveys, 27(3), 326–327. https://doi.org/10.1145/212094.212114
  • Dunleavy, D. (2020). Visual semiotics theory: Introduction to the science of signs. In S. Josephson, J. Kelly, & K. Smith (Eds.), Handbook of visual communication: Theory, methods, and media (2nd ed., pp. 155–170). Routledge.
  • Evans, B. (2016). Paco-applying computational methods to scale qualitative methods. Ethnographic Praxis in Industry Conference Proceedings, 2016(1), 348–368. https://doi.org/10.1111/1559-8918.2016.01095
  • Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861–874. https://doi.org/10.1016/j.patrec.2005.10.010
  • Goffman, E. (1974). Frame analysis. Penguin Books.
  • Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.
  • Hand, M. (2020). Photography meets social media: Image making and sharing in a continually networked present. In G. Pasternak (Ed.), The handbook of photography studies (pp. 310–326). Routledge.
  • Hankey, S., & Tuszynski, M. (2017). Exposing the invisible: Visual investigation and conflict. In J. Eder & C. Klonk (Eds.), Image operations: Visual media and political conflict (pp. 169–183). Manchester University Press.
  • Hardesty, M., Gironda, C., & Belleau, E. (2019). This is what a #FEMINIST, #ANTIFEMINIST looks like? Political selfies and the paradox of giving voice to virtual bodies. Feminist Formations, 31(2), 229–261. https://doi.org/10.1353/ff.2019.0023
  • Harper, D. (2005). An argument for visual sociology. In J. Prosser (Ed.), Image-based research (pp. 24–41). Routledge.
  • Hashemi, M., & Hall, M. (2019). Detecting and classifying online dark visual propaganda. Image and Vision Computing, 89, 95–105. https://doi.org/10.1016/j.imavis.2019.06.001
  • Hine, C. (2000). Virtual ethnography. Sage.
  • Junnilainen, L., & Luhtakallio, E. (2015). Media ethnography. In G. Mazzoleni, & M. Rousiley (Eds.), The international encyclopedia of political communication (pp. 1–4). John Wiley & Sons.
  • Kozinets, R. V. (2002). The field behind the screen: Using netnography for marketing research in online communities. Journal of Marketing Research, 39(1), 61–72.
  • Kozinets, R. V. (2010). Netnography: Doing ethnographic research online. Sage Publications.
  • Kuntsman, A. (Ed.). (2017). Selfie citizenship. Springer.
  • Laaksonen, S. M., Nelimarkka, M., Tuokko, M., Marttila, M., Kekkonen, A., & Villi, M. (2017). Working the fields of big data: Using big-data-augmented online ethnography to study candidate–candidate interaction at election time. Journal of Information Technology & Politics, 14(2), 110–131. https://doi.org/10.1080/19331681.2016.1266981
  • LeCun, Y., Bengio, Y., & Hinton, G. E. (2015). Deep learning. Nature, 521(7553), 436–444. https://doi.org/10.1038/nature14539
  • Luhtakallio, E. (2013). Bodies keying politics: A visual frame analysis of gendered local activism in France and Finland. In N. Doerr, A. Mattoni, & S. Teune (Eds.), Advances in the visual analysis of social movements (pp. 27–54). Emerald Group Publishing Limited.
  • Luhtakallio, E. (2018). Imagi(ni)ng democracy: European youth becoming citizens through visual participation. European Research Council Research Plan Granted ERC-Stg. https://doi.org/10.3030/804024
  • Luhtakallio, E., & Meriluoto, T. (2022). Snap-along ethnography: Studying visual politicisation in the social media age. Ethnography, 14661381221115800. https://doi.org/10.1177/14661381221115800
  • Luhtakallio, E., Meriluoto, T., & Malafaia, C. (2024). Visual politicization and youth challenges to an unequal public sphere: Conceptual and methodological perspectives. In J. Conner (Ed.), The handbook on youth activism (pp. 140–153). Edward Elgar Publishing. https://doi.org/10.4337/9781803923222.00021
  • Luhtakallio, E., & Ylä-Anttila, T. (2023). Justifications analysis. In R. Diaz-Bone & G. de Larquier (Eds.), Handbook of economics and sociology of conventions (pp. 1–20). Springer.
  • Malafaia, C., & Meriluoto, T. (2022). Making a deal with the devil? Portuguese and Finnish activists’ everyday negotiations on the value of social media. Social Movement Studies, 23(2), 1–17. https://doi.org/10.1080/14742837.2022.2070737
  • Meriluoto, T. (2023). The self in selfies – conceptualising the selfie-coordination of marginalised youth with sociology of engagements. British Journal of Sociology, 74(4), 638–656. https://doi.org/10.1111/1468-4446.13015
  • Munk, A. K., Olesen, A. G., & Jacomy, M. (2022). The thick machine: Anthropological AI between explanation and explication. Big Data & Society, 9(1), 205395172110698. https://doi.org/10.1177/20539517211069891
  • Noy, C. (2020). Voices on display: Handwriting, paper, and authenticity, from museums to social network sites. Convergence: The International Journal of Research into New Media Technologies, 26(5–6), 1315–1332. https://doi.org/10.1177/1354856519880141
  • Osuna, E., Freund, R., & Girosi, F. (1997). Training support vector machines: An application to face detection. In Proceedings of the 1997 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (pp. 130–136). IEEE. https://doi.org/10.1109/CVPR.1997.609310
  • Pauwels, L. (2010). Visual sociology reframed: An analytical synthesis and discussion of visual methods in social and cultural research. Sociological Methods & Research, 38(4), 545–581. https://doi.org/10.1177/0049124110366233
  • Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 779–788). IEEE. https://doi.org/10.1109/CVPR.2016.91
  • Rettberg, J. W. (2022). Algorithmic failure as a humanities methodology: Machine learning’s mispredictions identify rich cases for qualitative analysis. Big Data & Society, 9(2), 205395172211312. https://doi.org/10.1177/20539517221131290
  • Scollon, R., & Scollon, S. W. (2004). Nexus analysis: Discourse and the emerging internet (1st ed.). Routledge. https://doi.org/10.4324/9780203694343
  • Scollon, R., & Scollon, S. W. (2007). Nexus analysis: Refocusing ethnography on action. Journal of Sociolinguistics, 11(5), 608–625. https://doi.org/10.1111/j.1467-9841.2007.00342.x
  • Seo, Y., Moon, J., Choi, G. W., & Do, J. (2022). A scoping review of three computational approaches to ethnographic research in digital learning environments. Tech Trends, 66(1), 102–111. https://doi.org/10.1007/s11528-021-00689-3
  • Shifman, L. (2013). Memes in digital culture. The MIT Press. https://doi.org/10.7551/mitpress/9429.001.0001
  • Takahashi, K., Yamamoto, K., Kuchiba, A., & Koyama, T. (2022). Confidence interval for micro-averaged F1 and macro-averaged F1 scores. Applied Intelligence (Dordrecht), 52(5), 4961–4972. https://doi.org/10.1007/s10489-021-02635-5
  • Thanh Noi, P., & Kappas, M. (2018). Comparison of random forest, k-nearest neighbor, and support vector machine classifiers for land cover classification using sentinel-2 imagery. Sensors, 18(1), 18. https://doi.org/10.3390/s18010018
  • Thévenot, L. (2007). The plurality of cognitive formats and engagements: Moving between the familiar and the public. European Journal of Social Theory, 10(3), 409–423. https://doi.org/10.1177/1368431007080703
  • Tiidenberg, K., & Gómez Cruz, E. (2015). Selfies, image and the re-making of the body. Body & Society, 21(4), 77–102. https://doi.org/10.1177/1357034X15592465
  • Wang, C.-Y., Bochkovskiy, A., & Liao, H.-Y. M. (2023). YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7464–7475). IEEE. https://doi.org/10.1109/CVPR52729.2023.00721
  • Webb Williams, N., Casas, A., & Wilkerson, J. D. (2020). Images as data for social science research: An introduction to convolutional neural networks for image classification. Cambridge University Press. https://doi.org/10.1017/9781108860741
  • Won, D., Steinert-Threlkeld, Z. C., & Joo, J. (2017). Protest activity detection and perceived violence estimation from social media images. In Proceedings of the 25th ACM International Conference on Multimedia (MM ’17) (pp. 786–794). Association for Computing Machinery. https://doi.org/10.1145/3123266.3123282
  • Ylä-Anttila, T., Eranti, V., & Kukkonen, A. (2022). Topic modeling for frame analysis: A study of media debates on climate change in India and USA. Global Media and Communication, 18(1), 91–112. https://doi.org/10.1177/17427665211023984
  • Ylä-Anttila, T., & Luhtakallio, E. (2016). Justifications analysis: Understanding moral evaluations in public debates. Sociological Research Online, 21(4), 1–15. https://doi.org/10.5153/sro.4099

Technical Appendix

When it came to our early efforts to annotate the image data, we used the Django framework in Python and MySQL database software for annotation attribution to each image (Image 1). The annotation procedure continued by adding images to folders, with each folder representing a label. In this section, we also describe the finalised form and parameters of our computational system. For the parameters, we rescaled each input image to 224×224 pixels before passing it to the feature extractor. For the feature extraction procedure, we applied ResNet50V2 (He et al., 2016), passing the weight parameter as ‘imagenet’, with average pooling (Lin et al., 2014, p. 8). For the classification procedure, we applied a support vector classification (Asimit et al., 2022) algorithm with an ‘ovo’ decision function shape and a linear kernel. A boundary of 0.7 (70%) confidence was set for splitting high- and low-probability results; that is, more than 70% algorithmic confidence that an image belonged to a given category placed the image, in its corresponding main category and subcategory, in the ‘High Probability’ output.
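Assembled from the parameters described above, a compact sketch of the pipeline might read as follows; the training paths and labels are assumptions standing in for the project’s data.

```python
# Sketch of the appendix's pipeline: ResNet50V2 ('imagenet' weights, average
# pooling) as feature extractor, then a linear-kernel one-vs-one SVC.
# probability=True enables the 0.7 high/low-probability split described above.
import numpy as np
from tensorflow.keras.applications.resnet_v2 import ResNet50V2, preprocess_input
from tensorflow.keras.preprocessing import image
from sklearn.svm import SVC

extractor = ResNet50V2(weights="imagenet", include_top=False, pooling="avg")

def extract_features(path):
    img = image.load_img(path, target_size=(224, 224))   # rescale to 224x224 pixels
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return extractor.predict(x, verbose=0).flatten()     # 2048-dimensional feature vector

train_paths = ["img/0001.jpg", "img/0002.jpg"]   # assumed paths to training images
train_labels = ["protest_material", "crowds"]    # assumed corresponding labels

X = np.array([extract_features(p) for p in train_paths])
clf = SVC(kernel="linear", decision_function_shape="ovo", probability=True)
clf.fit(X, train_labels)
```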

Figure 1. Random sample of images classified by the final version of the programme as ‘Protest Material’.


The final form of the secondary programme was applied after the main automatic categorisation. It classified as “Suspected False Positives” images predicted as “Protest selfies” if they depicted more than six people, as well as images predicted as “Crowds” if they depicted one person or no person; images predicted as “Crowds” that depicted between three and eight people were classified as suspected groupies. Furthermore, if the secondary programme detected one person or none in images predicted as “Groupies”, it classified those images as “Suspected False Positives”. If it detected chairs in addition to people, it classified the corresponding images as “Suspected Meeting Deliberation”. Finally, the same programme categorised images initially predicted as “Meeting Deliberation” but depicting three or fewer people, in the same way, as “Suspected False Positives”. None of the additional folders created after the secondary programme’s meta-analysis were included in the training set; they were created as subfolders in the folder of the corresponding main category. The model we utilised for the classification was YOLOv5s (Jocher, 2020), an object detection architecture pre-trained on the COCO dataset to detect the ‘objects’ in each image. Based on the absence or presence of the specified ‘objects’, we then separated the images into the additional folder(s) with suspected misclassifications.
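A condensed sketch of this person-count check, using the YOLOv5s model via torch.hub; the flagging rules follow the description above, while directory handling and names are simplified assumptions.

```python
# Sketch: flag suspected misclassifications based on YOLOv5s person counts.
# Thresholds follow the rules described above; names are illustrative.
import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5s")   # COCO-pretrained detector

def count_persons(image_path):
    detections = model(image_path).pandas().xyxy[0]       # one row per detected object
    return int((detections["name"] == "person").sum())

def flag(image_path, predicted_category):
    n = count_persons(image_path)
    if predicted_category == "Protest selfies" and n > 6:
        return "Suspected False Positives"
    if predicted_category == "Crowds":
        if n <= 1:
            return "Suspected False Positives"
        if 3 <= n <= 8:
            return "Suspected Groupies"
    if predicted_category == "Groupies" and n <= 1:
        return "Suspected False Positives"
    if predicted_category == "Meeting Deliberation" and n <= 3:
        return "Suspected False Positives"
    return None  # keep the main classifier's category
```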

References

Asimit, A. V., Kyriakou, I., Santoni, S., Scognamiglio, S., & Zhu, R. (2022). Robust Classification via Support Vector Machines. Risks, 10(8), 154–179. https://doi.org/10.3390/RISKS10080154

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 770–778). https://doi.org/10.1109/CVPR.2016.90

Jocher, G. (2020). YOLOv5 by Ultralytics (Version 7.0) [Computer software]. https://doi.org/10.5281/zenodo.3908559

Lin, M., Chen, Q., & Yan, S. (2014, April 14). Network in network. 2nd International Conference on Learning Representations, ICLR 2014 - Conference Track Proceedings.