wikihistories 2024: Wikipedia and/as Data

Image generated by Microsoft Copilot using DALL-E 3 to combine the concepts of Wikipedia, histories, and data

What is Wikipedia’s relationship to data? What should Wikipedia’s relationship to data be?

The 2024 wikihistories symposium took place on the 19th of June and was co-located with ICA Gold Coast. It was brought to you by the wikihistories project at the University of Technology Sydney in partnership with the Centre for Media Transition, the ARC Centre of Excellence in Automated Decision-Making and Society (ADMS+) and Wikimedia Australia.

The symposium gathered together social scientists, humanists, critical technologists, and others to investigate Wikipedia’s connection to data and the importance of this relationship for the global information ecosystem and the production of knowledge.

Programme

09:30-10:00	Coffee and registration
10:00-10:15	Welcome and introductions	Michael Falk
10:15-10:30	Setting a research agenda for Wikipedia data studies	Heather Ford
10:30-12:00	Theme 1: Data Translation. Work in this theme involves the study of how Wikipedia data is used by researchers and engineers and the effects of this translation.	I: Agreeable Data: How Wikipedian consensus is conceptualized by computer science researchers; II: How researchers understand Wikipedia bias	I: Steve Jankowski; II: Heather Ford and Francesca Sidoti; Respondent: Benjamin Mako Hill
12:00-12:30	Lunch
12:30-14:00	Theme 2: Data Patterns. Research in this theme analyses Wikipedia data to reveal underlying patterns.	I: Reflective practice through Wikipedia data; II: Fabricating Facts: A Semantic Network Analysis of the Wikidata Ontology	I: Sohyeon Hwang; II: Andrew Iliadis; Respondent: Francesco Bailo
14:00-15:30	Theme 3: Data Markets. Research in this theme analyses data flows between Wikimedia and other knowledge sources.	I: Tracking organisations as sources in Wikipedia; II: Verifiability, epistemic value and the open knowledge market	I: Amanda Lawrence; II: Michael Davis and Heather Ford; Respondent: Michael Falk
15:30-16:00	Close and next steps	Heather Ford and Michael Falk

In each research theme, there were 15-minute presentations from each of the two speakers, followed by a 10-minute response from the respondent. The remainder (approx. 50 mins) was a discussion to map out research questions, issues and methods within the theme. In the final session, we mapped out the tasks needed to complete the write-up of a journal article that sets out a research agenda for studying Wikipedia’s role as a data source and as a producer of data.

Call for papers

Wikipedia has always been a critical source of data for computer science projects, offering data scientists a massive store of open data. Researchers and developers use Wikipedia to work on natural language processing (NLP) tasks and applications, model user interactions with content and other users, deliver factual statements to users in automated question-answering tasks, and find nearby features as represented by Wikipedia articles (Iliadis, 2022; Iliadis & Ford, 2023).

These practitioners use Wikipedia as a store of facts assuming that it expresses an established consensus as a result of its policies and processes. Yet, Wikipedia’s natural language could contain meanings that resist translation into data and whose classifications might be open to interpretation and critique (Ford & Iliadis, 2023). For example, articles about complex topics such as Jerusalem do not easily align with standard ways of representing entities like cities. Jerusalem’s infobox reflects Wikipedia’s power to make important decisions about how we understand facts and the meanings that are associated with them (Ford & Graham, 2016). This power is intensified when entire Wikipedia articles are translated into structured datafied knowledge bases of machine-readable statements – by the Wikidata project, for example, which started in 2012 as a project of the Wikimedia Foundation (Ford, 2020).

How researchers measure Wikipedia’s sociocultural biases also depends on the datafication of Wikipedia’s content and how such processes may be questioned rather than taken for granted. Measuring the extent to which Wikipedia represents Australians, for example, could simply be achieved by counting articles that are categorised in the “Australians” data category, and yet this category itself is not an objective representation of Australianness but rather the result of particular practices that resist stable referents (Falk et al., 2023). As Wikipedia’s content is increasingly used to power virtual assistants such as Amazon Alexa and more recently large language model applications like ChatGPT and Google’s Bard, Wikipedia participates in the global information ecosystem in ways that go well beyond its role as a web-based encyclopaedia (McDowell & Vetter, 2023). Thus, it is important to understand Wikipedia’s relationship to data, not as a given, but as something to be critically investigated.

This symposium will gather together social scientists, humanists, critical technologists, and others to investigate Wikipedia’s connection to data and the importance of this relationship for the global information ecosystem and the production of knowledge. The workshop will be organised as a day-long, face-to-face event prior to the annual International Communication Association conference on the Gold Coast in Australia.

Participants will be invited to share short presentations and to participate in discussions focused on the questions “What is Wikipedia’s relationship to data?” and/or “What should Wikipedia’s relationship to data be?” Participants will also agree to read a few background papers prior to the gathering. The workshop will result in a collaborative document that maps out possible areas for researching these questions from a sociotechnical lens and the option to continue the collaboration post-symposium.

Abstracts

Agreeable Data: How Wikipedian consensus is conceptualized by computer science researchers

Dr Steve Jankowski (Lecturer in New Media and Digital Culture, University of Amsterdam)

Consensus is a political process that has been interpreted as data by computer scientists. In many ways, this makes sense from a disciplinary position. Because consensus is used as “the main vehicle for editorial decision-making” (Ford, 2022, 4), and these decisions are recorded as various kinds of freely accessible data on Wikipedia, computer scientists have used the site as “a convenient dataset” to study social behaviours (Hill and Shaw, 2019). However, the meaning of consensus is far from straightforward. Wikipedians use it to mean anything from democratic deliberation to technocratic data exchange, and it depends on the layered contexts where it is made sensible (Jankowski, 2022). In this presentation, I ask to what degree computer scientists engage with this complexity and how interpreting consensus through the data they collect changes what it means. These questions are significant because Wikipedia has long been used as a source of data for training algorithms, and “the political application of AI and machine learning is so commonly geared to settle or predict difficult societal problems in advance,” that computer science articles themselves “become political texts in the sense that they decide what is at stake in the parameters of a problem” (Amoore et al., 2023, 1; 9). Through a systematic literature review of computer science articles (2014-2024), this presentation provides insight into the character, frequency, and scope of how computer scientists define Wikipedian consensus through data. The paper concludes with suggestions about the need for humanities scholars to further engage with computer science research to understand emergent political ideas, while also arguing that computer scientists need other disciplines to colour their interpretations of phenomena and methods of analysis.

References

Amoore, L., Campolo, A., Jacobsen, B., and Rella, L. (2023). Machine learning, meaning making: On reading computer science texts. Big Data & Society, 1–13.

Ford, H. (2022). Writing the Revolution: Wikipedia and the Survival of Facts in the Digital Age. MIT Press.

Hill, B. M. and Shaw, A. (2019). The Most Important Laboratory for Social Scientific and Computing Research in History. Wikipedia @ 20.

Jankowski, S. (2022). Making consensus sensible: The transition of a democratic ideal into Wikipedia’s interface. Journal of Peer Production, 15.

Reflective practice through Wikipedia data

Sohyeon Hwang (PhD Candidate, Northwestern University)

The use of Wikipedia as a key source of data has raised questions about how researchers, technologists, and users might critically understand Wikipedia as a data source shaping emerging tools and analyses. In this presentation, I discuss how Wikipedia’s data offers us opportunities for reflective practices by shedding light on the dynamics of negotiation, change, and difference that shape the encyclopedia. I draw on both work I have done and the rich body of research by the Wikimedia research community to argue that the same digital trace data of Wikipedia used to develop downstream technologies also makes it possible to critically audit and evaluate Wikipedia, which is so useful as a source of data in these technologies. I note three key ways that Wikipedia data can provoke critical investigation into Wikipedia as a data source. First, leveraging data to compare patterns across diverse language editions of Wikipedia enables us to disrupt narratives that Wikipedia necessarily expresses widely established consensus about content. Second, the fine-grained nature of the data from Wikipedia gives us insight into the organizational dynamics – such as governance processes or mechanisms – that shape the production of content, allowing us to critically consider how content gaps and imbalances occur. Finally, data from Wikipedia allows us to evaluate the effectiveness and limits of community interventions to counter concerns about Wikipedia content, such as bias, with a view to improving strategies. Together, these uses mean that Wikipedia data can become an artifact for reflective practice that helps guide the community’s efforts and contributions toward positive change, as well as flag and anticipate potential downstream concerns.

Fabricating Facts: A Semantic Network Analysis of the Wikidata Ontology

Dr Andrew Iliadis (Assistant Professor, Temple University)

Wikidata is a free, open-source knowledge base with millions of “data items that anyone can edit” (wikidata.org). These data items are structured data that convey facts about the world. People use many media technologies that benefit from querying and retrieving factual data in Wikidata, including search engines and virtual assistants. The Wikidata project thus plays a fundamental role in fabricating facts and transmitting them worldwide. Yet, as is well-known in the case of Wikipedia, facts are often contested and socially constructed (Ford, 2022). How are these factual data in Wikidata defined, organized, and related? What metadata vocabulary terms are included, and what is their underlying structure? Such information can be found in Wikidata’s ontology, which lists the top levels after the superclass root term “entity.” This project will investigate Wikidata’s ontological structure by conducting a semantic network analysis of Wikidata’s upper-level ontology and explain how researchers interested in social power can study fact-transmitting infrastructures globally. I will use the methods of ontology network analysis (Kalfoglou et al., 2002; Alani et al., 2003; Weng et al., 2008; Figueres-Esteban et al., 2016; Hui-Jia et al., 2017) and semantic network analysis of ontologies (Hoser et al., 2006). While Wikidata’s ontology is not immediately visible in its entirety, there are several ways to look up some of its contents. Wikidata contains pages documenting pieces of the ontology, and some of them provide information about querying the entities and relationships. I will use the Wikidata Query Service and run SPARQL queries looking for Wikidata’s top-level ontology.
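As a rough illustration of the kind of query this abstract describes, the Python sketch below builds a SPARQL query for the direct subclasses of the root class and turns a result set into an edge list whose in-degrees can be counted, a very small first step toward a network analysis. It assumes that the root term “entity” corresponds to Wikidata item Q35120 and that the class hierarchy is expressed with property P279 (“subclass of”). The helper names and the sample response are hypothetical, and this is not necessarily the query used in the project.

```python
from collections import Counter
from urllib.parse import urlencode

# Public endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Assumption: the root term "entity" is item Q35120 and the
# hierarchy is expressed with property P279 ("subclass of").
ROOT_ITEM = "Q35120"


def build_query(root=ROOT_ITEM, limit=100):
    """Return a SPARQL query for the direct subclasses of `root`."""
    return (
        "SELECT ?child ?parent WHERE { "
        f"VALUES ?parent {{ wd:{root} }} "
        "?child wdt:P279 ?parent . "
        f"}} LIMIT {limit}"
    )


def build_url(query):
    """Return a GET URL that asks the endpoint for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


def edges_from_results(payload):
    """Extract (child, parent) item IDs from SPARQL JSON results."""
    edges = []
    for row in payload["results"]["bindings"]:
        child = row["child"]["value"].rsplit("/", 1)[-1]
        parent = row["parent"]["value"].rsplit("/", 1)[-1]
        edges.append((child, parent))
    return edges


def in_degree(edges):
    """Count how many classes point at each parent class."""
    return Counter(parent for _, parent in edges)


# Hypothetical response in the SPARQL JSON results shape; the child
# IDs are placeholders, not real query output.
SAMPLE = {
    "results": {"bindings": [
        {"child": {"value": "http://www.wikidata.org/entity/Q_A"},
         "parent": {"value": "http://www.wikidata.org/entity/Q35120"}},
        {"child": {"value": "http://www.wikidata.org/entity/Q_B"},
         "parent": {"value": "http://www.wikidata.org/entity/Q35120"}},
    ]}
}

edges = edges_from_results(SAMPLE)
print(build_url(build_query()))
print(in_degree(edges))
```

The sketch only constructs the request URL and parses a stand-in response; fetching live results and walking further down the hierarchy would be separate steps.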

Tracking organisations as sources in Wikipedia

Dr Amanda Lawrence (Research Fellow, RMIT)

This presentation reports on a recent research project exploring the sources used for public interest and policy-related articles on Wikipedia, particularly the role of research publishing by organisations, such as reports and policy papers (grey literature). Although generally not included in Wikipedia guidelines on ‘reliable sources’, we know that reports and papers are widely used for public policy research and practice. To what extent are organisations used as sources on public policy topics on Wikipedia, and what can we learn about who they are through linked data via Wikidata and other sources? To investigate this question, we developed a knowledge graph of around 1000 public policy-related articles and their citations on English Wikipedia, starting from 10 key Wikipedia articles (including climate policy, health policy, education policy, international relations policy and economic policy) and following their internal page links. This data set was then filtered for concepts, and the citations were extracted using the Wikipedia API, then augmented and analysed using linked data from Wikidata and other databases such as CrossRef, ISNI and OpenAlex to determine publisher or organisation name, type, sector, location, publication format and other information. This research shows how Wikipedia and Wikidata can provide insights into wider questions of knowledge production as well as the way in which those forces play out within Wikimedia projects. This project was conducted in collaboration with Angel Felipe Magnossao de Paula and supported by a Wikimedia Foundation research grant.
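The step of starting from seed articles and following internal page links can be pictured as a bounded breadth-first expansion. The Python sketch below is illustrative only: the function names, the depth and page limits, and the small link table are hypothetical stand-ins, and in practice `get_links` would wrap a call to the Wikipedia API rather than a dictionary lookup.

```python
from collections import deque


def expand_from_seeds(seeds, get_links, max_depth=1, max_pages=1000):
    """Collect pages reachable from the seeds, breadth first.

    Returns the set of page titles and the list of (source, target)
    link edges between collected pages.
    """
    seen = set(seeds)
    queue = deque((title, 0) for title in seeds)
    edges = []
    while queue:
        title, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for target in get_links(title):
            if target not in seen:
                if len(seen) >= max_pages:
                    continue  # page budget reached; skip new pages
                seen.add(target)
                queue.append((target, depth + 1))
            edges.append((title, target))
    return seen, edges


# Hypothetical link table standing in for real page-link data.
LINKS = {
    "Climate policy": ["Carbon tax", "Health policy"],
    "Health policy": ["Public health"],
    "Carbon tax": ["Emissions trading"],
}

pages, edges = expand_from_seeds(
    ["Climate policy", "Health policy"],
    lambda title: LINKS.get(title, []),
    max_depth=1,
)
print(sorted(pages))
print(edges)
```

The depth and page limits are one simple way to keep such a crawl near a target size; filtering for concepts and extracting citations would follow as separate steps.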

Verifiability, epistemic value and the open knowledge market

Dr Michael Davis and Dr Heather Ford (Associate Professor, University of Technology Sydney)

Critical repositories of open knowledge, especially Wikipedia, are increasingly used as data sources for tools like Google knowledge graphs, virtual assistants and generative AI models (Ford & Iliadis, 2023). The demands of expediency in this economy are potentially threatening key elements of the infrastructure of open knowledge, in what McDowell and Vetter (2023) have recently called the ‘realienation of the commons’. In this paper we focus on verifiability, a core content policy which underpins both the epistemic value of Wikipedia itself and of Wikipedia data as it is disseminated beyond Wikipedia. With the rise of the Wikimedia strategy of ‘knowledge as a service’ (Zia et al., 2019), verifiability is under threat. Worryingly, this threat is also apparent within the Wikimedia ecosystem itself, even as the Wikimedia Foundation has focused on verifiability in the context of its knowledge integrity program (Zia et al., 2019).

We understand verifiability through the lens of pragmatist epistemology, in particular C.S. Peirce’s conception of knowledge production as a social process governed by evolving epistemic norms that have their basis in an ethics of inquiry. By means of this framework we articulate the epistemic value of verifiability for the public knowledge ecosystem and use this to analyse the implications of the loss of verifiability for open knowledge. We also apply concepts from political economy and philosophy to explore the power dynamics that underlie the loss of verifiability and the potential impact of its loss on the public sphere.

People


Lead curator and contact:

Participants:

Steve Jankowski

Sohyeon Hwang

Francesca Sidoti

References

Falk, M., Ford, H., Tall, K., & Pietsch, T. (2023). How Australians are represented in Wikipedia. Reports of the Wikihistories Project. University of Technology, Sydney. https://doi.org/10.5281/zenodo.10296217

Ford, H. (2020). Rise of the underdog. In J. Reagle & J. Koerner (Eds.), Wikipedia @ 20: Stories of an Incomplete Revolution (pp. 189–201). MIT Press.

Ford, H., & Graham, M. (2016). Provenance, power and place: Linked data and opaque digital geographies. Environment and Planning D: Society and Space, 34(6), 957-970. https://doi.org/10.1177/0263775816668857

Ford, H., & Iliadis, A. (2023). Wikidata as semantic infrastructure: Knowledge representation, data labor, and truth in a more-than-technical project. Social Media + Society, 9(3). https://doi.org/10.1177/20563051231195552

Iliadis, A. (2022). Semantic media: Mapping meaning on the internet. Polity.

Iliadis, A., & Ford, H. (2023). Fast facts: Platforms from personalization to centralization. Social Media + Society, 9(3). https://doi.org/10.1177/20563051231195546

McDowell, Z. J., & Vetter, M. A. (2023). The re-alienation of the commons: Wikidata and the ethics of “free” data. International Journal of Communication, 18. https://ijoc.org/index.php/ijoc/article/view/20807

Zia, L., Johnson, I., Mansurov, B., Morgan, J., Redi, M., Saez-Trumper, D., & Taraborelli, D. (2019). Knowledge Gaps – Wikimedia Research 2030. https://doi.org/10.6084/m9.figshare.7698245.v1