Wikipedia:Wikipedia Signpost/2023-11-20/Recent research

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by HaeB at 20:00, 19 November 2023 (subtitle, byline, image). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Recent research

Canceling disputes as the real function of ArbCom


A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.


Canceling Disputes: How Social Capital Affects the Arbitration of Disputes on Wikipedia

Reviewed by Bri

This provocative paper[1] by a socio-legal scholar shows, through research based mostly on interviews with Wikipedia insiders, that the Arbitration Committee functions to cancel disputes rather than to arbitrate toward a compromise position, reach a negotiated settlement, or actively promote truthful content, as one might naively infer from the committee's name.

Some of the arguments and language used in the paper are both arresting and concerning. This reviewer found the interpretive language, and the often verbatim quotes of people involved in the arbitration process – often deeply involved, including at least one described as a member of the committee – more compelling than the light data analysis included in the paper. The author interviewed 28 editors: current or former members of the committee, editors who have been involved parties, editors who have commented on cases, and those "who have knowledge of the dispute resolution process due to their long-standing involvement with Wikipedia" (not further defined).

"Social Capital and the Arbitration Committee’s Remedies" (figure 2 from the paper)

The data analysis consisted of a breakdown of sanction severities against edit count as a proxy for social capital, and found a negative correlation between light severity outcomes (admonishment) and heavy severity (up to and including site bans); see figure 2, above. The author presented two potential interpretations. The first, conventional one is that more mature and upstanding editors with deep social capital were more likely to obey norms. The second is that editors with social capital were free to disobey norms without severe consequences because of the wiki's empowerment of bad behavior through various means, in essence validating the "cabal" or "too essential to be lost" mentality that endows a "wiki aristocracy" capable of either creating true consensus or promoting their "version of the truth", to quote the paper (p. 15). The attempt to determine which of these theories was correct was not data-driven.
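To make the kind of analysis described here concrete, the following is a minimal sketch of a rank correlation between edit count and sanction severity. It is not the author's method and does not use the paper's data: the edit counts, the three-level severity scale, and all function names are invented for illustration.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the window over every element tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = avg_rank
        pos = end + 1
    return result


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Invented toy data, NOT taken from the paper.
edit_counts = [500, 2_000, 8_000, 30_000, 120_000, 400_000]
# Hypothetical ordinal scale: 1 = admonishment, 2 = intermediate sanction, 3 = site ban.
severity = [3, 3, 2, 2, 1, 1]

rho = spearman(edit_counts, severity)
print(f"Spearman rho = {rho:.3f}")  # negative: higher edit count, lighter sanction
```

A negative coefficient of this kind is compatible with both interpretations discussed in the review; the number alone cannot tell whether high-edit-count editors behave better or are simply sanctioned more lightly.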

The key idea in the paper is that social capital – largely built up and represented by an editor's edit count regardless of their ability to peacefully coexist with other editors – is the most important factor when it comes to arbitration. The committee's purpose is to quash disputes in order for editing to continue, not to reach a "just" outcome in some broader sense. One way the social capital is expressed and brought to bear is essentially in the opening phases of an arbitration case, called preliminary statements. If one reads between the lines of the paper, the outcome is frequently predetermined by these opening phases and all that the committee can do is go along with the crowd. In fact it is explicitly stated, again based on evidence gathered from insiders, that cases are frequently orchestrated off-wiki precisely in order to stack the deck against the other side.

[A] Wikipedia insider told me how a disputant prepared her "faction" for months before bringing a case before the Arbitration Committee (which she ended up winning). These efforts are usually made covertly, as Wikipedia norms prohibit what is called "canvassing"...for instance ... on a secret mailing list ... A long-standing editor who was described as a member of Wikipedia's "aristocracy" told me: "we are a tight clique of very long-standing editors and none of our words find their way onto the site"...
— p. 12

Sadly for Wikipedians, the author concludes that it is the Machiavellian use of power that holds true on Wikipedia; in other words, the cabal is real. One passage that comes across as especially skeptical of this structure is found on p. 17: "an editor compared the Arbitration Committee to 'riot cops' … [who] can be compared to the 'repressive peacemakers'…guaranteeing the level of social peace that is necessary for the Wikipedia project to unfold, even to the detriment of fairness." Then the author appears to liken the arbitration process to a trial by ordeal, a feudal concept eschewed by the West in favor of legal proceedings based on due process, and further states:

My empirical findings are consistent with the argument that, despite its rhetoric of inclusiveness ("anyone can edit"), Wikipedia is a "unwelcoming and exclusive environment" for newcomers, which tends to reinforce the "hegemony" of a consensus that is mostly shaped and controlled by white Western men.
— p. 19

Summing up on the next page:

[W]hat emerges from the evidence I have collected, and is perhaps more conclusive, is that experienced editors with dense networks are well positioned to avoid the consequences of their own breaches and to use their power to prevail in disputes against weaker parties.
— p. 20

In other words, a system that puts the powerful above the law.

15% of datasets for fine-tuning language models use Wikipedia

A new preprint titled "The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI"[2] presents results from "a multi-disciplinary effort between legal and machine learning experts to systematically audit and trace [...] the data lineage of 44 of the most widely used and adopted text data collections, spanning 1800+ finetuning datasets" that have been published on platforms such as Hugging Face or GitHub. The authors make their resulting annotated dataset of annotated datasets available online, searchable via a "Data Provenance Explorer".

The paper presents various quantitative results based on this dataset. The most widely used source domain was found to be wikipedia.org, occurring in 14.9% (p.14) or 14.6% (Table 4, p.13) of the 1800+ datasets. This result illustrates the value Wikipedia provides for AI (although it also means, conversely, that over 85% of those datasets made no use of Wikipedia).
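As a rough check on what these percentages mean in absolute terms, here is a small sketch. It assumes exactly 1800 datasets, which is only a lower bound since the paper says "1800+"; the resulting counts are therefore approximations, not figures from the paper, and the function name is hypothetical.

```python
# Assumption: the paper reports "1800+" datasets, so 1800 is a lower bound.
TOTAL_DATASETS = 1800


def share_to_count(share_percent, total=TOTAL_DATASETS):
    """Convert a percentage share into an approximate number of datasets."""
    return round(total * share_percent / 100)


# The two shares quoted from the paper (p. 14 and Table 4, p. 13).
for label, share in [("p. 14", 14.9), ("Table 4", 14.6)]:
    used = share_to_count(share)
    print(f"{label}: {share}% is about {used} datasets using wikipedia.org; "
          f"{100 - share:.1f}% (about {TOTAL_DATASETS - used}) do not")
```

Under this assumption, either figure corresponds to somewhat more than 260 datasets drawing on Wikipedia, with roughly 1,500 that do not.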

The paper highlights the following example of such a dataset that used Wikipedia:

Supervised Dataset Example: SQuAD

Rajpurkar et al. (2016) present a prototypical supervised dataset on reading comprehension. To create the dataset, the authors take paragraph-long excerpts from 539 popular Wikipedia articles and hire crowd-source workers to generate over 100,000 questions whose answers are contained in the excerpt. For example:

Wikipedia Excerpt: In meteorology, precipitation is any product of the condensation of atmospheric water vapor that falls under gravity.

Worker-generated question: What causes precipitation to fall? Answer: Gravity

Here the authors use Wikipedia text as a basis for their data and their dataset contains 100,000 new question-answer pairs based on these texts.
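The structure of such an example can be sketched as a small record type: a Wikipedia excerpt, a worker-written question, and an answer that must appear verbatim in the excerpt. The class and field names below are assumptions chosen for illustration and are not the actual SQuAD file format.

```python
from dataclasses import dataclass


@dataclass
class QARecord:
    """One extractive question-answer pair (hypothetical field names)."""
    context: str        # the Wikipedia excerpt
    question: str       # the crowd-worker's question
    answer_text: str    # the answer exactly as it appears in the context
    answer_start: int   # character offset of the answer within the context


def make_record(context, question, answer):
    """Build a record, checking that the answer is contained in the excerpt."""
    # Case-insensitive search; adequate for ASCII text such as this example.
    start = context.lower().find(answer.lower())
    if start == -1:
        raise ValueError("answer must appear in the context")
    return QARecord(context, question, context[start:start + len(answer)], start)


context = ("In meteorology, precipitation is any product of the condensation "
           "of atmospheric water vapor that falls under gravity.")
record = make_record(context, "What causes precipitation to fall?", "Gravity")
print(record.answer_text, record.answer_start)
```

Storing the character offset alongside the answer text keeps each question tied to a specific span of the Wikipedia excerpt, which is what makes the answer "contained in the excerpt" in the sense described above.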

The bulk of the paper is of less interest to Wikimedians specifically, focusing instead on general questions about the sourcing information about these datasets ("we are in the midst of a crisis in dataset provenance") and their licenses (observing e.g. "sharp divides in composition and focus of commercially open vs closed datasets, with closed datasets monopolizing important categories: lower resource languages, more creative tasks, richer topic variety, newer and more synthetic training data"). An extensive "Legal Discussion" section acknowledges that the paper leaves out "several important related questions on the use of copyrighted works to create supervised datasets and on the copyrightability of training datasets." In particular, it does not examine whether the Wikipedia-based datasets satisfy the requirements of Wikipedia's CC BY-SA license. Regarding the use of CC-licensed datasets in AI in general, the authors note: "One of the challenges is that licenses like the Apache and the Creative Commons outline restrictions related to 'derivative' or 'adapted works' but it remains unclear if a trained model should be classified as a derivative work." They also remind readers that "In the U.S., the fair use exception may allow models to be trained on protected works," although "the application of fair use in the context is still evolving and several of these issues are currently being litigated."

(The datasets examined in the paper are to be distinguished from the much larger unlabeled text corpuses used for the initial unsupervised training of large language models (LLMs). There, Wikipedia is also known to have been used, alongside other sources such as Common Crawl, e.g. for the GPT-3 family that formed the basis of ChatGPT.)


Briefly

Other recent publications

Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.

Compiled by Ca and Tilman Bayer

"Evaluation of Accuracy and Adequacy of Kimchi Information in Major Foreign Online Encyclopedias"

From the abstract:[3]

In this study, we analyzed the content and quality of kimchi information in major foreign online encyclopedias, such as Baidu Baike, Encyclopædia Britannica, Citizendium, and Wikipedia. Our results revealed that the kimchi information provided by these encyclopedias was often inaccurate or inadequate, despite kimchi being a fundamental part of Korean cuisine. The most common inaccuracies were related to the definition and origins of kimchi and its ingredients and preparation methods.

"Speech Wikimedia: A 77 Language Multilingual Speech Dataset"

Abstract:[4]

"The Speech Wikimedia Dataset is a publicly available compilation of audio with transcriptions extracted from Wikimedia Commons. It includes 1780 hours (195 GB) of CC-BY-SA licensed transcribed speech from a diverse set of scenarios and speakers, in 77 different languages. Each audio file has one or more transcriptions in different languages, making this dataset suitable for training speech recognition, speech translation, and machine translation models."


References

  1. ^ Grisel, Florian (2023-05-04). "Canceling Disputes: How Social Capital Affects the Arbitration of Disputes on Wikipedia". Law & Social Inquiry: 1–22. doi:10.1017/lsi.2023.15. ISSN 0897-6546.
  2. ^ Longpre, Shayne; Mahari, Robert; Chen, Anthony; Obeng-Marnu, Naana; Sileo, Damien; Brannon, William; Muennighoff, Niklas; Khazam, Nathan; Kabbara, Jad; Perisetla, Kartik; Wu, Xinyi; Shippole, Enrico; Bollacker, Kurt; Wu, Tongshuang; Villa, Luis; Pentland, Sandy; Roy, Deb; Hooker, Sara (2023-11-04), The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI, arXiv, doi:10.48550/arXiv.2310.16787
  3. ^ Park, Sung Hoon; Lee, Chang Hyeon (2023). "Evaluation of Accuracy and Adequacy of Kimchi Information in Major Foreign Online Encyclopedias". Journal of the Korean Society of Food Culture. 38 (4): 203–216. doi:10.7318/KJFC/2023.38.4.203. ISSN 1225-7060. (in Korean, with English abstract)
  4. ^ Gómez, Rafael Mosquera; Eusse, Julián; Ciro, Juan; Galvez, Daniel; Hileman, Ryan; Bollacker, Kurt; Kanter, David (2023-08-29), Speech Wikimedia: A 77 Language Multilingual Speech Dataset, arXiv, doi:10.48550/arXiv.2308.15710
