Research:Wikimedia Summer of Research 2011

This summer, the Wikimedia Foundation will bring in a handful of graduate students to work with the Community Department, led by Diederik van Liere and Maryana Pinchuk, on a few months of rapid iteration on vital research questions about recruiting and retaining new editors on Wikipedia. This page is a placeholder for links to our announcements and preliminary research. If you have any questions, or are a Wikimedian interested in participating in either quantitative or qualitative research, please comment on the Talk page.

Preliminary work

These datasets and analyses are mostly test cases for the rest of the summer's work, but they nonetheless suggest some interesting trends. Our basic methodologies are described below.

Assessing quality of the first edits made by new editors, 2004 and 2011

How many contributions by new editors are made in good faith and are worth retaining or improving? Are most edits by newbies vandalism or spam, or are they made primarily in good faith?

We selected a random sample of first edits by contributors who joined in April 2004 and in April 2011, derived via a simple SQL query run against the Toolserver. We then analyzed these edits by hand, ranking each first edit on a 1-5 scale, with 1 being pure vandalism and 5 being a well-referenced content addition indistinguishable from the edit of an experienced contributor. We also noted when the first edit was not a mainspace contribution, and whether or not it was vandalism.
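As an illustration of how hand-coded scores on this scale could be tallied, here is a minimal Python sketch. The function name `tally_scores` and the sample list are hypothetical; the scores are made up for the example and are not the study's data. Only the two endpoints of the scale are defined above, so the sketch does not assign meanings to the values 2 through 4.

```python
from collections import Counter
from typing import Dict, Iterable

# Endpoints of the rating scale described above:
# 1 = pure vandalism, 5 = well-referenced content addition.
MIN_SCORE, MAX_SCORE = 1, 5


def tally_scores(scores: Iterable[int]) -> Dict[int, int]:
    """Count how many first edits received each score from 1 to 5."""
    counts: Counter = Counter()
    for score in scores:
        if not MIN_SCORE <= score <= MAX_SCORE:
            raise ValueError(f"score {score} is outside the 1-5 scale")
        counts[score] += 1
    # Report every score on the scale, including those that never occurred.
    return {s: counts.get(s, 0) for s in range(MIN_SCORE, MAX_SCORE + 1)}


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    hypothetical_scores = [1, 3, 5, 4, 3, 3, 2]
    print(tally_scores(hypothetical_scores))
```

Non-mainspace first edits, which were noted separately, would need their own tally.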

Results are described at: "How much do new editors actually improve Wikipedia?"

We'll publish the totals shortly, but the samples themselves will not be distributed, to avoid calling out individual editors by name.

The type and tone of user talk page edits directed at new editors within their first 30 days

The previous experiment gave us an idea of how many new editors made valuable contributions according to Wikipedia standards. As a follow-up, we wanted to look at how these good-faith contributors were being communicated with on their user talk pages early on.

Process

We prepared another random sample of several hundred edits made to user talk pages of new registered users on English Wikipedia from 2004 through 2011. These edits were made by other contributors within 30 days of a new person’s first edit.

The sample was gathered using the Toolserver. The following query shows how the 2005 set was gathered; to run it for a different year, change the timestamps. For very early years such as 2004, when there were fewer editors altogether, we limited the query to 500.

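-- Revisions made by other users to the talk pages of sampled new users (2005 set).
USE enwiki_p;
SELECT su.user_name, r.rev_id
FROM (
    -- Users registered in February 2005, with the timestamp of their earliest matching revision.
    SELECT u.user_id, u.user_name, u.user_registration, MIN(r.rev_timestamp) AS t
    FROM user u
    INNER JOIN revision r
        ON u.user_id = r.rev_user
    INNER JOIN page p
        ON r.rev_page = p.page_id
    WHERE u.user_registration BETWEEN '20050201000000' AND '20050301000000'
      AND u.user_id BETWEEN 135000 AND 235000
      -- Only revisions made less than 7 days after registration.
      AND UNIX_TIMESTAMP(r.rev_timestamp) - UNIX_TIMESTAMP(u.user_registration) < (60*60*24*7)
      AND p.page_namespace = 1
    GROUP BY u.user_id
    LIMIT 500
) su
-- Match each sampled user to the page whose title equals the user name.
INNER JOIN page p
    ON su.user_name = p.page_title
INNER JOIN revision r
    ON r.rev_page = p.page_id
   AND r.rev_user != su.user_id  -- exclude the new user's own edits
WHERE p.page_namespace = 3  -- user talk namespace
  -- Drop revisions made 30 or more days after the user's first matching revision (su.t).
  AND UNIX_TIMESTAMP(r.rev_timestamp) - UNIX_TIMESTAMP(su.t) < (60*60*24*30);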
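
To adapt the query to another year, the two registration timestamps need to be rewritten in the same 14-digit YYYYMMDDHHMMSS form. The Python sketch below generates that pair for a given year. The helper names are hypothetical, and it assumes that the same February window is wanted for every year, which the 2005 example suggests but does not confirm. The user_id range in the query also appears specific to the 2005 cohort and would presumably need adjusting by hand.

```python
from datetime import datetime
from typing import Tuple


def mediawiki_timestamp(moment: datetime) -> str:
    """Format a datetime as the 14-digit YYYYMMDDHHMMSS string used in the query."""
    return moment.strftime("%Y%m%d%H%M%S")


def february_window(year: int) -> Tuple[str, str]:
    """Return (start, end) registration bounds covering February of the given year.

    Assumption: other years use the same 1 February to 1 March window as 2005.
    """
    start = mediawiki_timestamp(datetime(year, 2, 1))
    end = mediawiki_timestamp(datetime(year, 3, 1))
    return start, end


if __name__ == "__main__":
    start, end = february_window(2005)
    print(f"u.user_registration BETWEEN '{start}' AND '{end}'")
```

For 2005 this reproduces the bounds used in the query above, '20050201000000' and '20050301000000'.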

Results

Results are described at: "The Rise of Warnings to New Editors on English Wikipedia". The totals data is below, but the actual samples will not be distributed to avoid calling out individual editors by name.

Two types of edits made to the user talk pages of good faith editors, correlated with tone analysis

Year | Edits that included praise | Edits that added a template with a negative tone | Total number of edits analyzed
2004 | 36 | 0 | 251
2005 | 23 | 0 | 223
2006 | 26 | 11 | 243
2007 | 5 | 24 | 347
2008 | 7 | 33 | 235
2009 | 13 | 36 | 176
2010 | 3 | 50 | 209
2011 | 6 | 84 | 244

Expressed as a percentage of all edits analyzed in each year's sample, the totals are as follows:

Year | Edits that included praise | Edits that added a template with a negative tone
2004 | 14.34% | 0.00%
2005 | 10.31% | 0.00%
2006 | 10.70% | 4.53%
2007 | 1.44% | 6.92%
2008 | 2.98% | 14.04%
2009 | 7.39% | 20.45%
2010 | 1.44% | 23.92%
2011 | 2.46% | 34.43%
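
The percentages can be recomputed from the totals with a few lines of Python. This is only a sketch of the arithmetic: the counts are copied from the totals table above, and the names `COUNTS`, `share`, and `percentages` are illustrative.

```python
# Counts copied from the totals table:
# year -> (edits with praise, edits adding a negative-tone template, total edits analyzed)
COUNTS = {
    2004: (36, 0, 251),
    2005: (23, 0, 223),
    2006: (26, 11, 243),
    2007: (5, 24, 347),
    2008: (7, 33, 235),
    2009: (13, 36, 176),
    2010: (3, 50, 209),
    2011: (6, 84, 244),
}


def share(part: int, total: int) -> float:
    """Return part as a percentage of total, rounded to two decimal places."""
    return round(100 * part / total, 2)


def percentages(counts):
    """Map each year to its (praise %, negative-template %) pair."""
    return {
        year: (share(praise, total), share(negative, total))
        for year, (praise, negative, total) in counts.items()
    }


if __name__ == "__main__":
    for year, (praise, negative) in percentages(COUNTS).items():
        print(f"{year}  praise {praise:5.2f}%  negative templates {negative:5.2f}%")
```

Each percentage is the count divided by that year's total number of edits analyzed, so the figures describe the sample rather than all talk page edits made in that year.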