Grants:IEG/Automated Notability Detection

  • status: proposed
  • project: Automated Notability Detection
  • summary: Using machine learning to determine whether articles are notable. Help Draft reviewers and NPPs make better, faster, easier decisions.
  • target: Currently English Wikipedia, but it should be easily adaptable to many languages.
  • strategic priority: increasing participation
  • amount: 16,575 USD
  • grantee: Bluma.Gelley
  • contact: bgelley@nyu.edu
  • volunteers: Jodi.a.schneider, EpochFail
  • created on: 22:24, 9 September 2014 (UTC)
  • round: 2014 round 2



Project idea

What is the problem you're trying to solve?

A large volume of articles is added to Wikipedia, in both the Main namespace and the new Draft namespace. All of these articles require some form of review, and one of the major elements of that review is whether or not the article is about a notable topic. Notability is difficult to determine without substantial familiarity with the notability guidelines, and sometimes domain expertise. In general, there are not enough reviewers, particularly in the Draft namespace (many of whose articles were formerly in Articles for Creation), and reviewing is a slow, laborious process. The heavy load on AfC reviewers and New Page Patrollers leads some to make overly hasty decisions and can drive away well-meaning new users whose good-faith articles are quickly deleted. Since it can be difficult to determine a topic's notability adequately and quickly, reviewers often end up basing their judgments on the content and quality of the draft or article, rather than the potential notability of its subject.

What is your solution?

We propose to use machine learning to support the reviewing process. We'll train classification algorithms to automatically determine whether articles are notable or not. The notability scores (e.g. probability of notability and confidence) calculated by the classifier will be made available via public APIs on Wikimedia Labs. We hope that these scores will give users more information about the notability of articles they may be unsure about, helping reviewers find and improve articles that are indeed notable even though they do not seem so, and supporting better decisions.
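To illustrate the general approach, here is a minimal sketch in Python of training such a classifier, assuming scikit-learn and using synthetic placeholder data in place of the real feature matrix and hand-labeled notability judgments; the actual features and model choice are still open questions:

    # Minimal sketch: train a classifier that outputs a notability probability.
    # The data here is synthetic; in practice X would hold per-draft features and
    # y the notable / not-notable labels from the manually coded sample.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, n_features=6, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)

    # Probability that each held-out draft's topic is notable (class 1).
    notability_scores = model.predict_proba(X_test)[:, 1]

A probability-based model like this also yields the confidence information mentioned above, rather than a bare yes/no decision.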

Note that this system will not be assessing whether or not an article should be deleted. The classifier's output will only address whether the topic of an article draft is probably notable. We feel that this is an important distinction. Stated simply, the algorithm is intended to support human judgement -- not replace it. We also do not want our algorithm to become a crutch for users to hastily delete or decline articles without due thought. We therefore plan to return only those scores that are > .5, i.e., where the article is more than 50% likely to be notable; any less than that and the user will receive a message that the notability of the article cannot be determined.
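As a concrete illustration of that thresholding rule, here is a small sketch; the response format and message text are placeholders, not a finalized design:

    # Sketch: report a score only when it exceeds 0.5; otherwise return a neutral
    # message so a low score cannot be used as a quick deletion rationale.
    def notability_response(score, threshold=0.5):
        if score > threshold:
            return {"notable_probability": round(score, 2)}
        return {"message": "The notability of this article cannot be determined."}

    print(notability_response(0.82))  # {'notable_probability': 0.82}
    print(notability_response(0.31))  # {'message': 'The notability of this article cannot be determined.'}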

Project goals

We hope that making reviewing easier will:

  1. improve the quality of notability assessments by calling attention to drafts or articles that seem to be about notable topics
    • Hypothesis 1: Drafts about truly notable topics will be less likely to be deleted/declined.
  2. decrease the workload of current draft reviewers by reducing the burden of assessing notability
    • Hypothesis 2: Reviewers using the notability tool will effectively review drafts more quickly.

Project plan

Activities

  1. Create a manually coded sample of articles:
    • In order to train a classifier, we will need to start with labeled data -- a sample of article drafts that have been carefully assessed for notability. In order to obtain this labeled data, we will create a random sample of draft articles and present them to Wikipedians (found via WikiProjects relevant to the articles' content) to determine if they are notable or not.
  2. Define and extract notability features: To support a classifier's ability to differentiate between notable and non-notable drafts, we will need to specify and extract features that carry a relevant "signal" for notability (e.g. How many web search engine results? How many red links appear to reference the topic? Can the article's topic be matched to an existing category? etc.); a minimal sketch of such feature extraction follows this list.
  3. Train and test classifiers: We will use the feature set and labeled data to train classifiers and test them against a reserved set of labeled data to identify the fitness of different classification strategies and choose the most accurate.
  4. Serve via API on WMF Labs: Once we have a functioning classifier, we will expose it to wiki tools on a Wikimedia Labs instance. This will allow us and other wiki-tool developers to make use of the service. We will also build a minimal, proof-of-concept wiki gadget to demonstrate the utility of the classifier's scores to draft reviewers.
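To make the feature-extraction step more concrete, here is a minimal Python sketch of assembling a few candidate signals into a numeric feature vector; the signal names and sources are illustrative assumptions, not a committed feature set:

    # Sketch: turn candidate notability signals into a feature vector for the
    # classifier. Real values would come from search engines, the MediaWiki API,
    # and the draft's wikitext; the signals below are illustrative only.
    def feature_vector(signals):
        return [
            signals.get("search_result_count", 0),             # web search hits for the topic
            signals.get("incoming_red_links", 0),              # red links elsewhere that appear to reference the topic
            int(signals.get("matches_existing_category", False)),
            signals.get("reference_count", 0),                  # cited sources in the draft
            len(signals.get("wikitext", "")),                   # rough proxy for draft length
        ]

    example = {
        "search_result_count": 42,
        "incoming_red_links": 3,
        "matches_existing_category": True,
        "reference_count": 5,
        "wikitext": "'''Example topic''' is a ...",
    }
    print(feature_vector(example))  # one numeric row per draft, ready for training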

Timeline:

  • Weeks 1-4: Brainstorming features and determining the feasibility of obtaining them; researching methods for determining the topic of an unclassified article. We will try to solicit community suggestions in this step.
  • Weeks 1-4 (simultaneous with the above step): Manual labeling of training data by subject-matter experts.
  • Weeks 5-9: Implementing the features in code; testing the classifier.
  • Weeks 10-12: Improving the classifier, adding and removing features as necessary.
  • Weeks 12-14: Thorough validation; creating a proof-of-concept website for community members to evaluate the classifier.
  • Weeks 15-18: Building the classifier into an API that returns notability and confidence scores.

Budget

  • Graduate student salary for grantee for the duration of the research: 30 USD/hour for 20 hours a week * 14 weeks: 8,400 USD
  • Possible travel to CSCW 2015 in March to present the results to the Wiki academic community and get input and advice: 1,800 USD (airfare, hotel, earlybird registration, and miscellaneous expenses); if I win the CSCW Student Volunteer lottery, the cost will decrease by ~350 USD (registration cost). If we are not selected to present our research, the grant will decrease by this amount.
  • Cost to hire someone to implement the API: 50 USD/hour for 120 hours: 6,000 USD. If there are any volunteers to do this job, the grant will decrease by this amount.
  • Human resources - finding the API developer, managing their work, etc.: 25 USD/hour for 15 hours = 375 USD
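For reference, the maximum line items above sum to 8,400 + 1,800 + 6,000 + 375 = 16,575 USD, matching the total amount requested; the travel and API-development items may decrease as noted.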

Community engagement

We plan to ask members of various WikiProjects to help in constructing a training set. These volunteers would read a number of articles and mark them as notable or not notable. In this way, we can create a gold-standard data set based on expert judgement. We will also continuously solicit help and advice from the community as to what they would like to see in such a tool; we would love it if members of the community suggested possible features for the classifier, based on their experience of which aspects of an article they look at when determining notability.

Sustainability

We hope that by the time the grant period ends, we will have a working, robust classifier whose decisions are made available through an API. This should allow maximum flexibility for others to build tools using the scores. We will provide detailed documentation of how the system works and open-source the code so that it can be improved by anyone who wishes. We hope that others will build on top of this API and create tools that will help different parts of the community make better decisions. We will also solicit help from the community in continuing to label articles as notable or not so we can keep expanding our training set to make more accurate predictions. (This is the paradigm used by Cluebot NG; see here.)
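As an illustration of how a reviewer gadget or bot might consume such an API, here is a sketch in Python; the endpoint URL and response fields are hypothetical placeholders, since no actual service exists yet:

    # Sketch: query a hypothetical notability-scoring endpoint on Wikimedia Labs.
    # The URL and JSON fields are assumptions for illustration only.
    import requests

    def get_notability_score(page_title):
        url = "https://notability.wmflabs.org/api/score"  # hypothetical endpoint
        resp = requests.get(url, params={"title": page_title}, timeout=10)
        resp.raise_for_status()
        return resp.json()  # expected (assumed) fields: "notable_probability" or "message"

    # Example usage (would only work once a real service is deployed):
    # print(get_notability_score("Draft:Example topic"))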

Measures of success

We will be using machine learning, so we can measure our success by the precision and recall of our classifier. Since this is a hard problem, we will consider around 75% accuracy to be good. This is more than high enough to use as a first step in the review process to help reviewers with their work.
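For clarity on how those figures would be computed, here is a small sketch using scikit-learn's standard metrics; the labels are placeholders chosen to illustrate the calculation, not real evaluation results:

    # Sketch: compute accuracy, precision, and recall on a held-out labeled set.
    # The true/predicted labels below are toy placeholders.
    from sklearn.metrics import accuracy_score, precision_score, recall_score

    y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = notable, 0 = not notable
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # classifier output on the same drafts

    print("accuracy:", accuracy_score(y_true, y_pred))    # fraction of correct calls (0.75 here)
    print("precision:", precision_score(y_true, y_pred))  # of drafts scored notable, how many truly are (0.75 here)
    print("recall:", recall_score(y_true, y_pred))        # of truly notable drafts, how many were found (0.75 here)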

We would also like to get community feedback on our classifier's results. (Thanks, Siko (WMF)!) Though we do not expect to have a fully functional tool at the end of the grant period, we hope to make a web page available where users can input an article, receive our classifier's score for it, and mark whether or not they agree with the classifier. This will both allow the community to decide whether our classifier is meeting their needs and help us improve the classifier by expanding its training set.

Get involved

Participants

Bluma Gelley : I am a PhD student at New York University and have done a significant amount of research on Wikipedia. In particular, my research looked at some of the problems with the deletion/New Page Patrol process, and with the Articles for Creation/Draft process. Both these processes could be improved by making automated notability detection available to those reviewing/vetting articles.

I have previously done related work on automatically predicting articles that will likely be deleted; this project would build on that using a new, hand-constructed training set of recent articles for better results. In this published paper, I attempted to detect notability, but I suspect that I was successful only in predicting deletion. Besides a better training set, I also plan to use better features. I already have the framework for the classifier, so part of the work is done already.

Jodi Schneider has done research on the deletion process and problems with AfC/Draft process. She brings the perspective of qualitative research, which enhances the proposed quantitative work.

Aaron Halfaker (EpochFail) is a Research Scientist at the Wikimedia Foundation (staff account: Halfak (WMF)) and has developed tools that use intelligent algorithms to support wiki work (e.g. en:WP:Snuggle and R:Screening WikiProject Medicine articles for quality) and has performed extensive research on problems related to deletion, AfC, and newcomer socialization in general.

Community Notification

Please paste links below to where relevant communities have been notified of your proposal, and to any other relevant community discussions. Need notification tips?

The following communities have been notified. We plan to contact several WikiProjects for help with manual labeling of articles once we have a set of articles to work with.


Endorsements

Do you think this project should be selected for an Individual Engagement Grant? Please add your name and rationale for endorsing this project in the list below. (Other constructive feedback is welcome on the talk page of this proposal).

  • Community member: add your name and rationale here.
  • As a long-term participant and closer at AfD, and a frequent participant at AfC on ENWIKI, I have long experienced the grave difficulties involved both in working to our notability guidelines and in explaining them to new editors. This effort holds promise for moving us in the direction of assisting reviewers and new editors at a process I currently describe as a "whisper chipper for new editors." I do not mean to suggest that I expect an immediate panacea here, but this is the right first step towards what I hope will be a longer series of efforts leveraging what technologies we can to solve some of Wikipedia's biggest user engagement hurdles. I strongly endorse this effort. --Joe Decker (talk) 15:35, 16 September 2014 (UTC)
  • I am supportive of this idea, as I would like to start seeing more formal ways to judge notability on our site outside of individual ideas on notability. Kevin Rutherford (talk) 02:04, 17 September 2014 (UTC)
  • Support the idea. Additionally, if this tool could also pick out unsourced articles (being unsourced often goes hand-in-hand with being non-notable) and be integrated into the helper script, it could massively improve the reviewing workflow. Mdann52 (talk) 07:38, 17 September 2014 (UTC)
  • This does not require machine learning, so will not be part of the preliminary prototype, but should be very easy to integrate into the completed tool. Thanks for your support! Bluma.Gelley (talk) 07:53, 17 September 2014 (UTC)
  • Yeah, I am aware. Just giving suggestions of possible future features, and something that could be picked up on to show no notability :) Mdann52 (talk) 12:35, 17 September 2014 (UTC)
  • Definitely! Please keep the suggestions coming! Bluma.Gelley (talk) 20:40, 17 September 2014 (UTC)
  • Integration into the AfC helper script would work well. I think you should at least create an API for your project, in case anyone would want to also integrate this into a new pages patrol application. Working on the actual AfC helper script would be a waste, but an API would work well. I would support this. If this is an ANN, then I'll support it if there's some sort of measurement of confidence, not a binary decision. If you base it on Google News results, Google Scholar, or any other formula that takes into account specific factors, then you'd need to show the reasoning of the decision. For example, if an article is about a recent event, seems very notable to an untrained eye, but it turns out that there are no Google News results, I'd like this application to tell me that: "Hey, this article doesn't seem notable because there are no Google News results, and it seemed to have taken place recently". That is a big red flag, and one that this tool would hopefully alert me to. Chess (talk) 03:16, 18 September 2014 (UTC)
  • Sounds like a worthy project. Since the proposed deliverable is a tool that helps reviewers, the tool should give more than a binary or one-dimensional assessment. For instance it should highlight which sources in the article are reliable. It could try to determine and indicate to reviewers the subject area so they know which notability criterion to apply. Teaching a machine to do stuff like this will likely require more than binary pass/fail notability information from trainers. Kvng (talk) 02:17, 19 September 2014 (UTC)
  • Interesting, if ambitious, idea. As a long-time new page patroller, frequent user (and previously beta-tester) of the new pages feed tool in en.wiki, I think there is a genuine possibility that this tool could be useful and also act as a test bed for additional semi-automated means of screening our ever-growing queues of draft and new articles. I will post some additional thoughts on the talk page. VQuakr (talk) 02:59, 19 September 2014 (UTC)
  • This is a lot more complex than just showing some numbers for recall and precision; this can end up as a tool that changes the deletion process to support its own notion of what is possible to delete. If a tool proposes positive changes it does not matter so much if it spins the direction to the left or right, but when a tool is geared towards negative changes it can be extremely dangerous if it starts to optimize its own decisions towards deletions. That will happen if the tool uses machine learning: it will try to optimize for deletion as its outcome. Deletion processes should be rooted in firm rules, not in continuous machine learning. A system could, however, be used for learning which articles could survive a deletion process driven by those rules, but then it drives positive changes. Note that non-continuous machine learning also has problems, as it hides the underlying rules of its decisions. This is a general problem with nearly all types of machine learning. I can endorse development and testing, but not setting it up as part of the production process unless I have a lot more information about it. — Jeblad 10:58, 21 September 2014 (UTC)
  • Jeblad, you raise some very important points. The possibility of a machine-learned system being biased towards what gets deleted is something we've thought about extensively. For that reason, we do not plan to use deleted articles to train the classifier. We will hopefully get subject-matter experts to make careful judgments on whether articles are notable or not, and use that to train. We are also not aiming this tool specifically at deletion (what you refer to as 'negative changes'); rather, we hope that it can be used at various points in the article creation/improvement workflow to help users make positive decisions as well. Currently, articles are often deleted or rejected, even though potentially notable, because they are missing sources and/or are low quality. We hope that this tool will help reviewers see the potential in articles on notable topics and allow them to develop, rather than rejecting them out of hand. Bluma.Gelley (talk) 12:14, 21 September 2014 (UTC)
To clarify what I wrote: it is about what happens when the errors the machine learning introduces start to pile up, and how they will erode previous knowledge. In a vector space this will look like a slow blurring of the features, and if the surface in the vector space is the defining limit then you get a constantly evolving notability.
Notability in Wikipedia is about the deletion process, but what you now write seems to be more about quality processes in general. I need more info before I can endorse this. — Jeblad 06:11, 22 September 2014 (UTC)
Notability isn't just used in the deletion process. One of the main quality processes we're thinking about is the en:WP:AfC process; this reviews articles from the en:WP:Article wizard (including those written by IP users & newcomers). Does that make sense? I can point you to a paper we wrote for OpenSym about AfC, and Aaron's slides. One of our findings was that notability decisions are really difficult for reviewers to make because they're very subject-specific and require significant judgement. But AfC articles are an assortment ("On the same day, a reviewer might consider articles on a World War I soldier, a children’s TV special from the 1980’s, a South African band, and an Indian village."). Does that make the motivation more clear? Jodi.a.schneider (talk) 16:05, 24 September 2014 (UTC)
  • You do know that Notability isn't an issue at speedy deletion, the test for A7 being no credible assertion of importance or significance? Otherwise I like the idea of trying this, but I would be very uncomfortable with a tool that was only 75% confident when it marked an article as probably meriting deletion. Better if this goes ahead to identify some articles as almost certainly meriting certain deletion tags, some as almost certainly needing to be marked as patrolled, and another group as needing human review, and that in my view should include anything where the bot is >5% unsure. Where this would be really useful would be in highlighting probable G10 candidates and bringing them to the attention of patrollers and admins; a few simple rules such as the inclusion of certain phrases or having a previous article deleted G10 should make a really useful difference. Otherwise the presence of references is only relevant to one deletion criterion - BLPprod - unless, that is, the tool automates the level of mistakes we already see? WereSpielChequers (talk) 19:20, 22 September 2014 (UTC)
    Hey WereSpielChequers, glad to see you commenting here. The goal isn't to mark pages as probably meriting deletion (for A7 or any other reason). The main goal is to help en:WP:AfC reviewers sift through the backlog of draft articles, and secondarily to speed up human review at en:WP:NPP by identifying how likely something is to be notable. Based on previous comments, we will include confidence scores.
  • Bluma.Gelley's the ML expert -- I'll let her answer about whether it's feasible to autodetect probable attack pages (Wikipedia:Criteria_for_speedy_deletion#G10) within the scope of this project. Jodi.a.schneider (talk) 16:35, 24 September 2014 (UTC)
    I think I see what you are trying to do, but I remain nervous that the deletionists will just use this as a way to speed up deletion tagging of articles. I foresee people saying "don't blame me, when I tagged it there was only x% chance of it being notable". Would it be possible to build in something that gave contraindications, for example marking anything less than 24 hours old as either notable or "too early to tell"? WereSpielChequers (talk) 13:53, 26 September 2014 (UTC)
    I definitely understand your concerns; there is always that possibility. However, we do feel strongly that in many cases, the problem is that the article is notable, but its notability is not visible on a superficial review. The point of this project is to combine all the external information not necessarily easily available to reviewers/patrollers, so that they can have much more information when they make their decision, rather than just relying on the text of the article and whatever they can find themselves, if they bother. We hope this will make it easier to keep, rather than easier to delete.

Also, while it may be a good check on the power of this scorer to build in some contraindications, I don't think that we should do that. That would mean making policy decisions (e.g. anything less than 24 hours old cannot be flagged as not notable) in what should be an objective score. Whoever ends up building actual tools to make use of the scores should be the ones to build in such conditions. Bluma.Gelley (talk) 06:49, 28 September 2014 (UTC)

  • It may take some time for this tool to meet reliability standards for reviewers, but I think it has a lot of promise and I certainly would be very interested in testing it out. In principle, I think tools that can help guide decision-making (but not replace it) are very helpful for getting editors interested in a very time-consuming and effortful task. I JethroBT (talk) 22:29, 29 September 2014 (UTC)
  • As a long-term wikipedian and frequent AfD and AfC participant with a PhD in Comp Sci who has published in machine learning I Support this project. I have some concrete suggestions which I will post on the talk page. Stuartyeates (talk) 20:30, 1 October 2014 (UTC)

Oppose

  • I strongly oppose this idea. It will simply cause more potential articles to be deleted, this time by a bot. Editors who enjoy deleting other editors' work will point to this bot as proof that the article is not notable. Terrible idea on so many levels. Walterruss (talk) 07:45, 30 September 2014 (UTC)