Grants talk:IEG/Automated Notability Detection


As a service

How will we provide this notability score to other tools? I propose a web API (as a service) and a simple human UI hosted in WMFLabs. --EpochFail (talk) 14:57, 10 September 2014 (UTC)
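To make this concrete, here is a minimal sketch of what such a web service might look like, written with Flask; the route name, the JSON fields, and the score_draft() stub are illustrative assumptions rather than a settled design, and a real deployment on WMFLabs would call the trained classifier instead of returning a placeholder.

    # Minimal sketch of a "notability score as a service" endpoint (hypothetical).
    # The route, response fields, and score_draft() stub are illustrative only.
    from flask import Flask, jsonify

    app = Flask(__name__)

    def score_draft(title):
        # Placeholder: a real implementation would fetch the draft, extract
        # features, and return the classifier's probability of notability.
        return 0.5

    @app.route("/notability/<path:title>")
    def notability(title):
        return jsonify({"title": title, "score": score_draft(title)})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8000)

A client tool (or the human UI) would then just issue an HTTP GET, e.g. /notability/Draft:Some_Article, and read the score from the JSON response.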

Should we plan to hire someone to implement this, and add it to the grant? Or keep it as planned, for a later stage? Bluma.Gelley (talk) 12:20, 17 September 2014 (UTC)
Yes, I think it should be an outcome of the grant, and should be in the budget, with plans for recruiting the person who will do this. Jodi.a.schneider (talk) 15:31, 24 September 2014 (UTC)

Which kind of reviewers?

It would be good to explain which kind of reviewers you think this would be helpful for: AfC, NPP, people trolling categories for cruft, ... All of the above? Jodi.a.schneider (talk) 18:38, 10 September 2014 (UTC)

Proof-of-concept

Thanks for joining today's IEG Hangout. I am looking forward to seeing this proposal develop! In particular, I'll be curious to see how you decide to move forward with any ideas for putting a lightweight prototype in the hands of real Wikipedian testers as part of this project - ultimately, I'd love to see something heading towards practical application of the classifier, with some measures of success indicating that the classifier is not only accurate, but useful for active community members. Cheers, Siko (WMF) (talk) 23:51, 16 September 2014 (UTC)

Which WikiProjects?

Which WikiProjects would be the best to ask for help in creating the training set? Shall we request feedback from them now? One list of WikiProjects is at en:Category:Wikipedia_WikiProjects or there's a categorized list here. I guess it should be a good subset of content-based ones? Looking at what's declined at AfC also might give ideas of the topics of importance... Or else Wikipedia:Notability and its many, many subguidelines... Let me know how I can help! Jodi.a.schneider (talk) 20:42, 17 September 2014 (UTC)

Possibly contacting the Articles for Creation and NPP people for feedback on how they might find such a tool useful. --Kudpung (talk) 16:09, 18 September 2014 (UTC)
Good thought, Kudpung -- they've both been notified. This project would rely on having some initial data -- about notability (or not) of articles on some given topics. That would come from experts from certain WikiProjects -- I think it would be good to establish which ones, and get some feedback/buy-in. Jodi.a.schneider (talk) 09:14, 19 September 2014 (UTC)
Suggest AfC and NPP. The subject-matter-specific WikiProjects are probably too narrowly focused to assess a general-use tool, unless this tool is intended for a large subset of articles (such as historical events or biographies). VQuakr (talk) 04:07, 20 September 2014 (UTC)
I guess the question is whether this is going to be based on en:WP:GNG or whether it's going to be based on subject-specific notability guidelines. Thoughts? Jodi.a.schneider (talk) 15:32, 24 September 2014 (UTC)

Functionality suggestions

I am coming from newpages patrol on en.wiki, where I normally use the new pages feed (NPF) tool [1]. My thoughts are in the context of that tool, and may or may not apply well to other similar applications. That tool currently lists new pages, along with pertinent information such as the creator's edit count, the first sentence or so of article content, and whether it contains categories and/or references. A tool that identified topics that might not meet the General Notability Guideline would complement this information. Given the terse nature of the NPF interface, I think the automated notability assessment would best be indicated by a single scalar score, maybe a 0-100% scale, possibly color-coded red to green. It could be clickable to open a secondary screen with additional analytical information, with a bug/failure reporting link (at least in alpha) for false negatives and false positives. It would be nice if the secondary screen also linked potentially useful (but unused in the article) sources. Triggers for lowering the notability score could include presence in social media but not reliable sources, use of only primary and vanity sources in the article without better sourcing available, and close paraphrase detection from corporate websites and/or social media.

Ironholds might have thoughts on how to test modifications integrated into the NPF. Good luck! VQuakr (talk) 03:19, 19 September 2014 (UTC)
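To make the scalar score and trigger ideas above concrete, here is a rough sketch of how heuristic triggers could be folded into a single 0-100, color-coded value for the feed; the trigger names and penalty weights are invented for the example and are not proposed values.

    # Illustrative only: combine heuristic "triggers" into one 0-100 score for
    # display in the New Pages Feed. Penalty weights are arbitrary examples.
    def notability_score(triggers):
        penalties = {
            "social_media_only": 40,       # covered in social media but not in reliable sources
            "primary_vanity_sources": 30,  # only primary/vanity sources cited
            "close_paraphrase": 20,        # close paraphrase of corporate/social-media text
        }
        score = 100
        for name, fired in triggers.items():
            if fired:
                score -= penalties.get(name, 0)
        return max(score, 0)

    def score_color(score):
        # Red/amber/green bucketing for a terse feed display.
        if score < 40:
            return "red"
        if score < 70:
            return "amber"
        return "green"

    flags = {"social_media_only": True, "primary_vanity_sources": False, "close_paraphrase": True}
    print(notability_score(flags), score_color(notability_score(flags)))  # 40 amber

A classifier-based version would replace the hand-set penalties with a learned probability, but the red/amber/green presentation in the feed could stay the same.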

Validity of results, etc

I endorsed this at first, but then I had second thoughts after reading the reply. It seems like the initial description isn't what the contributors of the proposal are planning, and that makes me confused. Then I read the research paper [2] and got even more confused.

If you train any kind of classifier through supervised learning you need two training sets, one that represents your wanted class and one for the unwanted class. Both must be representative. In this case it seems like the wanted ones are taken from older articles that have evolved past the initial state of the unwanted ones. When you learn from two classes that are at two different stages in their lifespan, they will have inherent differences that are not comparable. That makes the classification task much easier, and will give precision and recall that are much better than in a real-world case.

Then you have the problem of how you present this information to the users and how the algorithm itself will change the process outcome. If it does change the outcome, will it then learn the new outcome and slowly game the process over time? And if the creators of new articles can observe the effect, will they be able to adapt their new articles to get a higher score and bypass the algorithm? Reading the paper it seems like the proposal is to make this a reviewer-only tool for notability (really deletion), and then it won't contribute to overall quality, but from the reply to my endorsement it seems like it is a quality tool for everyone. How the community would interact with such a tool is very important, especially whether they would take the output from the classifier as fact, but the discussion about this in "8. Limitations and concerns" in the paper is somewhat short and builds on a number of unsubstantiated hypotheses.

A very rough idea of a general tool like this would be to add a "feature export" option to AbuseFilter (yeah, it should be renamed EditFilter) and then make a classifier that can use the export to train against some logged value. The obvious problem here is how to identify a positive outcome. One option is to log both outcomes from a feature export. Until there is sufficient data it should be possible to manually make training and test sets. Often initial sets can be made from the testing feature in AbuseFilter. It should be possible to reevaluate the classifier against the old test set as it evolves, and how it evolves on precision and recall should be logged and presented graphically.

A tool like the one I have sketched would be more general and could be used for a lot more than just deletion requests. It would also be visible if it starts to drift in an unwanted direction over time.

In short, I think the proposal needs more discussion than the current timeframe allows. Perhaps it could be run more as a feasibility study. — Jeblad 14:18, 22 September 2014 (UTC)
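To illustrate the retrain-and-monitor loop sketched in the comment above: the AbuseFilter "feature export" does not exist today, so the exported batches below are synthetic stand-ins, and the whole thing is only a sketch of how precision and recall against a fixed held-out test set could be logged as the classifier is retrained.

    # Sketch only: retrain a classifier on periodically "exported" feature rows
    # and track precision/recall against a fixed held-out test set over time.
    # fake_export() is a synthetic stand-in for an AbuseFilter-style export,
    # which does not currently exist.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_score, recall_score

    rng = np.random.default_rng(0)

    def fake_export(n):
        # One batch of (features, logged outcome) rows.
        X = rng.normal(size=(n, 5))
        y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)
        return X, y

    # Fixed held-out test set, labelled manually at the start.
    X_test, y_test = fake_export(500)

    X_train = np.empty((0, 5))
    y_train = np.empty(0, dtype=int)
    for batch in range(5):
        X_new, y_new = fake_export(200)              # new batch of logged data
        X_train = np.vstack([X_train, X_new])
        y_train = np.concatenate([y_train, y_new])
        clf = LogisticRegression().fit(X_train, y_train)
        pred = clf.predict(X_test)
        print(f"batch {batch}: precision={precision_score(y_test, pred):.2f} "
              f"recall={recall_score(y_test, pred):.2f}")

Plotting those per-batch numbers would give exactly the graphical precision/recall history suggested above.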

Hi Jeblad, thanks for the detailed thoughts. This is a new project, distinct from the paper you reference above (the methodology is similar, as I understand it, but I'm not the ML expert; that's Bluma.Gelley).
The plan is to take a *new* data set (we've been discussing exactly how we'll select and collect that); this will be from the last year (unlike the dataset used in the other paper). Further, we plan to look at each article t seconds after it was started (for some t, not yet decided). Does that address your concern about the classifier? We'd really welcome any thoughts on collecting the training set. I'll look more at the rest of your comment later. Jodi.a.schneider (talk) 15:41, 24 September 2014 (UTC)
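For collecting such snapshots, here is a sketch of pulling an article's last revision made no later than t seconds after creation via the MediaWiki API; the title and the value of t are placeholders, and error handling is omitted.

    # Sketch, not production code: fetch the revision of a page as it stood
    # t seconds after creation. Title and t are placeholders.
    import datetime
    import requests

    API = "https://en.wikipedia.org/w/api.php"
    T_SECONDS = 3600  # "t" is not yet decided; one hour is just an example

    def revision_at_creation_plus_t(title, t_seconds=T_SECONDS):
        # The first revision gives the creation timestamp.
        params = {
            "action": "query", "format": "json", "prop": "revisions",
            "titles": title, "rvdir": "newer", "rvlimit": 1,
            "rvprop": "ids|timestamp",
        }
        page = next(iter(requests.get(API, params=params).json()["query"]["pages"].values()))
        created = datetime.datetime.strptime(page["revisions"][0]["timestamp"],
                                             "%Y-%m-%dT%H:%M:%SZ")
        cutoff = created + datetime.timedelta(seconds=t_seconds)

        # Last revision made on or before the cutoff (rvdir=newer lists oldest first).
        params.update({"rvlimit": "max", "rvend": cutoff.strftime("%Y-%m-%dT%H:%M:%SZ")})
        page = next(iter(requests.get(API, params=params).json()["query"]["pages"].values()))
        return page["revisions"][-1]["revid"]

    print(revision_at_creation_plus_t("Example"))

The returned revision id can then be fed to whatever feature extraction the classifier uses, so that both the accepted and the declined classes are sampled at the same article age.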

Reminder to finalize your proposal by September 30!

Hi there,

  • Once you're ready to submit your proposal for review, please update the status (|status= in your page's Probox markup) from DRAFT to PROPOSED, as the deadline for proposals in this round is September 30th.
  • Let us know here if you've got any questions or need help finalizing your submission.

Cheers, Siko (WMF) (talk) 20:55, 26 September 2014 (UTC)

Suggestions

[Note: I was canvassed (w:en:WP:CANVASS) to come here by Jodi Schneider, based on a mailing list post related to my work at w:en:Wikipedia:WikiProject_New_Zealand/Requested_articles/New_Zealand_academic_biographies; I'm assuming that this is kosher in the grants process.]

  1. The operative parts of the proposal / work need to be rewritten from "notability" to "evidence of notability". Without a huge knowledge of the real world the algorithm is going to be unable to judge notability, but judging the evidence of notability in the draft explicitly constrains the scope to the examination of the draft alone. The term 'evidence' is used frequently in w:en:WP:GNG and commonly in deletion rationales. On en.wiki 'notability' is the outcome of a consensus decision-making activity. 'Evidence of notability' is anything that feeds into that consensus decision.
  2. I suggest that rather than solving the (relatively hard) binary problem of notability, a set of overlapping problems of increasing difficulty be solved. Insertion of w:en:Template:Peacock templates should be pretty trivial, insertion of w:en:Template:Advert slightly harder, etc. See w:en:Category:Wikipedia_articles_with_style_issues and w:en:Category:Wikipedia_article_cleanup for suggestions. Even if the notability problem is not solved (or is solved in a manner the community finds unacceptable) the tool will still be useful.
  3. I suggest that the tool be pitched not at reviewers, but at article creators. This would allow better, more immediate feedback to more users, enabling them to grow as editors and improve their drafts in the short term, and would remove the delay of a reviewing queue. Automated tagging of a draft after it had been idle (unedited) for ~24 hours might be suitable (but would require a bot approval process).
  4. I suggest that the domains used in URLs in references are likely to be a useful attribute for machine learning (see the sketch below).
  5. I suggest that a manually labelled corpus is already available by looking at the articles that have been declined in the past. Articles which have been declined, then improved and accepted, will be particularly informative, since these form matched pairs: one version without evidence of notability and one with.
  6. I volunteer to help in the manual labelling.
  7. I volunteer to help with other aspects (I'm a coder with a PhD in a relevant field).

Stuartyeates (talk) 21:21, 1 October 2014 (UTC)
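Regarding point 4 in the list above, here is an illustrative sketch of turning the domains of reference URLs in a draft's wikitext into simple bag-of-domains features; the sample wikitext and the regular expression are only examples of the idea, not a tested extraction pipeline.

    # Illustrative sketch: count the domains of reference URLs in wikitext so
    # they can be used as features for a classifier.
    import re
    from collections import Counter
    from urllib.parse import urlparse

    URL_RE = re.compile(r'https?://[^\s|\]<>"}]+')

    def domain_features(wikitext):
        domains = []
        for url in URL_RE.findall(wikitext):
            host = urlparse(url).netloc.lower()
            if host.startswith("www."):
                host = host[4:]
            domains.append(host)
        return Counter(domains)

    sample = ('<ref>[https://www.nature.com/articles/xyz Paper]</ref> '
              '<ref>https://blogspot.example.com/selfpromo</ref> '
              '<ref>[https://doi.org/10.1000/182 DOI]</ref>')
    print(domain_features(sample))

Domains like doi.org or major news outlets would presumably count towards evidence of notability, while self-published hosts would count against; the classifier would learn those weights from the labelled corpus described in point 5.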

It appears that I misread your proposal. You're trying to measure the "absolute notability of topic" rather than the "evidence of notability in the article/draft"? That's altogether a separate kettle of fish and I'll have to think about suggestions for that. Stuartyeates (talk) 22:24, 1 October 2014 (UTC)