User talk:West.andrew.g

Ping List for STiki

If you are interested in being contacted when STiki becomes available, please sign below. Thanks, West.andrew.g (talk) 05:36, 22 March 2010 (UTC)[reply]

  1. Hamtechperson 20:35, 26 February 2010 (UTC)[reply]
  2. tetraedycal, tetraedycal 23:16, 8 March 2010 (UTC)[reply]
  3. Ottawa4ever (talk) 10:03, 11 March 2010 (UTC)[reply]
  4. Mlpearc MESSAGE 19:19, 15 March 2010 (UTC)[reply]
  5. Avicennasis @ 15:54, 19 March 2010 (UTC)[reply]

Vandalism patrol

I don't have a handy barnstar, but if I did I'd give you one. Nice work. We need a lot more, and if you're developing a new tool, all the better. See you around.

Can the software issue warnings?

Is it possible to use the code to issue a warning to the person who made a disruptive edit, as well as doing the revert? I must say I do like how it goes deeper into revision feeds than Huggle does, so it's possible to catch older incidents that were missed in the first pass. Thanks and happy editing, Ottawa4ever (talk) 10:33, 21 March 2010 (UTC)[reply]

Hi Ottawa4ever. This is a feature I am planning to implement. Should I append or prepend the warning template onto the user talk page? Which template is considered standard? Thanks, West.andrew.g (talk) 15:54, 21 March 2010 (UTC)[reply]
(wordy response follows) It's entirely up to you where you go with warnings; each vandal fighter is a bit different in their reverting preferences. A list of typical warnings for disruptive editing is located at Wikipedia:VANDALISM#Warnings (there are more, though these are the typical, often-used ones). Personally (and this is just myself; others may differ), I have favoured the feature in Huggle where it gives the user the option to (or not to) leave a warning on the disruptive editor's talk page after you have done the revert (it also has the option of revert-and-warn). But a problem I've noticed in Huggle is that if you select revert-and-warn, it can often fail to revert and just warn, causing trouble. So I would think warning on the user's talk page after the revert is safest. The key to the warnings is that they allow the editing behaviour of a user account to be tracked and, if necessary, build a case for reporting at AIV (in hopes of preventing further disruption from disruptive editors). So in the first case where someone is warned, a level-1 warning is issued; in the second case of disruption, a level 2; and so forth up to level 4. Beyond level 4, reporting at WP:AIV. I think older versions of Huggle allowed you to specify which level, but the newer versions simply do this automatically. Anyway, let me know if you need any clarifications (I've tried to be broad here, and you likely already knew this stuff). Again, it's a very intriguing piece of software that you've written. I especially think it will be quite useful for catching the sneaky and overlooked disruptive editing. I'll probably give it a bigger test in the coming week. Ottawa4ever (talk) 17:01, 21 March 2010 (UTC)[reply]
Hi again. I've made some changes so that a "warn user" feature (checkbox) is now available as part of my GUI. I've decided to use warn level 2 by default, since my tool has human confirmation of vandalism and is not a bot. I think the incrementing use of warn-levels is very intriguing and useful -- though it will take me a bit more time to implement something along these lines (I assume it will involve parsing the existing User-Talk page, to see if any warn templates are already present?). If the reversion fails to go through for any reason (intermediate edit), then the warning will NOT be left. In the future, I will consider adding support for variable warn-templates. Thanks again for your interest. West.andrew.g (talk) 06:45, 22 March 2010 (UTC)[reply]
Actually, I took the time today to implement the incrementing user-warnings (now available for download at the STiki page). I think it is a slightly imperfect art: Twinkle, Huggle, and all the home-brewed strategies seem to have minor formatting differences and it is ultimately impossible to compensate for all of them -- but my strategy now seems solid and quite encompassing. If someone vandalizes with a recent uw-4, they are reported on the AIV page. I don't make warnings if a vandalism incident occurred long in the past. Warnings only take place if the revert succeeds. Again, thanks for your test and I look forward to your feedback in the future. West.andrew.g (talk) 04:34, 25 March 2010 (UTC)[reply]
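(Aside for tool-builders: a minimal sketch of the level-incrementing idea described above -- not STiki's actual source. It assumes, as Twinkle- and Huggle-style warnings generally do, that a substituted {{uw-...}} template leaves its name recoverable in the talk-page wikitext; every tool formats its warnings a little differently, so the pattern is deliberately loose, and recency checks are omitted.)

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Sketch: pick the next warning level by scanning a user-talk
    // page for existing uw-style templates (assumed format).
    public class WarnLevel {

        // Matches names like "uw-vandalism2"; loose on purpose.
        private static final Pattern UW =
            Pattern.compile("uw-(?:vandalism|test|delete)(\\d)");

        // Returns the next level (1-4), or 5 to signal "report to
        // AIV". Real logic would also check warning timestamps,
        // since stale warnings shouldn't escalate.
        public static int nextLevel(String talkPageWikitext) {
            int highest = 0;
            Matcher m = UW.matcher(talkPageWikitext);
            while (m.find())
                highest = Math.max(highest, Integer.parseInt(m.group(1)));
            return Math.min(highest + 1, 5);
        }
    }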
I intend to give it a good long test run today. I'll report my experiences on the STiki talk page. Typically my preference is not to warn an IP if the edit is older. This is mostly because IPs rotate, and it's pretty rough when someone gets a warning saying they vandalized when they didn't and someone else did. So unless it's a fresh edit, I'll pass on warning (unless it's severely against policy, in which case my warning would be mostly custom to the editor). I think this tool has a distinct advantage in that it digs into older edits that wouldn't so much be in the Huggle queue, which, as you said, complements the use and doesn't compete. Looking good so far :)... talk soon, Ottawa4ever (talk) 08:52, 25 March 2010 (UTC)[reply]
OK, I gave the software a bit of glowing feedback, but also suggested some fixes; please have a look on the wiki talk page of STiki. It handles very well -- you should be super proud of the software. Ottawa4ever (talk) 15:17, 25 March 2010 (UTC)[reply]


David Cross

It is relatively obnoxious to receive a message from some kid who has sequestered himself in one specific area of study, and thus acts and admonishes before he actually knows. Being a computer scientist is not tantamount to being a polymath, so you should refrain from your knee-jerk responses and admonishments until you know more about a particular topic, in this instance David Cross. This is a common problem amongst the technically talented: the belief that they are more gifted than they are (which is absurd when you consider the paltry number of direct competitors in your specific area of concentration), which, as is the case now, often materializes as this quasi-authoritative B.S., at your delusional whim. It is a cute little program that you have, but as it exists now (and apparently you have been beta testing it on actual information/articles, when it's clear that it has limitations) it is only as viable as the breadth of knowledge of the user, which means that your program is trying to operate from the midst of one of the main problems with regard to Wikipedia, namely subjective limitation/skewing. As for the David Cross article, he frequently admits to only "fucking" Amber Tamblyn, as his book and interviews like this one [1] illustrate nicely. The word "fucking" is the word Mr. Cross chooses to use when describing his relationship with Ms. Tamblyn, so who are you to edit that out using your little program and no understanding of the topic in question, thereby rolling the edit back to a less accurate version, and then to send me a message admonishing me? Are you the self-appointed information police? Apparently, Russ Tamblyn, Amber's father, does not care that David Cross keeps telling everyone that he is "currently fucking Amber Tamblyn."

STiki doesn't flag vandalism automatically -- it requires humans to look at edits. Thus its limitations are more inconvenience than inaccuracy, but that's beside the point. I completely agree with your comment about subject skewing, but in this case I stand by my decision to revert your edit: You can't just go dropping the f-bomb in the middle of an article. There are more appropriate ways to paraphrase the same thing. Further, you could have quoted the word and provided a reference to the source you listed above. Had any of these criteria been met, I wouldn't have reverted you. I take no issue with the accuracy of your change, just how it was presented. West.andrew.g (talk) 16:09, 25 March 2010 (UTC)[reply]


Fair enough, Andrew. Thanks for your response, and my apologies with regard to my abrasiveness. Like you, I strive for objective truth and accuracy, even if this particular issue is somewhat unimportant, per se. —Preceding unsigned comment added by 69.114.38.15 (talk) 03:59, 26 March 2010 (UTC)[reply]

Researcher API rights

Andrew, you should get in touch with Erik Moeller at the Wikimedia Foundation. DarTar (talk) 10:42, 4 June 2010 (UTC)[reply]

STiki

I have noticed that if you miss a bit of vandalism on the page you can't go back a slide, so I thought you might be able to make forward and back buttons. Cheers and all the best. Let me know what you think of this idea, Gobbleswoggler (talk) 18:54, 7 June 2010 (UTC)[reply]

The back button is something on my to-do list. I feel like I notice a good bit of vandalism in the milliseconds after I press a classification button. This is especially prevalent when I'm moving very fast through edits. (I tend to move quickest using the keyboard shortcuts -- if you don't know about them: after you do a single mouse-based classification, you can use the "v", "p", and "i" keys to classify very quickly, without the hassle of moving the mouse. These are available in newer versions, which you likely have.) Thanks for your feedback, and I'll send you a message if/when this improvement gets implemented. Thanks, West.andrew.g (talk) 19:01, 7 June 2010 (UTC)[reply]
When do you think the back button will be completed? Gobbleswoggler (talk) 19:14, 7 June 2010 (UTC)[reply]
The button has been implemented. After some testing and documentation, I plan to upload a new version today. I'll post on your talk page when I do. West.andrew.g (talk) 15:58, 8 June 2010 (UTC)[reply]
The new version has been uploaded with the improvement (version 2010/6/8). I've tested it pretty thoroughly -- but let me know if you notice any strange behavior with respect to the "back" button. You can go back at most one edit, and cannot use the button to revisit an edit that was reverted. Secondly, I notice you use the "pass" button far more often than the "innocent" one. If you're pretty confident an edit is not vandalism, go ahead and classify it as "innocent" -- it helps maintain the edit queue and will make your user experience a little faster. Thanks, West.andrew.g (talk) 18:03, 8 June 2010 (UTC)[reply]
Just out of interest, how does STiki filter what slides to show, and how does it know one might contain vandalism? Also, I have an idea: how about starting to show registered users that may vandalize? Gobbleswoggler (talk) 18:12, 8 June 2010 (UTC)[reply]
I have published an academic paper describing the STiki philosophy. It is quite technical in nature. Briefly, it does machine learning over prior edits to identify the patterns (both in metadata, and more recently, natural-language) that are common among vandalism. This model is applied to new edits which produces a "vandalism score" that determines the order in which edits get shown (assuming no one else edits the page). Secondly, on the point of registered users. Analysis has shown such users are a very small part of the vandalism problem, so it is not a top priority. Indeed, I fear the inclusion of such edits may increase the false positive rate. Currently I am working on additional ML-features to improve the accuracy of the current system. Thanks, West.andrew.g (talk) 19:30, 8 June 2010 (UTC)[reply]
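(Illustration only: the paper has the real details, but the ordering idea amounts to a shared priority queue keyed on the classifier's score. All names below are hypothetical.)

    import java.util.Comparator;
    import java.util.PriorityQueue;

    // Sketch of a scored edit queue: the back-end enqueues each new
    // edit with its "vandalism score"; the GUI pops the most
    // suspicious edit still awaiting human review.
    public class EditQueue {

        record ScoredEdit(long rid, double vandalismScore) {}

        private final PriorityQueue<ScoredEdit> queue =
            new PriorityQueue<>(Comparator
                .comparingDouble(ScoredEdit::vandalismScore)
                .reversed());

        public void enqueue(long rid, double score) {
            queue.add(new ScoredEdit(rid, score));
        }

        // Returns null when no edits are pending.
        public ScoredEdit next() {
            return queue.poll();
        }
    }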

Re: Logs of Reversions

Hi Gurch, this is the author of STiki. We had some previous conversation about the relationship of our tools -- but I now come to you for a different reason. I was curious if you had logs of the RIDs for which your tool issued a particular warning. In particular, I would like to glean the edits your users classified as spam.

Of course, one could go searching through the UserTalk namespace and look for template additions and then parse out the 'diff/RID' in question. I thought you might have a quicker listing. The "edit summaries" left by Huggle seem a little generic, in that they don't provide RIDs or the reason for reversion, correct? Thanks, West.andrew.g (talk) 14:28, 13 June 2010 (UTC)[reply]

Hi Gurch. Thank you for your long and thoughtful comments. For convenience, I've tried to provide my response "in-line" below. West.andrew.g (talk) 06:42, 14 June 2010 (UTC)[reply]
No, I don't have any such logs. And there would be no feasible way to get them if I wanted them; Huggle clients work independently and do not communicate with any central server other than the wiki itself. Not to mention the privacy issues that would probably result in Huggle, and possibly myself, being banned from the project; there are a lot of privacy wonks here.
Unfortunately (or perhaps, fortunately?) my tool has yet to find the popularity that seems to bring out such controversy and inconvenience. Your handling of the many feature requests, bug reports, and the like on your user-talk page is certainly admirable. I indeed have a central server and hope this does not become an issue in the future. I have had the good fortune to secure a presentation at Wikimania '10 -- I hope this brings a larger user-base, but few issues.
Edit summaries do not include revision IDs because it is not particularly helpful -- and often infeasible -- to do so. The summary tells the reader at a glance that the revision was reverted, and when viewing the page history it is already clear which revisions were reverted because the author of the revision reverted to is identified. Nine-digit numbers are not very human-readable and when multiple revisions are reverted, listing all their IDs would quickly cause the summary to become too long.
Agreed. Certainly not useful for humans (especially in the rollback case) -- but of course tool authors like myself wouldn't mind their inclusion. :-)
Providing a reason for reversion is similarly problematic. Huggle has no concept of a reason for reversion, only for warnings. One reason is that if no reasonless reversion option were provided, users would overuse the vandalism reversion option and we would be left with many summaries claiming revisions to be vandalism when strictly speaking they weren't. Another reason is because of the dumb rules this project has that restrict what you supposedly can and can't do by certain technical means (rollback, undo) even when the end result is the same (effect of a revision undone) and the only difference is the resource demands on the client and server (and speed, of course). By restricting the concept of a reason to warnings users can revert edits they know are unacceptable, and then -- only if they desire to leave a warning -- they select an appropriate warning template, or leave their own message. In this way they can remove things like attempts to embed remote images with URLs in exactly the same manner as they'd revert anything else without administrators threatening them because they're "misusing rollback". In cases where users feel a detailed summary is required for reversion, Huggle already provides a mechanism for that.
Again, I agree with your reasons from a tool and community perspective. To a large extent, the type of warning issued speaks to the "nature" or "cause" of the reversion -- though it is safe to assume users will over-use the standard "vandalism" option.
If you are looking for a machine-generated log of "bad edits" to do some kind of machine-training on, as I suspect you are, you're going to run into trouble whichever way you go about it. This is not a problem that would be solved if I somehow had logs of all Huggle activity. (Also, I've tried such "training", and it works less well than the abuse filter, which is to say badly).
The abuse filter prevents many supposedly "bad" edits from even happening, so you might think the logs of that would be a good starting point, but there are both too many false positives and too many filters that do things like enforce guidelines for that to be of any use.
I don't play with the abuse filter. I know you are not terribly positive about it, either.
Next you might consider looking at page histories, identifying reverts and then inferring from those which revisions were bad. That has many problems. People make mistakes and revert things by mistake, then correct themselves by reverting their revert, which would leave you with two revisions that looked bad but weren't really. People sometimes revert their own, otherwise good, edits if they change their mind and decide to put something else there. People revert edits that are either good or suspicious but not in a vandalism sense during edit wars, often back and forth between two revisions both of which have problems that are content- or style-related, which would again be misleading. And of course, vandals revert good edits; they usually get reverted themselves, of course, but how do you (automatically) know which one is the vandal and which not?
It is not a fool-proof strategy, but this is largely the one that STiki applies. I identify rollbacks (via edit summaries), and then search back through the article history to find the offending edit(s). I do not consider cases where one rolls back to themselves. An offending edit isn't recorded if the rollback initiator doesn't have rollback rights -- so this, for the most part, avoids edit warring.
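(For the curious: a minimal sketch of this mining strategy, not STiki's actual source. The Revision record is hypothetical, real code would pull history from the MediaWiki API, and the rollback summary wording assumed below is the default one, which varies across wikis and versions.)

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Sketch of the corpus-mining strategy: spot a rollback by its
    // edit summary, then walk the page history (newest first) to
    // collect the consecutive offending edits it undid.
    public class RollbackMiner {

        record Revision(long rid, String user, String summary) {}

        // Assumes the default MediaWiki rollback summary, e.g.
        // "Reverted edits by [[Special:Contributions/X|X]] ...".
        static String revertedUser(String summary) {
            Matcher m = Pattern
                .compile("Reverted edits by \\[\\[Special:Contributions/([^|\\]]+)")
                .matcher(summary);
            return m.find() ? m.group(1) : null;
        }

        // history.get(0) is the rollback itself; everything after it
        // by the same (reverted) user is recorded as offending.
        static List<Long> offendingRids(List<Revision> history) {
            List<Long> bad = new ArrayList<>();
            String vandal = revertedUser(history.get(0).summary());
            if (vandal == null) return bad;
            for (Revision r : history.subList(1, history.size())) {
                if (!r.user().equals(vandal)) break; // reached a clean revision
                bad.add(r.rid());
            }
            return bad;
        }
    }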
Users without rollback -- including anonymous users -- account for a surprisingly large portion of (correct) vandalism reverts, so you are possibly missing out on those. (And conversely, edit warring can happen between rollback users too, unfortunately.)
The other option, which you seem to be going for, is warnings. This too has issues. Identifying which revision a Huggle warning was targeted at is simple because the revision ID is included in the URL in the warning message. However, multiple consecutive bad edits by a user will usually only result in one warning message, sometimes users will only revert and not leave a warning, if the user already has a final warning then Huggle will never leave a warning because it would be pointless, and most other patrolling tools do not identify the revision reverted in the warning message. And of course vandals will leave fake warning messages for legitimate users; yes, these are usually reverted when someone sees them, but vandals will also remove genuine warnings from user talk pages, so again you can't (automatically) distinguish the vandal and the legitimate user.
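(A sketch of what mining those warnings might look like -- hypothetical code, assuming the warning links the reverted revision with a standard index.php?diff=... URL, as Huggle's warnings of this era did.)

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Sketch: recover the targeted revision ID from the diff URL
    // embedded in a Huggle-style warning message.
    public class WarningRid {

        private static final Pattern DIFF =
            Pattern.compile("[?&]diff=(\\d+)");

        // Returns null if the message carries no diff link (e.g.,
        // warnings left by tools that omit the revision ID).
        static Long ridFromWarning(String warningWikitext) {
            Matcher m = DIFF.matcher(warningWikitext);
            return m.find() ? Long.parseLong(m.group(1)) : null;
        }
    }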
I don't play much outside of NS0 -- and would prefer not to go there. Enough said.
Possibly admirable but difficult to stick to if you want to gather information on problematic users, particularly as not only warnings but vandalism reports are located there.
You'd be correct about my efforts to build a spam corpus. From an academic perspective, my methods need not be flawless. So long as I have a set of RIDs which are primarily spam and another which are primarily ham, I can begin to make some property distinctions. My main concern is that most wiki-based detection methods don't represent a random sampling of spam. Spam which is detected via rollback just represents the "naive and immediately detected" attempts. What about that super clever editor who pulled some bait-and-switch tactic with an XLink and got it embedded in an article? If I had a way to find that, then I would have something!
My guess is that for the most part the only common ground you'd find between spam edits is that they added an external link. And despite what many of Wikipedia's more influential contributors like to think (ever tried adding an external link while logged out?), treating all external links as spam isn't helpful. If we're only looking at the content of a revision and any other data that can be derived from the wiki itself, I don't think there's anything further that can be done to detect spam specifically -- sure, new users are more likely to spam than established ones, but they're more likely to be vandals, and even more likely to be neither. You'd probably have more luck detecting spam by identifying external links added, then accessing those links and looking at the content, but then we're into territory that isn't really specific to wikis (and probably not that amenable to machine learning either).
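(As a sketch of that last suggestion -- hypothetical code, not anyone's tool: diff the external links of two revisions to isolate what an edit added, as the first step before fetching and classifying the link targets.)

    import java.util.HashSet;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Sketch: find the external links an edit added by comparing
    // the URLs present in the old and new revision wikitext.
    public class LinkDiff {

        private static final Pattern URL =
            Pattern.compile("https?://[^\\s\\]|}<>\"]+");

        static Set<String> links(String wikitext) {
            Set<String> out = new HashSet<>();
            Matcher m = URL.matcher(wikitext);
            while (m.find()) out.add(m.group());
            return out;
        }

        // Links in the new revision but not the old one.
        static Set<String> added(String oldText, String newText) {
            Set<String> diff = links(newText);
            diff.removeAll(links(oldText));
            return diff;
        }
    }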
The "super clever editor who pulled some bait-and-switch tactic with a XLink" replaced one external link on an article with another and nobody happened to notice it. The latter part of that is for the most part just luck. It's also something that's hard to detect automatically (reverting such an action would also appear to replace one link with another, as would fixing a dead link, as would trimming unnecessary query parameters, as would converting a plain link into a citation, and so forth). The difficulty from the point of view of the spammer is creating enough accounts / using enough IP addresses that they can do this enough times to get lucky and not have it noticed one time, and more likely by the time that's happened the link has ended up on the spam blacklist.
The wiki has a huge number of things designed to stop spam, most of which just get in the way and make things unpleasant for newcomers, all in the name of combating something that really isn't that much of a problem. I'm not entirely convinced that more are needed.
  • MediaWiki:Spam-blacklist. Not only can nobody add any links on that list, but if a link is added to that list when it's already on a page, now nobody can edit that page at all. And it uses regular expressions, so you can guarantee someone with a poor understanding of them will break it now and then. The list is what can only be described as freaking enormous; for most of the links on there, chances are nobody knows when or why they were added, and indeed the websites they pointed to probably don't even exist any more. Because editing just isn't slow enough without running a few zillion regexes on every save.
  • As I previously said, trying to add external links as an anonymous user is automatically assumed to be malicious. This makes vandalism patrolling as an anonymous user pretty much impossible -- every time a vandal messes up content that happens to include an external link, you've got to answer a stupid captcha again. And if you want to link to a diff, or history page, or old revision, or something else on Wikipedia itself? Sorry, we still think you're a spambot, please answer this captcha.
  • User:XLinkBot. Logged in? Got a link to add? Not on the blacklist? Yay, you might think, until this bot decides to revert you and warn you about it. The bot's list of links to revert is almost as long as the real spam blacklist, and just as dubious.
  • Several "anti-spam" components of MediaWiki, some of which -- unlike the rest of MediaWiki and indeed pretty much everything powering Wikipedia -- are closed-source. With these MediaWiki will just flat out refuse to accept your edit.
  • External links policies drafted by the small number of vocal contributors that make most of the policies, that give administrators the power to remove pretty much any external link they like.
  • Even more insane external links policies that were rejected by the community but are still de facto in effect because of the consequences for any contributor who voices opposition to them.
Gurch (talk) 04:58, 14 June 2010 (UTC)[reply]

Stiki

I got another great idea. How about putting in a filter so you could enter, for example, a bad word and see what pages it is on? Then if that word is on a page it shouldn't be on, you can delete it. What do you think? Gobbleswoggler (talk) 18:23, 14 June 2010 (UTC)[reply]

Hi again Gobbleswoggler. This is something I will think about -- but it is also something a lot of other people are doing (including STiki, to a certain extent). First, ClueBot operates exclusively over simple regular expressions (i.e., bad words). Plenty of bad words do get through ClueBot, though (since it is a bot, its rules must be conservative to avoid false positives).
Second, there is the Wikipedia edit filter. This is not my area of expertise, but I am sure there are plenty of filter-rules along these lines.
Third, there is STiki. STiki counts the number of bad words added by an edit and uses this as a machine-learning feature (along with the several spatio-temporal ones). Part of the challenge here is what constitutes a "bad word". Obviously a stand-alone bad word counts. For example, "you are an ass" is trivial, but we can't expect vandals to use proper grammar. Instead they might write "youAREass" -- clearly profane -- but a substring match loose enough to catch that would also match the very innocent word "glass." What are your thoughts on this? I might be able to use my existing bad-word count data to create a revision filter, though -- just to give things a trial run.
Finally, I'll note that a surprising number of the "bad words" on Wikipedia are legitimate. Between song names, accurate quotations, and the like -- I am not sure this would have the great hit-rate you might expect. Another thought is that I could highlight bad words (using some color), making them easy to pick out when quickly patrolling.
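(A small demo of the "glass" problem described above -- illustrative only: a bare substring match flags innocent words, while a word-boundary match misses run-together vandalism.)

    import java.util.regex.Pattern;

    // Demo: neither a naive substring match nor a word-boundary
    // match handles both "glass" and "youAREass" correctly.
    public class BadWordDemo {
        public static void main(String[] args) {
            Pattern naive   = Pattern.compile("ass");
            Pattern bounded = Pattern.compile("\\bass\\b");

            System.out.println(naive.matcher("a glass of water").find());   // true  (false positive)
            System.out.println(bounded.matcher("a glass of water").find()); // false (correct)
            System.out.println(naive.matcher("youAREass").find());          // true  (correct)
            System.out.println(bounded.matcher("youAREass").find());        // false (missed vandalism)
        }
    }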
How can I download/get this edit filter? Gobbleswoggler (talk) 19:12, 14 June 2010 (UTC)[reply]
It has yet to be implemented -- though I have most of the data in the back-end. I want to give this some thought, and would be interested to hear how you think the scoring and ranking of edits should proceed. Thanks, West.andrew.g (talk) 19:33, 14 June 2010 (UTC)[reply]
Have you thought about adding a subject button, so you could check just certain areas of Wikipedia, e.g. football or talk pages? What do you think? Gobbleswoggler (talk) 19:45, 14 June 2010 (UTC)[reply]
I think he's talking about searching existing page content, not filtering recent changes. For example, searching for all pages with the word "crap" on them and then looking for any unwanted instances. The problem with doing this is that Wikipedia's search results are not current but always a few days old, so most of the unwanted instances that turn up in the results will already have been removed, and even if you find and remove some yourself, they won't disappear from the search results, nor will new ones show up, until a few days later. Google and other search engines have the same problem. Gurch (talk) 06:26, 15 June 2010 (UTC)[reply]

Talkback

Hello, West.andrew.g. You have new messages at Cit helper's talk page.
Message added 21:09, 16 June 2010 (UTC). You can remove this notice at any time by removing the {{Talkback}} or {{Tb}} template.[reply]

I'm currently writing a suggestion, please check back at my talk page soon... Cit helper (talk) 21:09, 16 June 2010 (UTC)[reply]

You are now a Reviewer

Hello. Your account has been granted the "reviewer" userright, allowing you to review other users' edits on certain flagged pages. Pending changes, also known as flagged protection, is currently undergoing a two-month trial scheduled to end 15 August 2010.

Reviewers can review edits made by users who are not autoconfirmed to articles placed under pending changes. Pending changes is applied to only a small number of articles, similarly to how semi-protection is applied but in a more controlled way for the trial. The list of articles with pending changes awaiting review is located at Special:OldReviewedPages.

When reviewing, edits should be accepted if they are not obvious vandalism or BLP violations, and not clearly problematic in light of the reason given for protection (see Wikipedia:Reviewing process). More detailed documentation and guidelines can be found here.

If you do not want this userright, you may ask any administrator to remove it for you at any time. Courcelles (talk) 01:08, 18 June 2010 (UTC) [reply]

Stiki update

Hi, Gobbleswoggler here yet again. I noticed you have published a new update of STiki, but I can't tell what's been added or changed! Gobbleswoggler (talk) 15:39, 18 June 2010 (UTC)[reply]

Hi Gobbleswoggler. In this particular update there were changes on the back-end (which determines the edits that get displayed), not to the client-side GUI application (which you use). From your perspective, nothing should change (except maybe seeing more vandalism). The back-end change reflected an advance in how "dirty words" are scored, partially from your suggestions. Note that in the *.ZIP file of every distribution there is a file called CHANGELOG.txt -- this will provide you a description of the changes. Thanks, West.andrew.g (talk) 20:32, 18 June 2010 (UTC)[reply]

STiki

What has changed in STiki this time? Gobbleswoggler (talk) 16:40, 28 June 2010 (UTC)[reply]

See my note above about the CHANGELOG.txt file in the ZIP distribution. The last two updates have been minor. One affects the back-end processing, and the other is a tool for research use. Nothing too exciting on the client-side. However, I am working on integrating the 'rollback' function into STiki (natively for those who have it, and emulated in software for those who don't). That should be helpful for cases of multi-edit vandalism.

Misdirected Testing?

Checkuser results suggest that one of your linkspam-related software tests may inadvertently be pointing to the English Wikipedia rather than test wiki. Please check your settings & adjust accordingly. Thanks, --Versageek 03:08, 14 July 2010 (UTC)[reply]