Talk:CopyPatrol: Difference between revisions

From Meta, a Wikimedia project coordination wiki
(44 intermediate revisions by 12 users not shown)
}}
}}


== New backend coming soon ==
== New CopyPatrol is live ==


I'm thrilled to announce the new version of CopyPatrol is now live at https://copypatrol.wmcloud.org. All existing links should redirect to the right place. Please join me in thanking @[[User:JJMC89|JJMC89]] for his tremendous help in this effort. He probably deserves most of the credit here, but certainly ''all'' of it for the backend that he completely rewrote from scratch. The new backend should be much more resilient, with the sporadic downtime that we occasionally see hopefully being a thing of the past. In addition, the new frontend offers a number of new features:
{{tracked|T333724}}
* Significant performance improvements
Hello all! I'm here to inform you a new backend (bot) that powers CopyPatrol will soon be updated. I've been working with @[[User:JJMC89|JJMC89]] on this for quite some time. We now have a demo ready, and are asking you all to see how it fares alongside the legacy feed powered by @[[User:EranBot|EranBot]].
* Edit summaries, change tags, and diff sizes
* "Undo" or "revdel" links for users who have the requisite permissions


One notable change you might see is that the iThenticate reports no longer include the crawl date. Unfortunately this is outside our control. The Turnitin product team has been made aware of this feature request, so we hope it will eventually be reinstated.
'''You can check out the new feed on our staging instance at [[toolforge:plagiabot]]'''. Feel free to test out saving reviews there for the time being, as it is using a test database, but note the production CopyPatrol should still be tended to as well.


Please let me or JJMC89 know of any issues you see. At the time of writing, the backfill script is still running, so many older reports are missing. They should all be restored in due time. Additionally, we're still ironing out integration with [[mw:Extension:PageTriage]]. We'll mark [[phab:T333724]] as resolved once all of the aforementioned has been completed.
Our main concern is the volume of cases that appear in the new feed versus the old. We worry many of these are false positives, and we may be putting too much burden by cluttering the feed with illegitimate cases.


This release also marks the conclusion of a formal agreement with Turnitin. This has been in the works since at least ''May 2022''. Turnitin has been kind enough to give us free credits when we need them, but from a legal standpoint nothing solidified our relationship in the past. Now it is set in stone, and we have the reassurance that CopyPatrol is here to thrive for years to come. They were gracious enough to give us quite a bit of credits exceeding our current consumption, so we will soon be exploring adding more languages to CopyPatrol. On the front of negotiations with Turnitin, I'd like to thank @[[User:Ocaasi|Ocaasi]] who started the conversations, and more recently my colleagues @[[User:SSpalding (WMF)|SSpalding (WMF)]] from Legal, @[[User:JVargas (WMF)|JVargas (WMF)]] from Partnerships, my manager @[[User:KSiebert (WMF)|KSiebert (WMF)]], and our new Lead Community Tech Manager @[[User:JWheeler-WMF|JWheeler-WMF]].
Other questions, which may affect the number of cases reported by the bot:
* Should the bot skip reverted edits? We're planning on changing it so that it doesn't, and for CopyPatrol to clearly indicate which edits have already been reverted, and if you are a sysop, we'll provide a link to revision-delete the diff. Do you agree with this approach?
* The new backend checks ''replaced'' text, and not just ''added'' text. We hope this surfaces more copyvios, but it may be leading to too many false positives. Let us know if you have any thoughts on this.
* The current threshold for matching text against a source is 50%. We're wondering if that should be changed at all.
* Compared to the old feed, the new one surfaces many more sources, including non-internet sources. Some such as [https://plagiabot.toolforge.org/en?id=1829bac5-9de9-43c0-be94-e66c45f71061 this example] have over 30 sources. Is this overkill? Maybe we should collapse the sources in the view to say, 10 maximum, or just omit showing them at all? This is with the understanding that sources towards the top will have a higher matching percentage.
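
To make the threshold and source-collapsing questions above concrete, here is a minimal sketch of how a view could filter and cap the sources of a report. It is not the actual CopyPatrol code: the <code>Source</code> class, the <code>sources_to_display</code> function, and the parameter names are hypothetical, and only the 50% threshold and the cap of 10 come from the questions above.

```python
from dataclasses import dataclass


@dataclass
class Source:
    """One matched source on a report (hypothetical structure)."""
    url: str
    match_percent: float  # share of the checked text that matches this source


def sources_to_display(sources, threshold=50.0, max_shown=10):
    """Keep sources at or above the threshold, best match first, capped at max_shown."""
    kept = [s for s in sources if s.match_percent >= threshold]
    # Highest matching percentage first, so the cap drops the weakest matches.
    kept.sort(key=lambda s: s.match_percent, reverse=True)
    return kept[:max_shown]
```

With these defaults, a report with 30 matching sources would show only the 10 strongest ones, and raising <code>threshold</code> would reduce the number of reports that have any source to show at all.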


Above all, allow me to thank all of ''you'' – our users – who are doing the actual work of helping cleanse the wikis of copyright violations. Your tireless efforts are what drove us to reaching this milestone.
Feel free to leave your thoughts on the associated task ([[phab:T333724]]), or here in this thread. Pinging a few of our most prolific users: @[[User:Diannaa|Diannaa]] @[[User:Moneytrees|Moneytrees]] @[[User:Sphilbrick|Sphilbrick]] @[[User:L3X1|L3X1]] @[[User:DanCherek|DanCherek]] @[[User:Ymblanter|Ymblanter]] @[[User:Framawiki|Framawiki]]


Thanks for your feedback! [[User:MusikAnimal (WMF)|MusikAnimal (WMF)]] ([[User talk:MusikAnimal (WMF)|talk]]) 21:40, 16 August 2023 (UTC)
Warm regards, [[User:MusikAnimal (WMF)|MusikAnimal (WMF)]] ([[User talk:MusikAnimal (WMF)|talk]]) 21:42, 9 April 2024 (UTC)


=== Feedback ===
:Hi @[[User:MusikAnimal (WMF)|MusikAnimal (WMF)]]. The new tool is listing a huge number of cases: 521 cases are listed for August 16, for example, where the original CopyPatrol only listed 108. That's an impossible number of cases for us to complete given the number of patrollers we have that work on this task daily. I can only do about 20 cases per hour tops, and often a lot less. Even with the old version of CopyPatrol, if a key person misses even one day, we have difficulties. So that has to be fixed.{{pb}}Something I see in the old version that I am not yet seeing in the new version: When I click on the iThenticate link, the old version tells me the date the source was crawled. That can be a helpful clue to help determine if the material was copied from elsewhere on Wikipedia or if it's a true copyvio, so I would like to see it included.{{pb}}We don't need to see a huge list of possible sources. This is especially true where the edit itself is tiny. Typically a lot of the potential sources are replicating the same material. [https://plagiabot.toolforge.org/en?id=d6e18937-d866-4ceb-a953-da2dd0fb898c Here is an example]. All the editor did was move some prose from an image caption into the body of the article. If an editor has added a lot of copyvio from multiple sources, it's usually noticeable right away from the page history, and can be checked with Earwig's tool.{{pb}}I love that you've added the ability to search within the loaded pages on the iThenticate report. That is impossible to do in the original version of CopyPatrol, at least on my setup. That's all for now. [[User:Diannaa|Diannaa]] ([[User talk:Diannaa|talk]]) 02:15, 17 August 2023 (UTC)
{{hatnote|1={{fixed}} = code updated and confirmed would not show up if rechecked}}
:I am getting an error message when I attempt to mark a case as "Page fixed" or "No action needed". it says, 'Something went wrong. Please try again.' [[User:Diannaa|Diannaa]] ([[User talk:Diannaa|talk]]) 02:47, 17 August 2023 (UTC)
: Wow, I can actually ''feel'' everything loading faster (imagine my shock on discovering that marking the status of reports is now near-instant). The new features are great, could I share a little bit of feedback?
::Ah, that's a glitch I must have recently introduced. I'll fix it soon, but for now you can ignore the reviewing process since it's identical to the old one, anyway. [[User:MusikAnimal (WMF)|MusikAnimal (WMF)]] ([[User talk:MusikAnimal (WMF)|talk]]) 00:59, 18 August 2023 (UTC)
:* The undo button is really useful, but its location next to the diff button has led to me now clicking it unintentionally multiple times (maybe it could be moved down)
:::This should be fixed now. [[User:MusikAnimal (WMF)|MusikAnimal (WMF)]] ([[User talk:MusikAnimal (WMF)|talk]]) 01:58, 18 August 2023 (UTC)
: Other than that, everything looks good. The leaderboard seems a bit funky, but I imagine that will be fixed with the backfill script. [[User:Isochrone|Isochrone]] ([[User talk:Isochrone|talk]]) 22:06, 9 April 2024 (UTC)
:Hello. As there's unresolved [[:en:User:EranBot/Copyright/rc|Eranbot]] listings from 2015 to 2016, I would like to request all of these listings to be restored to check if they were already resolved. Currently, listings before June 20, 2016 are not at CopyPatrol per [[phab:T138317|Phab]]. Thanks! [[User:MrLinkinPark333|MrLinkinPark333]] ([[User talk:MrLinkinPark333|talk]]) 19:34, 17 August 2023 (UTC)
:It's so awesome to see how this technology and this partnership has evolved and matured. Congrats to everyone who has pushed it so much further!! [[User:Ocaasi|Ocaasi]] ([[User talk:Ocaasi|talk]]) 00:13, 10 April 2024 (UTC)
::Hi @[[User:MrLinkinPark333|MrLinkinPark333]]! As per the phab task, those old reports are still accessible in the [[en:User:EranBot/Copyright/rc|EranBot archives]]. There is no viable means to import them into CopyPatrol, I'm afraid. [[User:MusikAnimal (WMF)|MusikAnimal (WMF)]] ([[User talk:MusikAnimal (WMF)|talk]]) 00:36, 18 August 2023 (UTC)
:::Okay. Thank you for the update! [[User:MrLinkinPark333|MrLinkinPark333]] ([[User talk:MrLinkinPark333|talk]]) 00:53, 18 August 2023 (UTC)
* Amazing! edit summaries are so helpful! thanks to all who made this work. [[User:L3X1|<small>en</small>L3X1]] ¡‹[[User talk:L3X1|delayed reaction]]›¡ 13:39, 10 April 2024 (UTC)
: The new version has many positive changes, such as the quick loading time and the expected reduction in outages. However, on the down side, I see that there's already 212 cases posted for April 10 and there's still three hours to go, so a projected 240 cases to assess in the 24 hour period. Given that most days we only have two people working the queue, this needs to be cut in half if that's possible. It's unrealistic and unsustainable to expect our tiny crew to keep up with the volume otherwise. (I can typically only clear about 20 cases per hour and can only commit to working on this for 3-4 hours per day.) [[User:Diannaa|Diannaa]] ([[User talk:Diannaa|talk]]) 21:20, 10 April 2024 (UTC)
::[[:en:User:EranBot/Copyright/Batches]] lists all the pages where the postings were made, and the work that I did to clean them up before we initiated the CopyPatrol interface. If you wish to investigate those reports, you could do so from those postings. The iThenticate links no longer work though. But I don't think that's a good use of editor time; old cases are very difficult to solve, and we already have a huge amount of work between CopyPatrol, [[:en:wp:CCI]], and [[:en:WP:CP]], and very few people willing to do it. Postings from Batch 46 forward would not need to be checked, because they are duplicates of items that were also listed at CopyPatrol and we dealt with them as they happened on a daily basis. I switched over to working the CopyPatrol queue somewhere around June 17, 2016, and don't have time to do any of those old reports in addition to the hours I spend daily on the CopyPatrol queue. [[User:Diannaa|Diannaa]] ([[User talk:Diannaa|talk]]) 00:42, 18 August 2023 (UTC)
::Yes, many thanks for the improvements! Very grateful. I agree with Diannaa that we may need some tweaks in terms of what the bot flags as a potential copyright violation as the threshold seems to have been lowered compared to before (one example I mentioned on her talk page was that it now flags cases where someone changes one or two words in a paragraph because it detects a match for the remaining text in the paragraph). Not sure we'll be able to handle the reports otherwise. [[User:DanCherek|DanCherek]] ([[User talk:DanCherek|talk]]) 22:34, 10 April 2024 (UTC)
:::The iThenticate IDs still work, the URL was switched to a new one when Copypatrol was introduced so appending them to <code>https://copypatrol.toolforge.org/ithenticate/<ID></code> works. A lot of the pages are already deleted/the additions are long gone, though all the blacklisted links have to be removed as well. It's still probably worth looking at though. [[User:Isochrone|Isochrone]] ([[User talk:Isochrone|talk]]) 09:41, 18 August 2023 (UTC)
:::@[[User:Diannaa|Diannaa]] @[[User:DanCherek|DanCherek]] Thanks for all of the feedback! Can you link to specific example(s)? {{tq|someone changes one or two words in a paragraph because it detects a match for the remaining text in the paragraph}} – wouldn't that still usually be a copyright violation, or do you mean the source is a [[w:WP:BACKWARDSCOPY|backwards copy]] (in which case it's not a copyvio at all)?
:I am very interested in participating, although I am on a bus in Slovenia at the moment, with a packed schedule, so we will see. Sphilbrick (having issues with login so will post logged out) [[Special:Contributions/188.198.37.7|188.198.37.7]] 08:59, 19 August 2023 (UTC)
:::Assuming the cases are still valid, my opinion is that it's perfectly fine to have a backlog. While it's admirable to aim for completeness, you can only volunteer but so much time. If however you're seeing a lot of noise, with backwards copies, or otherwise too many cases that are right on the "borderline", etc., we certainly can work to improve that. [[User:MusikAnimal (WMF)|MusikAnimal (WMF)]] ([[User talk:MusikAnimal (WMF)|talk]]) 22:45, 10 April 2024 (UTC)
::::I'm seeing a lot of cases like [https://copypatrol.wmcloud.org/en?id=9052d5d4-19ef-41c1-837c-8071f944aa11], where someone copyedits a paragraph and then it matches the rest of the unchanged text to a backwards copy. We still had to deal with backwards copies in the old CopyPatrol, of course, but so far it feels like a lot more after the update. [[User:DanCherek|DanCherek]] ([[User talk:DanCherek|talk]]) 22:50, 10 April 2024 (UTC)
::::: <small>{{fixed}} —&thinsp;[[User:JJMC89|JJMC89]]&thinsp;<small>([[User talk:JJMC89|T]]'''·'''[[Special:Contributions/JJMC89|C]])</small> 17:25, 11 April 2024 (UTC)</small>
::::[https://copypatrol.wmcloud.org/en?id=a85453bc-2bc3-4017-9306-76875225f6d7 This report] flags an edit that just cleaned up references with no real new text added. -- [[User:Whpq|Whpq]] ([[User talk:Whpq|talk]]) 22:56, 10 April 2024 (UTC)
::::: <small><del>{{fixed}}</del> —&thinsp;[[User:JJMC89|JJMC89]]&thinsp;<small>([[User talk:JJMC89|T]]'''·'''[[Special:Contributions/JJMC89|C]])</small> 06:44, 11 April 2024 (UTC) modified 16:19, 11 April 2024 (UTC)</small>
::::Due to the large number of Wikipedia mirrors, we will always have false positives. We can waste a lot of valuable time on those cases, attempting to determine who had it first. We do have a [[User:EranBot/Copyright/Blacklist| whitelist of Wikipedia mirrors]] but people who don't know Regex are warned not to edit it. Here's a few more false positives of various kinds. I don't know if these are useful examples or not:
::::*[https://copypatrol.wmcloud.org/en?id=932428dc-f169-4f16-b64d-68f19c2e7d96 Here's one] where an editor removed multiple occurrences of the word "current" from a list. The list itself is public domain of course.
::::*[https://copypatrol.wmcloud.org/en?id=932428dc-f169-4f16-b64d-68f19c2e7d96 Here's one] where an editor moved a paragraph that was reflected in a Wikipedia mirror. The material they added in the same edit is okay to keep.
::::*[https://copypatrol.wmcloud.org/en?id=5ae0d58a-93fc-434b-b37b-a56bc89b601d In this one], an editor actually removes text but since IMDb has copied our plot summary at some point, the item gets listed.
::::*[https://copypatrol.wmcloud.org/en?id=ad9efee3-51d1-456b-8c63-7fc1a45e931c Here's one] that illustrated DanCherek's point: only a few words are added. Purported source: an obvious Wikipedia mirror.
::::Another suggestion: Perhaps we can somehow teach the system to only show us the most likely cases? Maybe there's a way to reduce the threshold for inclusion, regarding the size of the edit or the amount of the overlap? It's not a question of having a backlog; if we don't reduce the fire hose of incoming cases there will be many that never get assessed at all. [[User:Diannaa|Diannaa]] ([[User talk:Diannaa|talk]]) 23:25, 10 April 2024 (UTC)
::::: <small>{{fixed}} <del>first and fourth</del><ins>all</ins>. The second link is the same as the first. —&thinsp;[[User:JJMC89|JJMC89]]&thinsp;<small>([[User talk:JJMC89|T]]'''·'''[[Special:Contributions/JJMC89|C]])</small> 06:44, 11 April 2024 (UTC) modified 17:25, 11 April 2024 (UTC)</small>
::::::Sorry about the duplicate link; I am not going to bother to look for the missing example. New comments:
::::::*Community Tech bot used to remove listings of pages that were already deleted. This doesn't seem to be happening so far: [https://copypatrol.wmcloud.org/en?id=feb8de1f-acdb-4bb1-9656-2e6f423a37a6 deleted article], [https://copypatrol.wmcloud.org/en?id=9f949c62-ee2d-415d-8ec6-ec3e19fd5d32 deleted draft]
::::::*Cases so far at the halfway point of April 11 are a much more manageable 40, so if tweaks are underway, it's working.
::::::[[User:Diannaa|Diannaa]] ([[User talk:Diannaa|talk]]) 12:11, 11 April 2024 (UTC)
::::::: Unfortunately I had to revert one of the fixes due to poor performance causing the bot to buildup a large backlog that hasn't been processed yet. —&thinsp;[[User:JJMC89|JJMC89]]&thinsp;<small>([[User talk:JJMC89|T]]'''·'''[[Special:Contributions/JJMC89|C]])</small> 16:19, 11 April 2024 (UTC)
:One thing I've noticed is that I keep getting logged out every time I close my browser-- is there a cookie persistence issue? I had no such issues with the old backend. [[User:Isochrone|Isochrone]] ([[User talk:Isochrone|talk]]) 13:38, 11 April 2024 (UTC)
::I will look into this. It seems this happens to every new [[w:Symfony|Symfony]] app that I create ([[phab:T224382]]). I managed to fix it before, so I'll attempt it again for CopyPatrol (the old CopyPatrol did not run on Symfony, FYI) [[User:MusikAnimal (WMF)|MusikAnimal (WMF)]] ([[User talk:MusikAnimal (WMF)|talk]]) 19:32, 11 April 2024 (UTC)
:::I just noticed that I can't view the iThenticate reports unless I am logged in to CopyPatrol. So that might be a feature rather than a bug. [[User:Diannaa|Diannaa]] ([[User talk:Diannaa|talk]]) 23:14, 11 April 2024 (UTC)
:::: Logging in is required since each user must agree to the EULA to see the reports. The short login session should get worked on. —&thinsp;[[User:JJMC89|JJMC89]]&thinsp;<small>([[User talk:JJMC89|T]]'''·'''[[Special:Contributions/JJMC89|C]])</small> 22:25, 12 April 2024 (UTC)


{{tracked|T362457|resolved}}
=== Mirrors ===
New feedback: Some users are incorrectly being shown with redlinked user talk pages. [https://copypatrol.wmcloud.org/en?id=21a041be-eb65-42ae-8d00-082e4bfe7fbd Here], [https://copypatrol.wmcloud.org/en?id=b42cffb1-6129-450a-bd58-a2ff5f790c1b here], [https://copypatrol.wmcloud.org/en?id=3933eeb7-7454-41ff-ac5d-070f5ebf28bd here], for example. It appears this might be because they don't have a talk page on Meta, but that's immaterial; I would prefer to be able to see at a glance whether or not a user talk page exists at en.wiki for that username. [[User:Diannaa|Diannaa]] ([[User talk:Diannaa|talk]]) 21:44, 12 April 2024 (UTC)
A new suggestion: We spend an inordinate amount of time repairing unattributed copying within Wikipedia. If some of the more common Wikipedia mirrors could be identified and whitelisted, it would reduce the amount of time we spend on that, which is not as serious a violation as a true copyright violation (copying copyright material from external news sources, books, or elsewhere). There's already a whitelist at [[User:EranBot/Copyright/Blacklist]] but some of the ones I frequently see are not listed there: Bookpedia and Handwiki, for example. [[User:Diannaa|Diannaa]] ([[User talk:Diannaa|talk]]) 15:10, 19 August 2023 (UTC) Adding: It looks like "Wikia" is on Eran's list; but it's now called "Fandom". Should we whitelist that? [[User:Diannaa|Diannaa]] ([[User talk:Diannaa|talk]]) 15:38, 19 August 2023 (UTC)
: Or perhaps pages with a high-similarity to existing articles could be marked as such on the UI to quickly identify/filter CWW, as for removing mirrors the list at [[en:WP:MIRRORS]] is quite extensive and machine-friendly.<br>N.B. are the iThenticate links meant to be broken? [[User:Isochrone|Isochrone]] ([[User talk:Isochrone|talk]]) 17:52, 19 August 2023 (UTC)
::@[[User:MusikAnimal (WMF)|MusikAnimal (WMF)]]: we are getting an error message when attempting to view iThenticate reports in the new version. 'Oops! An Error Occurred. The server returned a "500 Internal Server Error"'. [[User:Diannaa|Diannaa]] ([[User talk:Diannaa|talk]]) 09:44, 21 August 2023 (UTC)
:::@[[User:Diannaa|Diannaa]] {{fixed}} Sorry about that! If it wasn't obvious, this new version of CopyPatrol is a complete rewrite, so some bugs were expected. We'll get everything fixed before we go "live", though :)
:::I'll also note that I just got a 500 error from iThenticate itself. I just refreshed and the report loaded fine, so if you run into this you can try the same. If it happens a lot, we'll report it to Turnitin. [[User:MusikAnimal (WMF)|MusikAnimal (WMF)]] ([[User talk:MusikAnimal (WMF)|talk]]) 16:29, 21 August 2023 (UTC)
:I should have mentioned, the new ignore lists are centralized at [[User:CopyPatrolBot/UrlIgnoreList]] and [[User:CopyPatrolBot/UserIgnoreList]]. Please feel free to edit them as desired. Before we deploy the new CopyPatrol, we'll ensure all the entries are copied over from the old ignore lists, so don't worry about that. [[User:MusikAnimal (WMF)|MusikAnimal (WMF)]] ([[User talk:MusikAnimal (WMF)|talk]]) 16:34, 21 August 2023 (UTC)
::I don't have any knowledge of Regex so I won't be able to add any urls myself unfortunately. [[User:Diannaa|Diannaa]] ([[User talk:Diannaa|talk]]) 16:48, 21 August 2023 (UTC)
:::Yeah, I was wondering if it would be possible to leverage the recently introduced [[w:en:Special:BlockedExternalDomains]] system. Just as with the Spamblacklist, the CopyPatrol URL ignore list almost never truly needs regular expressions, rather just plain URLs. Pinging @[[User:Ladsgroup|Ladsgroup]] for input. I'm happy to file a ticket for this as well as help code and review this effort, if we don't think it will be terribly hard. So basically we'd like to generalize the UI, something like [[Special:EditUrlList/Pagename.json]]. I imagine there are other use cases beyond Spamblacklist and CopyPatrol ignore list. [[User:MusikAnimal (WMF)|MusikAnimal (WMF)]] ([[User talk:MusikAnimal (WMF)|talk]]) 17:03, 21 August 2023 (UTC)
::::Sure thing. I don't think it's too hard to make that happen. [[User:Ladsgroup|Amir]] ([[User talk:Ladsgroup|talk]]) 03:58, 24 August 2023 (UTC)
:::::[[phab:T345217|Bug filed]]. Thanks, Amir! [[User:MusikAnimal (WMF)|MusikAnimal (WMF)]] ([[User talk:MusikAnimal (WMF)|talk]]) 23:51, 29 August 2023 (UTC)
::Hi @[[User:MusikAnimal (WMF)|MusikAnimal (WMF)]], is [[ User:EranBot/Copyright/Blacklist ]] still used? Because it looks like it is still the one maintained by patrollers {{smile}} [[User:Framawiki|Framawiki]] ([[User talk:Framawiki|talk]]) 17:01, 11 December 2023 (UTC)
:::@[[User:Framawiki|Framawiki]] Yes, until the new version goes live, that's the page to use. A redirect will be left when it is changed. We're still waiting on the final approval from Turnitin before we switch everything over. [[User:MusikAnimal (WMF)|MusikAnimal (WMF)]] ([[User talk:MusikAnimal (WMF)|talk]]) 02:17, 19 December 2023 (UTC)


:{{fixed}} [[User:MusikAnimal (WMF)|MusikAnimal (WMF)]] ([[User talk:MusikAnimal (WMF)|talk]]) 19:17, 14 April 2024 (UTC)
=== Edit summaries ===
I just started looking at the new tool. I don't yet have comments on the new tool per se, but since the code is being worked on thought I'd throw out an idea that I would find helpful, and I think it would be pretty easy to implement.


== Moving ignore lists to the CopyPatrol UI ==
'''In a nutshell, I propose that the edit summary be posted as part of the information displayed about the identified edit.'''


In the [[#Feedback|above]] discussion, it was noted how tedious it is to maintain [[User:CopyPatrolBot/UrlIgnoreList]] as it requires knowledge of [[w:regular expression|regular expressions]]. I had an idea that we could get rid of the on-wiki lists and instead have a button "Ignore URLs like this" directly in the CopyPatrol UI. We could do the same for users, too, so you don't have to edit [[User:CopyPatrolBot/UserIgnoreList]]. This is also nice because the new system has the ignore lists centralized on Meta, where not everyone is necessarily able to edit (the page could be semi-protected).
I am fully aware that as soon as I click on the diff button, I can easily see the edit summary, so you might be puzzled why I would want it on the case listing page. My rationale is that I have found, through experience, that looking at the edit summary is one of the most important things to look at because it will help define my process. For example, if the edit summary is "rvv", I'm not going to start with the iThenticate report to see if the text matches some other source, I'm going to look at the history to see if the edit summary is accurate and this is a false positive, because the edit reverts to an earlier version and the matching text arises because the earlier version is in some mirror.
In contrast, if the edit summary states "material copied from {some other article}, see that article for attribution", my process will be a little different.


The only issue I foresee with this idea is the potential for abuse. For that, I was thinking we'd either restrict the ability to ignore URLs and users to "privileged" users – say at least 1,000 edits, or even restrict to sysops? Another option is to go ahead and shield all of CopyPatrol from newbies, as proposed at [[phab:T178700]].
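
As a rough illustration of the proposal, the sketch below shows how a plain domain entered through an "Ignore URLs like this" button could be turned into a safely escaped regular expression, and how the button could be limited to privileged users. Everything in it is an assumption for discussion: the <code>User</code> class, the function names, and the exact rule (sysop, or at least 1,000 edits) are hypothetical and not taken from the CopyPatrol code.

```python
import re
from dataclasses import dataclass

MIN_EDITS = 1000  # hypothetical cut-off, taken from the "at least 1,000 edits" idea


@dataclass
class User:
    """Minimal stand-in for a logged-in CopyPatrol user (hypothetical)."""
    name: str
    edit_count: int
    is_sysop: bool = False


def may_ignore(user):
    """Allow the ignore button for sysops or sufficiently experienced users."""
    return user.is_sysop or user.edit_count >= MIN_EDITS


def domain_to_pattern(domain):
    """Build a regex matching the domain and its subdomains, with no regex knowledge needed."""
    escaped = re.escape(domain.lower())  # "handwiki.org" -> "handwiki\.org"
    return re.compile(r"^https?://([a-z0-9-]+\.)*" + escaped + r"(/|$)", re.IGNORECASE)


def is_ignored(url, patterns):
    """Return True if the URL matches any pattern on the ignore list."""
    return any(p.search(url) for p in patterns)
```

For example, <code>domain_to_pattern("handwiki.org")</code> would match <code>https://www.handwiki.org/wiki/Foo</code> but not <code>https://nothandwiki.org/</code>, so patrollers would only ever type the plain domain.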
"So what", you might be thinking, because I'm always going to click on the diff where I can see the edit summary. The point is that I have different processes depending on the edit summary, and I think it would be more efficient if I could glance at edit summaries and work on similar issues as a group. So, for example, I could glance down the page and look for all of the edit summaries containing RVV, or revert to earlier version or something similar, handle all of those, then come back and look for all of the edit summaries indicating it's a copy from another article, handle all those, and then look for another group of similar articles. Maybe my age is showing, but I don't switch gears is easily as I used to, so I would find it more efficient if I could handle half a dozen reports consecutively where my process is the same, then switch to a different type.


Thoughts? [[User:MusikAnimal (WMF)|MusikAnimal (WMF)]] ([[User talk:MusikAnimal (WMF)|talk]]) 19:43, 11 April 2024 (UTC)
If this only helps me it's not worth implementing, but if someone else finds this potentially useful, I think it's almost a trivial change: copy the edit summary and place it on the report somewhere. (My simple suggestion would be to just drop it below the iThenticate report button, but if there is an easier option, as long as it's always in the same place I'll be happy.)--[[User:Sphilbrick|Sphilbrick]] ([[User talk:Sphilbrick|talk]]) 12:06, 29 August 2023 (UTC)
: I can't imagine any issues with this for URLs. With users, making it too easy, even for admins (who are humans), to exclude users may lead to unintentional removals of users who should be flagged, or people being too liberal with the ignore button.<br>When there ''are'' errors on the wikitext list, this can just be rectified by another user: would there be a way to "un-ignore" users in case of errors? [[User:Isochrone|Isochrone]] ([[User talk:Isochrone|talk]]) 20:28, 11 April 2024 (UTC)
:@[[User:Sphilbrick|Sphilbrick]] Ask and you shall receive :) In addition to edit summaries, I've also added tags and the edit size. The tags are especially useful I hope, as they will tell you if it's a revert, or if it was ''reverted''. In the latter case, I was thinking of providing a "revdel" link next to the "Diff" link for quick access to the revision delete form. Would that be useful? [[User:MusikAnimal (WMF)|MusikAnimal (WMF)]] ([[User talk:MusikAnimal (WMF)|talk]]) 00:15, 30 August 2023 (UTC)
::I think it makes sense to have an interface to manage the ignored URLs and users. [[User:MusikAnimal (WMF)|MusikAnimal (WMF)]] ([[User talk:MusikAnimal (WMF)|talk]]) 22:43, 14 April 2024 (UTC)
::This is great. (wish you had been at my board meeting last night, a lot of asking, and not a lot of receiving:). Yes, easy access to the Revdel button would be nice. Edit I just noticed you said I have rather than I will; very nice thanks. [[User:Sphilbrick|Sphilbrick]] ([[User talk:Sphilbrick|talk]]) 10:40, 30 August 2023 (UTC)


== CopyPatrol has stopped, but.. ==
=== Review comments ===
I'm not sure of how difficult this is, but perhaps adding a review comment button (i.e. under the resolve options) would be useful, as opposed to more options as previously proposed? I know this is mainly a focus on backend changes and I can file a task on Phab if appropriate, but for some cases it may not be obvious to other "patrollers" about the action taken.<br>I can make a little mockup if that helps. Thanks for all the work you and the comtech team are doing. [[User:Isochrone|Isochrone]] ([[User talk:Isochrone|talk]]) 19:52, 30 August 2023 (UTC)
:I believe what you're asking for is basically the same as [[phab:T279083]], only more generalized. I was thinking we could allow adding any arbitrary comments, but also have a dropdown of commonly used ones. That list can be configurable by CopyPatrol users.
:With the new system this is all much easier to implement, so I will look into it :) [[User:MusikAnimal (WMF)|MusikAnimal (WMF)]] ([[User talk:MusikAnimal (WMF)|talk]]) 21:17, 1 September 2023 (UTC)


CopyPatrol has stopped, because Turnitin is down for maintenance. Check https://turnitin.statuspage.io/ for updates. [[User:Diannaa|Diannaa]] ([[User talk:Diannaa|talk]]) 19:36, 20 April 2024 (UTC)
=== Pre-filled revision deletion ===
You mentioned the possibility of a link to the revision deletion template.


== I keep getting logged out ==
This reminds me of something I've always wanted to ask for, but didn't think I could justify setting up a project for this small request. However, if you are actively working on a new version maybe now's the time.


Maybe I'm losing my mind but I find myself logged out of copy patrol numerous times each week, even though I don't close the window or log out of OAuth at all. I swear the only time I had to log in to CopyPatrol on the old system was when I rebooted. Is there a setting somewhere I can change to keep me logged in or is this the new normal? Thanks [[User:L3X1|<small>en</small>L3X1]] ¡‹[[User talk:L3X1|delayed reaction]]›¡ 00:53, 28 April 2024 (UTC)
I use a number of the options in Special:RevisionDelete when generally working on RD1 requests, but if I am carrying out a revision deletion in the context of copy patrol work, four of the five choices are identical in close to 100% of the cases. It would be nice and helpful if a customized RD1 template came up when doing copy patrol tasks.


:@[[User:L3X1|L3X1]] This was brought up in the feedback above. I have just deployed a change that I hope will help. Please let me know if it does (same for @[[User:Diannaa|Diannaa]] and everyone else :). The larger issue is rather a mystery, I'm afraid. I hope to investigate it more soon. You can follow [[phab:T224382]] for updates. Best, [[User:MusikAnimal (WMF)|MusikAnimal (WMF)]] ([[User talk:MusikAnimal (WMF)|talk]]) 05:14, 28 April 2024 (UTC)
I would preset the template with:
* Delete revision text '''Set'''
* Delete edit summary '''Do not change'''
* Delete performer's username/IP address '''Do not change'''
Reason: <Pre-fill with the RD1 option>


To put it differently, the standard invocation has three visibility restrictions for which the default is "Do not change" for all three. Change the first default from "Do not change" to "Set". The reason field is a drop-down box allowing the editor to choose from seven options. There is no default, so prefill or make the default the RD1 option.


There is also a field for "other/additional reason". I don't know about other editors, but I typically use that field to add the URL of the copyrighted source material. I fully grant that changing the first field and selecting the reason is only a couple of clicks, but a couple of clicks repeated ten thousand times adds up. This customized template would mean I could just drop in the source URL, which I typically already have in my buffer, and complete the RD1 in half the time.--[[User:Sphilbrick|Sphilbrick]] ([[User talk:Sphilbrick|talk]]) 13:10, 31 August 2023 (UTC)
:@[[User:Sphilbrick|Sphilbrick]] Done! I've also added an undo link (as you wouldn't usually rollback here), and a Delete link for new pages. The latter fills in the deletion summary with G12, and also supplies the top source URL. I can't do the same for Undo and also have the automated summary (''undid revision by so-and-so''), unfortunately. [[User:MusikAnimal (WMF)|MusikAnimal (WMF)]] ([[User talk:MusikAnimal (WMF)|talk]]) 21:13, 1 September 2023 (UTC)
:Oh, I should mention, however, that the deletion reason auto-selection only works for English Wikipedia, as we must hard-code the value. This doesn't scale well and is fragile (i.e. if someone changes the copyvio reason at [[en:MediaWiki:Deletereason-dropdown]] then our code must also be updated). Longer-term, I was thinking we could have an interface page where admins for said wiki can customize the links in CopyPatrol. This would allow CopyPatrol users to update the deletion reason as needed without developer intervention, and also allow each wiki to customize links that meet their workflows. [[User:MusikAnimal (WMF)|MusikAnimal (WMF)]] ([[User talk:MusikAnimal (WMF)|talk]]) 21:20, 1 September 2023 (UTC)
::I tried the revdel option and loved it. I didn't even ask to prefill the source as I thought that was asking too much, but at least in this case it worked. I was able to invoke the revdel and complete it with a single click. KUDOS [[User:Sphilbrick|Sphilbrick]] ([[User talk:Sphilbrick|talk]]) 13:05, 2 September 2023 (UTC)


=== Match percentage ===
The percentage shown next to each "Compare" line now shows two places after the decimal point instead of rounding to a full percentage point. Is this something that was requested? It probably doesn't hurt anything, and if there is value in the extra places, fine, but I can't think of a situation where I would need the numbers to the right of the decimal point.--[[User:Sphilbrick|Sphilbrick]] ([[User talk:Sphilbrick|talk]]) 11:41, 3 September 2023 (UTC)


=== Missing reports ===
I understand this is a complete rewrite, so one shouldn't expect the exact same set of cases in the rewrite and the legacy. However, I am puzzled to see this page: [[Draft:Nordic Film & TV Fund]] show up in the legacy but not in the rewrite. It may be gone by the time you see this, but it was a 93% match and essentially a copy-paste from the "about us" page for the organization. I notice it is in draft space, but I do see some entries in draft space in the new version, so I am puzzled why this one wasn't picked up. [[User:Sphilbrick|Sphilbrick]] ([[User talk:Sphilbrick|talk]]) 12:26, 4 September 2023 (UTC)
: It is at [[:toolforge:plagiabot/en?id=8ef096d3-d98d-4c31-9d25-dcbf294c2286]]. —&thinsp;[[User:JJMC89|JJMC89]]&thinsp;<small>([[User talk:JJMC89|T]]'''·'''[[Special:Contributions/JJMC89|C]])</small> 17:24, 4 September 2023 (UTC)
::Thanks, wonder how I missed it. Good to see. [[User:Sphilbrick|Sphilbrick]] ([[User talk:Sphilbrick|talk]]) 22:02, 4 September 2023 (UTC)
=== Damage score ===
I noticed "damage score" for the first time today. Example:
[https://plagiabot.toolforge.org/en?filter=open&filterPage=Greg+Smith+%28cricketer%2C+born+1988%29&filterUser= Link]


My very cursory review of the entries on the current page identified three examples.


Can you explain what this means?--[[User:Sphilbrick|Sphilbrick]] ([[User talk:Sphilbrick|talk]]) 22:28, 14 September 2023 (UTC)


:@[[User:Sphilbrick|Sphilbrick]] It's the same as the [[mw:ORES|ORES score]] in the old system. ORES has been replaced by a new service called Lift Wing, so we can't call it "ORES" anymore. The models are still the same, though, in this case the [[mw:ORES/FAQ#What_are_damaging_edits_(sometimes_called_"vandalism")?|"damaging" model]]. I didn't add a link yet as I assume the Machine Learning team will move the documentation now that it has a new name. [[User:MusikAnimal (WMF)|MusikAnimal (WMF)]] ([[User talk:MusikAnimal (WMF)|talk]]) 03:07, 16 September 2023 (UTC)
::OK thanks. [[User:Sphilbrick|Sphilbrick]] ([[User talk:Sphilbrick|talk]]) 12:49, 16 September 2023 (UTC)


==== Do we even need damage scores? ====

On the topic of damage scores (previously called ORES), I'm wondering just how useful this information is for CopyPatrol users. I ask because it is by far the slowest part of the application, especially with the new Lift Wing system that replaced ORES, which requires us to make a separate request for each revision instead of doing a bulk query. Once we fetch a damage score, we cache it, but since the feed is constantly updated, it will usually take a while on the first load of a session. If we take out damage scores entirely, you should experience a significant performance improvement. Pinging a few top users for feedback: {{ping|Sphilbrick|Diannaa|DanCherek|L3X1|Ymblanter|Moneytrees}} Thanks, [[User:MusikAnimal (WMF)|MusikAnimal (WMF)]] ([[User talk:MusikAnimal (WMF)|talk]]) 17:32, 5 October 2023 (UTC)

:I guess I should note that since the old ORES system is now gone, I had to disable it in the old UI entirely, so you all have been going at least a few weeks now without "damage" (aka ORES) scores and no one seems to have complained… perhaps I already have the answer I need. [[User:MusikAnimal (WMF)|MusikAnimal (WMF)]] ([[User talk:MusikAnimal (WMF)|talk]]) 17:34, 5 October 2023 (UTC)
::I am not even sure I know what a damage score is. Unless it is the same as the percentage of text overlap, I am probably not using it at all. [[User:Ymblanter|Ymblanter]] ([[User talk:Ymblanter|talk]]) 17:44, 5 October 2023 (UTC)
:::I see, it is not the same. I am unlikely to use it. [[User:Ymblanter|Ymblanter]] ([[User talk:Ymblanter|talk]]) 17:45, 5 October 2023 (UTC)
::::I don't think I ever used it, user edit count is what grabs my attention first then I go straight to the diff [[User:L3X1|<small>en</small>L3X1]] ¡‹[[User talk:L3X1|delayed reaction]]›¡ 01:44, 6 October 2023 (UTC)
:Removing it would not affect my workflow at all. [[User:DanCherek|DanCherek]] ([[User talk:DanCherek|talk]]) 18:08, 5 October 2023 (UTC)
:::I don't use it.[[User:Diannaa|Diannaa]] ([[User talk:Diannaa|talk]]) 21:00, 5 October 2023 (UTC)
::::Great, thanks for the replies, all! [[User:MusikAnimal (WMF)|MusikAnimal (WMF)]] ([[User talk:MusikAnimal (WMF)|talk]]) 00:48, 10 October 2023 (UTC)
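
For readers wondering why the first load of a session was slow, the fetch-then-cache pattern described in this thread can be sketched roughly as follows. This is only an illustration under stated assumptions: <code>fake_fetch</code> is a hypothetical stand-in for the one-request-per-revision call to the scoring service, and none of the names below come from the CopyPatrol code base.

```python
from typing import Callable, Dict, Iterable, List


def make_cached_scorer(
    fetch_score: Callable[[int], float],
) -> Callable[[Iterable[int]], Dict[int, float]]:
    """Wrap a per-revision fetch function with a simple in-memory cache."""
    cache: Dict[int, float] = {}

    def scores_for(rev_ids: Iterable[int]) -> Dict[int, float]:
        result: Dict[int, float] = {}
        for rev_id in rev_ids:
            if rev_id not in cache:
                # No bulk query is assumed: one request per uncached revision.
                cache[rev_id] = fetch_score(rev_id)
            result[rev_id] = cache[rev_id]
        return result

    return scores_for


# Hypothetical stand-in for the remote scoring service; records each call.
calls: List[int] = []


def fake_fetch(rev_id: int) -> float:
    calls.append(rev_id)
    return (rev_id % 100) / 100


scores_for = make_cached_scorer(fake_fetch)
first = scores_for([101, 102, 103])   # three fetches on the first load
second = scores_for([102, 103, 104])  # only revision 104 is new
```

Because the feed keeps receiving new revisions, each page load still pays for the revisions that are not yet cached, which matches the slowdown described above.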

== RevDel'd diffs get marked No Action Needed? ==

ran across 3 in a row (https://copypatrol.toolforge.org/en/?id=102542253 https://copypatrol.toolforge.org/en/?id=102542234 https://copypatrol.toolforge.org/en/?id=102542204) and marked them as no action needed. Is there a way to make them not show up in copy patrol? [[User:L3X1|<small>en</small>L3X1]] ¡‹[[User talk:L3X1|delayed reaction]]›¡ 01:45, 28 September 2023 (UTC)

:I am not involved in the development of this tool, but I've occasionally noticed something in the same vein — the tool identifies a problem which has been addressed by some editor other than those involved in reviewing the reports, so has not been identified as fixed. The thought has crossed my mind that it would be nice to know about this, but I think it might be challenging to do so. Having said that, it is my experience that it's helpful to identify the goal, because sometimes what sounds like an intractable problem has a reasonable fix.
:I'll start with my summary of why it's a problem, using [[en:Ruth Yeazell|Ruth Yeazell]] as an example:
:* at 00:41 28 Sep an edit was made to the article adding some copyrighted text
:* At some unknown time shortly after, the edit was examined by Copy Patrol and identified as a potential copyright violation.
:* Almost immediately thereafter the report was added to the database
:* at 00:42 Gobonobo reverted the edit. (I'm guessing this edit occurred before the report was added to the copy patrol logs but I'm not sure that it matters)
:* at 01:33 zzuuzz performed the revision deletion. (I'm speculating but it seems likely this action occurred after the report was added to the database)
:If we want the database to reflect the fact that the material in question has subsequently been removed, this means that the tool has to constantly revisit the article and examine any edits subsequent to the identified edit. I presume that's physically possible, but by definition it's not an action that can take place at the time the original report is filed, unless review of potential offending edits occurs well after creation. It would also mean a different type of examination. I presume the contents of an edit are now examined and compared against a database of existing material, but examining subsequent edits might have to look at edit summaries or indicators that the material is revision deleted. Sounds possible, but it sounds like a very different action than is undertaken to identify potential violations.
:Note in this particular example there are two edits that could trigger the removal. There is the edit by Gobonobo which reverted the edit in question and then the later edit by zzuuzz to do the revision deletion. If we were to push for a change of this sort, should it be restricted to revision deletions, or should it also pick up ordinary edits removing material? Should the subsequent review identifying that the original offending material has now been removed simply remove the entry from the database, or should it trigger an update to the report identifying that it may have been addressed? [[User:Sphilbrick|Sphilbrick]] ([[User talk:Sphilbrick|talk]]) 15:10, 28 September 2023 (UTC)
::I will say that the [[#New backend coming soon|new version]] (now at https://copypatrol-test.wmcloud.org/ – OAuth login not working yet) will show any tags associated with an edit such as "reverted", so you will have that info upfront.
::We can also easily check if an edit has been revdel'd and indicate it as so in the UI. If you'd rather the system automatically remove them, I can make it so, but it sounds like @[[User:Sphilbrick|Sphilbrick]] is questioning if that's always what we want? I would guess that any subsequent edits that are also copyvios will also show up in the feed, or at least they should. [[User:MusikAnimal (WMF)|MusikAnimal (WMF)]] ([[User talk:MusikAnimal (WMF)|talk]]) 17:28, 5 October 2023 (UTC)

== No recent reports ==

The most recent entry is 18 October [[User:Sphilbrick|Sphilbrick]] ([[User talk:Sphilbrick|talk]]) 15:17, 2 November 2023 (UTC)

:{{re|MusikAnimal (WMF)|MusikAnimal}} Any idea as to why this might have been? I'm seeing reports from today now. — <span style="background: linear-gradient(#990000,#660000)">[[User:Red-tailed hawk|<span style="color: white">Red-tailed&nbsp;hawk</span>]]&nbsp;<sub>[[User talk:Red-tailed hawk|<span style="color: white">(nest)</span>]]</sub></span> 15:53, 3 November 2023 (UTC)
::There was an [[phab:T350399|iThenticate outage]] on November 2. That would be why there were fewer reports around then. Beyond that, when viewing "[https://copypatrol.toolforge.org/en/?filter=all&filterPage=&filterUser= All cases]", I'm seeing a normal stream over the past several weeks. [[User:MusikAnimal (WMF)|MusikAnimal (WMF)]] ([[User talk:MusikAnimal (WMF)|talk]]) 21:50, 3 November 2023 (UTC)

== Fix the problem by the same user! ==

It should NOT be allowed to a user to "mark his/her own articles as fixed". Otherwise this tool will NOT be trusted.
Here is an example: (https://copypatrol.toolforge.org/ar/?filter=reviewed&filterUser=Kamalelsayedmohamed), his articles have 99% copy from other site, then the user marked them as "Fixed"! or "No action needed"! [[User:Dr-Taher|Dr-Taher]] ([[User talk:Dr-Taher|talk]]) 05:59, 30 November 2023 (UTC)

:@[[User:Dr-Taher|Dr-Taher]] See {{phab|T334272}}. [[User:1AmNobody24|1AmNobody24]] ([[User talk:1AmNobody24|talk]]) 06:58, 30 November 2023 (UTC)
::Thanks @[[User:1AmNobody24|1AmNobody24]], but more than 7 months, and no action is taken! [[User:Dr-Taher|Dr-Taher]] ([[User talk:Dr-Taher|talk]]) 10:05, 30 November 2023 (UTC)
:::I'll get this implemented in the new version, which we'll be rolling out before the end of the year. However if the intention is solely to prevent misuse, it's worth noting a bad actor can easily get around this by simply creating a new account and using that to review their other account's edits. Perhaps use of CopyPatrol should be limited to autoconfirmed accounts? [[User:MusikAnimal (WMF)|MusikAnimal (WMF)]] ([[User talk:MusikAnimal (WMF)|talk]]) 20:23, 30 November 2023 (UTC)
::::Could there be an option for local communities to associate it with different rights (for example, it could be limited on EnWiki to new page reviewers if the community wants it, since Autoconfirmed gaming is very easy). — <span style="background: linear-gradient(#990000,#660000)">[[User:Red-tailed hawk|<span style="color: white">Red-tailed&nbsp;hawk</span>]]&nbsp;<sub>[[User talk:Red-tailed hawk|<span style="color: white">(nest)</span>]]</sub></span> 03:29, 2 December 2023 (UTC)
:::::@[[User:Red-tailed hawk|Red-tailed hawk]] There's talk about that here, {{phab|T178700}}. Auto-confirmed globally and either that or Extended confirmed for EN Wiki (@[[User:MusikAnimal (WMF)|MusikAnimal (WMF)]] your task {{smiley}}) [[User:1AmNobody24|<span style="border:1px solid black;padding:1px;background-color: #4D4DFF;color: white">Nobody</span>]] ([[User talk:1AmNobody24|<span style="color: #4D4DFF">talk</span>]]) 13:06, 4 December 2023 (UTC)

== Can the tool access paywalled full texts? ==

Curious whether this tool would detect violations like [https://en.wikipedia.org/w/index.php?diff=661707469 this] from 2015 which copied from [https://link-springer-com.wikipedialibrary.idm.oclc.org/chapter/10.1007/978-3-319-09132-7_3#Sec5 this source]<sub>(you'll need to log in)</sub>? If not, have you considered whether the tool can be linked up with [[:en:WP:TWL|The Wikipedia Library]] to access full texts? [[User:Smartse|Smartse]] ([[User talk:Smartse|talk]]) 10:59, 19 December 2023 (UTC)

:@[[User:Smartse|Smartse]] I tried it by copying that old version to [[en:Draft:Sandbox|Draft:Sandbox]]. CopyPatrol picked up the edit [https://copypatrol.toolforge.org/en/?id=105079867]. In the iThenticate-Report it shows that source as a 13% match. [[User:1AmNobody24|<span style="border:1px solid black;padding:1px;background-color: #4D4DFF;color: white">Nobody</span>]] ([[User talk:1AmNobody24|<span style="color: #4D4DFF">talk</span>]]) 13:16, 19 December 2023 (UTC)

::{{re|1AmNobody24}} Thanks for that - I see that percentage at 9% for link.springer.com, but looking at https://www.ithenticate.com/ I see that they do indeed have the full texts for many paywalled articles. Good to see that we should catch edits like this today, but I wonder how many we missed! [[User:Smartse|Smartse]] ([[User talk:Smartse|talk]]) 12:29, 21 December 2023 (UTC)

== Question about marking edits ==

When I encounter an edit that somebody else has already fixed (by removing content and adding copyvio-revdel tags, or by tagging for G12), should I mark the edit as "Page fixed" or as "No action needed"? I've been marking these sorts of things as "Page fixed", since it was a true copyvio and the page was fixed, but the use of {{tq|you}} in {{tq|If you fixed the problem, tagged the page for revision deletion, or tagged the page for deletion as a copyright violation, mark it as "Page fixed"}} is now giving me a bit of pause. — <span style="background: linear-gradient(#990000,#660000)">[[User:Red-tailed hawk|<span style="color: white">Red-tailed&nbsp;hawk</span>]]&nbsp;<sub>[[User talk:Red-tailed hawk|<span style="color: white">(nest)</span>]]</sub></span> 02:54, 21 December 2023 (UTC)

:@[[User:Red-tailed hawk|Red-tailed hawk]] I also mark those as Page fixed. You think something like {{tq|If the problem is fixed, the page tagged for revision deletion, or tagged for deletion as a copyright violation, mark it as "Page fixed"}} could be better? [[User:1AmNobody24|<span style="border:1px solid black;padding:1px;background-color: #4D4DFF;color: white">Nobody</span>]] ([[User talk:1AmNobody24|<span style="color: #4D4DFF">talk</span>]]) 06:27, 21 December 2023 (UTC)

Latest revision as of 19:44, 3 May 2024

NOTE: This page may not be regularly checked. If you need prompt attention from the maintainers please ping a member of Community Tech.

New CopyPatrol is live

I'm thrilled to announce the new version of CopyPatrol is now live at https://copypatrol.wmcloud.org. All existing links should redirect to the right place. Please join me in thanking @JJMC89 for his tremendous help in this effort. He probably deserves most of the credit here, but certainly all of it for the backend that he completely rewrote from scratch. The new backend should be much more resilient, with the sporadic downtime that we occasionally see hopefully being a thing of the past. In addition, the new frontend offers a number of new features:

  • Significant performance improvements
  • Edit summaries, change tags, and diff sizes
  • "Undo" or "revdel" links for users who have the requisite permissions

One notable change you might see is that the iThenticate reports no longer include the crawl date. Unfortunately this is outside our control. The Turnitin product team has been made aware of this feature request, so we hope it will eventually be reinstated.

Please let myself or JJMC89 know of any issues you see. At the time of writing, the backfill script is still running, so many older reports are missing. They should all be restored in due time. Additionally, we're still ironing out integration with mw:Extension:PageTriage. We'll mark phab:T333724 as resolved once all of the aforementioned has been completed.

This release also marks the conclusion of a formal agreement with Turnitin. This has been in the works since at least May 2022. Turnitin has been kind enough to give us free credits when we need them, but from a legal standpoint nothing solidified our relationship in the past. Now it is set in stone, and we have the reassurance that CopyPatrol is here to thrive for years to come. They were gracious enough to give us quite a bit of credits exceeding our current consumption, so we will soon be exploring adding more languages to CopyPatrol. On the front of negotiations with Turnitin, I'd like to thank @Ocaasi who started the conversations, and more recently my colleagues @SSpalding (WMF) from Legal, @JVargas (WMF) from Partnerships, my manager @KSiebert (WMF), and our new Lead Community Tech Manager @JWheeler-WMF.

Above all, allow me to thank all of you – our users – who are doing the actual work of helping cleanse the wikis of copyright violations. Your tireless efforts are what drove us to reaching this milestone.

Warm regards, MusikAnimal (WMF) (talk) 21:42, 9 April 2024 (UTC)

Feedback

Fixed = code updated and confirmed would not show up if rechecked
Wow, I can actually feel everything loading faster (imagine my shock on discovering that marking the status of reports is now near-instant). The new features are great, could I share a little bit of feedback?
  • The undo button is really useful, but its location next to the diff button has led to me now clicking it unintentionally multiple times (maybe it could be moved down)
Other than that, everything looks good. The leaderboard seems a bit funky, but I imagine that will be fixed with the backfill script. Isochrone (talk) 22:06, 9 April 2024 (UTC)
It's so awesome to see how this technology and this partnership have evolved and matured. Congrats to everyone who has pushed it so much further!! Ocaasi (talk) 00:13, 10 April 2024 (UTC)
The new version has many positive changes, such as the quick loading time and the expected reduction in outages. However, on the down side, I see that there are already 212 cases posted for April 10 and there are still three hours to go, so a projected 240 cases to assess in the 24 hour period. Given that most days we only have two people working the queue, this needs to be cut in half if that's possible. It's unrealistic and unsustainable to expect our tiny crew to keep up with the volume otherwise. (I can typically only clear about 20 cases per hour and can only commit to working on this for 3-4 hours per day.) Diannaa (talk) 21:20, 10 April 2024 (UTC)
Yes, many thanks for the improvements! Very grateful. I agree with Diannaa that we may need some tweaks in terms of what the bot flags as a potential copyright violation as the threshold seems to have been lowered compared to before (one example I mentioned on her talk page was that it now flags cases where someone changes one or two words in a paragraph because it detects a match for the remaining text in the paragraph). Not sure we'll be able to handle the reports otherwise. DanCherek (talk) 22:34, 10 April 2024 (UTC)
@Diannaa @DanCherek Thanks for all of the feedback! Can you link to specific example(s)? someone changes one or two words in a paragraph because it detects a match for the remaining text in the paragraph – wouldn't that still usually be a copyright violation, or do you mean the source is a backwards copy (in which case it's not a copyvio at all)?
Assuming the cases are still valid, my opinion is that it's perfectly fine to have a backlog. While it's admirable to aim for completeness, you can only volunteer so much time. If however you're seeing a lot of noise, with backwards copies, or otherwise too many cases that are right on the "borderline", etc., we certainly can work to improve that. MusikAnimal (WMF) (talk) 22:45, 10 April 2024 (UTC)
I'm seeing a lot of cases like [1], where someone copyedits a paragraph and then it matches the rest of the unchanged text to a backwards copy. We still had to deal with backwards copies in the old CopyPatrol, of course, but so far it feels like a lot more after the update. DanCherek (talk) 22:50, 10 April 2024 (UTC)
Fixed — JJMC89(T·C) 17:25, 11 April 2024 (UTC)
This report flags an edit that just cleaned up references with no real new text added. -- Whpq (talk) 22:56, 10 April 2024 (UTC)
Fixed — JJMC89(T·C) 06:44, 11 April 2024 (UTC) modified 16:19, 11 April 2024 (UTC)
Due to the large number of Wikipedia mirrors, we will always have false positives. We can waste a lot of valuable time on those cases, attempting to determine who had it first. We do have a whitelist of Wikipedia mirrors but people who don't know Regex are warned not to edit it. Here's a few more false positives of various kinds. I don't know if these are useful examples or not:
  • Here's one where an editor removed multiple occurrences of the word "current" from a list. The list itself is public domain of course.
  • Here's one where an editor moved a paragraph that was reflected in a Wikipedia mirror. The material they added in the same edit is okay to keep.
  • In this one, an editor actually removes text but since IMDb has copied our plot summary at some point, the item gets listed.
  • Here's one that illustrated DanCherek's point: only a few words are added. Purported source: an obvious Wikipedia mirror.
Another suggestion: Perhaps we can somehow teach the system to only show us the most likely cases? Maybe there's a way to reduce the threshold for inclusion, regarding the size of the edit or the amount of the overlap? It's not a question of having a backlog; if we don't reduce the fire hose of incoming cases there will be many that never get assessed at all. Diannaa (talk) 23:25, 10 April 2024 (UTC)
Fixed all. The second link is the same as the first. — JJMC89(T·C) 06:44, 11 April 2024 (UTC) modified 17:25, 11 April 2024 (UTC)
Sorry about the duplicate link; I am not going to bother to look for the missing example. New comments:
  • Community Tech bot used to remove listings of pages that were already deleted. This doesn't seem to be happening so far: deleted article, deleted draft
  • Cases so far at the halfway point of April 11 are a much more manageable 40, so if tweaks are underway, it's working.
Diannaa (talk) 12:11, 11 April 2024 (UTC)
Unfortunately I had to revert one of the fixes due to poor performance causing the bot to build up a large backlog that hasn't been processed yet. — JJMC89(T·C) 16:19, 11 April 2024 (UTC)
One thing I've noticed is that I keep getting logged out every time I close my browser-- is there a cookie persistence issue? I had no such issues with the old backend. Isochrone (talk) 13:38, 11 April 2024 (UTC)
I will look into this. It seems this happens to every new Symfony app that I create (phab:T224382). I managed to fix it before, so I'll attempt it again for CopyPatrol (the old CopyPatrol did not run on Symfony, FYI) MusikAnimal (WMF) (talk) 19:32, 11 April 2024 (UTC)
I just noticed that I can't view the iThenticate reports unless I am logged in to CopyPatrol. So that might be a feature rather than a bug. Diannaa (talk) 23:14, 11 April 2024 (UTC)
Logging in is required since each user must agree to the EULA to see the reports. The short login session should get worked on. — JJMC89(T·C) 22:25, 12 April 2024 (UTC)
Tracked in Phabricator:
Task T362457 resolved

New feedback: Some users are incorrectly being shown with redlinked user talk pages. Here, here, here, for example. It appears this might be because they don't have a talk page on Meta, but that's immaterial; I would prefer to be able to see at a glance whether or not a user talk page exists at en.wiki for that username. Diannaa (talk) 21:44, 12 April 2024 (UTC)

Fixed MusikAnimal (WMF) (talk) 19:17, 14 April 2024 (UTC)

Moving ignore lists to the CopyPatrol UI

In the above discussion, it was noted how tedious it is to maintain User:CopyPatrolBot/UrlIgnoreList as it requires knowledge of regular expressions. I had an idea that we could get rid of the on-wiki lists and instead have a button "Ignore URLs like this" directly in the CopyPatrol UI. We could do the same for users, too, so you don't have to edit User:CopyPatrolBot/UserIgnoreList. This is also nice because the new system has the ignore lists centralized on Meta, where not everyone is necessarily able to edit (the page could be semi-protected).

The only issue I foresee with this idea is the potential for abuse. For that, I was thinking we'd either restrict the ability to ignore URLs and users to "privileged" users – say at least 1,000 edits, or even restrict to sysops? Another option is to go ahead and shield all of CopyPatrol from newbies, as proposed at phab:T178700.

Thoughts? MusikAnimal (WMF) (talk) 19:43, 11 April 2024 (UTC)

I can't imagine any issues with this for URLs. With users, making it too easy, even for admins (who are humans), to exclude users may lead to unintentional removals of users who should be flagged, or people being too liberal with the ignore button.
When there are errors on the wikitext list, this can just be rectified by another user: would there be a way to "un-ignore" users in case of errors? Isochrone (talk) 20:28, 11 April 2024 (UTC)
I think it makes sense to have an interface to manage the ignored URLs and users. MusikAnimal (WMF) (talk) 22:43, 14 April 2024 (UTC)
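
For anyone unfamiliar with why the on-wiki list needs regex knowledge, here is a minimal sketch of how a regular-expression ignore list could be applied to a source URL. It is an illustration only: the patterns and domains below are hypothetical, they are not copied from User:CopyPatrolBot/UrlIgnoreList, and the bot's actual matching logic may differ.

```python
import re

# Hypothetical ignore-list entries, one regular expression per line.
IGNORE_PATTERNS = [
    r"^https?://([a-z0-9-]+\.)*wikimirror\.example/",
    r"^https?://archive\.example/wiki/",
]
COMPILED = [re.compile(pattern, re.IGNORECASE) for pattern in IGNORE_PATTERNS]


def is_ignored(url: str) -> bool:
    """Return True if the URL matches any ignore-list pattern."""
    return any(pattern.search(url) for pattern in COMPILED)


print(is_ignored("https://en.wikimirror.example/wiki/Foo"))
print(is_ignored("https://news.example/wikimirror.example/"))
```

An "Ignore URLs like this" button would in effect have to generate a pattern such as the first entry from a single clicked URL, which is the part that currently requires hand-written regex.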

CopyPatrol has stopped, but..

CopyPatrol has stopped, because Turnitin is down for maintenance. Check https://turnitin.statuspage.io/ for updates. Diannaa (talk) 19:36, 20 April 2024 (UTC)

I keep getting logged out

Maybe I'm losing my mind but I find myself logged out of copy patrol numerous times each week, even though I don't close the window or log out of OAuth at all. I swear the only time I had to log in to CopyPatrol on the old system was when I rebooted. Is there a setting somewhere I can change to keep me logged in or is this the new normal? Thanks enL3X1 ¡‹delayed reaction›¡ 00:53, 28 April 2024 (UTC)

@L3X1 This was brought up in the feedback above. I have just deployed a change that I hope will help. Please let me know if it does (same for @Diannaa and everyone else :). The larger issue is rather a mystery, I'm afraid. I hope to investigate it more soon. You can follow phab:T224382 for updates. Best, MusikAnimal (WMF) (talk) 05:14, 28 April 2024 (UTC)

Deborah Morris and John Franklin

I saw this in my watchlist...

Potential copyright violation log b 22:10 CopyPatrolBot talk contribs marked revision 1221979003 on Deborah Morris and John Franklin as a potential copyright violation ‎ Tag: PageTriage

but the only thing that I am finding is the duplication of a long title in the Bibliography: The Morris family of Philadelphia, descendants of Anthony Morris, born 1654-1721 died. It seems that I can post a comment somewhere related to this... but I forgot where. Where can I provide a comment on this? Thanks so much! CaroleHenson (talk) 05:29, 3 May 2024 (UTC)

I found the log here: https://copypatrol.wmcloud.org/en. It identifies a source I did not use, but it has content that is in the article in a quote from another source: "https://archive.org/details/havilandgenealog00fros/page/210/mode/1up?q=%22no+longer+able+to+hear%22". It's a quote - and the source is from 1914, so not a copyright violation.
The source that I used was an 1893 newspaper article: "Old-Time New-York Friends: Services of the "Plain People" in Revolutionary Days". The New York Times. November 11, 1893. p. 16. Retrieved May 2, 2024.
I couldn't figure out how to add a comment that it's not a copyright violation. CaroleHenson (talk) 05:51, 3 May 2024 (UTC)
@CaroleHenson When using quotes it's normally enough to add the reference. If you use public domain content outside of quotes, then you need to add the attribution template. I've marked the report as no action needed since you already added the reference. Nobody (talk) 08:15, 3 May 2024 (UTC)
Great, thanks! CaroleHenson (talk) 19:44, 3 May 2024 (UTC)