Wikipedia:Request a query

This is a page for requesting one-off database queries matching certain criteria. Users who are interested in and able to perform SQL queries on the projects can provide results from the Quarry website.

You may also be interested in the following:

  • If you are interested in writing SQL queries or helping out here, visit our tips page.
  • If you need to obtain a list of article titles that meet certain criteria, consider using PetScan (user manual) or the default search. PetScan can generate lists of articles in subcategories, articles that transclude a given template, etc.
  • If you need to make changes to a number of articles based on a particular query, you can post to the bot requests page, depending on how many changes are needed.
  • For long-term review and checking, database reports are available.

Quarry does not have access to page content, so queries which require checking wikitext cannot be answered with Quarry. However, someone may be able to assist by using Quarry in another way (e.g. checking the table of category links rather than the "Category:" text) or suggest an alternative tool.

Blocks and unblocks

Trying to produce a list of blocks for users with over 5,000 edits, with a column indicating [effectively] whether the block was lifted, expired, or is still active, and a column for the time between block and unblock [if applicable]. I can pull blocks and unblocks, but navigating durations and timing is more challenging. This is part of a larger project for which poking at the block log is just one component. — Rhododendrites talk \\ 21:44, 21 August 2023 (UTC)[reply]

I should have time to work on this either tomorrow or (less likely) later today, but if you want to keep poking at it in the meantime, I've taken a glance at what you've tried so far. I wouldn't attempt to parse block_params; instead, look for an unblock in logging with a timestamp later than the latest block. You can use ipblocks_ipindex to see if a user (not just an ip, despite the table name) is currently blocked, which will let you tell if a block without a corresponding unblock has expired. —Cryptic 22:02, 21 August 2023 (UTC)[reply]
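A minimal sketch of that last check, assuming the enwiki_p replica (the username here is a placeholder, not from the discussion):

USE enwiki_p;
SELECT ipb_timestamp,  -- when the current block was placed
       ipb_expiry      -- 'infinity' or an expiry timestamp
FROM ipblocks_ipindex
JOIN user ON user_id = ipb_user   -- restricts to registered users, despite the table name
WHERE user_name = 'Example';      -- placeholder username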
I'll also have to come back to it tomorrow. My understanding of the various tables (and of SQL) is fairly limited, so I'd appreciate any help you have time for! — Rhododendrites talk \\ 01:57, 22 August 2023 (UTC)[reply]
Is it important at all that it be done in a single query? I'm thinking it would be easier to post-process this. —Cryptic 20:10, 22 August 2023 (UTC)[reply]
@Cryptic: Not necessarily, no. Certainly "column for time between block and unblock" and a conditional [block date + duration < current date]" is something I can add afterwards. Where it'd be difficult for me (not knowing Python, etc.) is being able to take the full list of blocks and unblocks and automatically pair them up, where applicable, if that makes sense. Like this unblock goes with that block. — Rhododendrites talk \\ 20:50, 22 August 2023 (UTC)[reply]
Well, that much is easy to do (slowly) - just sort first by the blocked username, then by timestamp, and don't limit log_action to 'block'. Might run into trouble with overlapping blocks, partial blocks, or the like, and modified blocks might be an issue, but I expect the overwhelming majority of blocks will have either exactly one unblock or none. quarry:query/76016 is the simplest thing that could possibly work. I'll check in on it later to see whether it completes or times out. —Cryptic 21:11, 22 August 2023 (UTC)[reply]
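The shape of that approach looks roughly like the following (a sketch reconstructed from the description, not the exact text of quarry:query/76016):

USE enwiki_p;
SELECT log_timestamp, log_action,
       user_name, user_editcount
FROM logging_logindex
JOIN user ON user_name = REPLACE(log_title, '_', ' ')  -- log_title stores underscores
WHERE log_type = 'block'          -- covers block, reblock, and unblock entries alike
  AND log_namespace = 2           -- User: pages
  AND user_editcount > 5000
ORDER BY user_name, log_timestamp;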
Ha - it did while I was typing that out. Running it again so it sorts usernames-with-underscores and usernames-without consistently; it'll be back up in under five minutes. —Cryptic 21:13, 22 August 2023 (UTC)[reply]
What do you want your final output to look like, say, for the first four users at that query ($1LENCE D00600D, $uperFan32, (CA)Giacobbe, and *drew) - they should be fairly representative? Do you still want columns for the blocking admin, block reason, and so on? —Cryptic 21:37, 22 August 2023 (UTC)[reply]
This is great, thanks! I'll fork it and poke over the next couple days. Would you indulge me by saying in human terms what this is doing:
UNION
SELECT '99999999999999', NULL, 'currently blocked', NULL, user_name, user_editcount, NULL, NULL
FROM ipblocks_ipindex
JOIN user ON user_id = ipb_user
Rhododendrites talk \\ 18:36, 23 August 2023 (UTC)[reply]
UNION combines the output of two selects. The later ORDER BY has lower precedence, so it'll sort the combined output.
The second SELECT finds all rows in the ipblocks_ipindex table (currently-active blocks - Special:BlockList is the onwiki interface). JOINing user limits the selected rows to ones with a corresponding row in the user table, i.e. it eliminates autoblocks and blocks of IPs. It also lets us limit the output to users with 5000 or more edits in the WHERE clause.
UNION requires all of the select clauses it's combining to output the same number of columns, so I used NULL where there either wasn't data to match the first query (such as log_id) or where, even when there's an equivalent in ipblocks_ipindex, it would just repeat data from the first query (such as the blocker: actor_name in the first and ipb_by_actor in the second). user_name, on the other hand, is still included since we sort on that column later. user_editcount could've been nulled as redundant, too, I suppose, but we already had it from joining user, while we'd have to join the actor table to get a name for the blocking admin.
'99999999999999' is always output in the timestamp column so that it sorts after any real timestamp (whether sorting numerically or as a string), and 'currently blocked' was picked for clarity in the log_action column. Selecting a constant value like that instead of a table column name just outputs the constant in that column for every row.
I'm still willing to do the postprocessing if you let me know precisely which data you need. Getting the data is the easy part; dealing with all the different formats that the logging table has used over the years to indicate duration - if you indeed want the initial duration of modified or expired or still-active blocks - is what's actually hard here. —Cryptic 19:40, 23 August 2023 (UTC)[reply]
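Putting the pieces just described together, the combined query has roughly this shape (again a sketch; the column list in the actual quarry:query/76016 may differ):

USE enwiki_p;
SELECT log_timestamp, log_id, log_action, comment_text,
       user_name, user_editcount, actor_name, log_params
FROM logging_logindex
JOIN user ON user_name = REPLACE(log_title, '_', ' ')
JOIN comment_logging ON comment_id = log_comment_id
JOIN actor_logging ON actor_id = log_actor  -- the blocking/unblocking admin
WHERE log_type = 'block'
  AND log_namespace = 2
  AND user_editcount >= 5000
UNION
SELECT '99999999999999', NULL, 'currently blocked', NULL,
       user_name, user_editcount, NULL, NULL
FROM ipblocks_ipindex
JOIN user ON user_id = ipb_user
WHERE user_editcount >= 5000
ORDER BY user_name, log_timestamp;  -- the sentinel timestamp sorts last for each user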
This is really helpful, Cryptic, thanks. OK, so the thing I'm really trying to do at the moment is find blocks for reasons related to attacks, harassment, civility, etc., then compare unblock percentages for various edit count ranges (e.g. <10, 10-99, 100-999, 1000-9999, and 10000+). I know there's a big limitation in that we don't store the edit count at the time of the block, granted, but that comparison is the most important part. The "nice to have"s would be the mean/median starting block duration and the mean/median actual block duration (if unblocked), but I appreciate that's harder.
I think I may be able to work with the query you've put together here to get what I'll need. I'll have to think about the kinds of conditional statements to make it work. I started to try to adapt it here, though the results I'm seeing aren't what I expect ("currently blocked" lines without preceding block entries, and missing comment text when I've searched for comment text). Adding: Oops. Accidentally clicked submit again after it took a while the first time (in case you click the query above shortly after I write this). — Rhododendrites talk \\ 02:28, 26 August 2023 (UTC)[reply]
The reason you're seeing "currently blocked" lines without corresponding blocked lines is that, in the first select, you filter out block log entries that don't mention certain terms; the second select shows all currently-blocked users in the given edit count range. (Be aware you're also filtering out the unblocks that don't match those terms.)
There's a field in ipblocks (ipb_reason_id) that works like log_comment_id, so you could, conceivably, filter both halves of the query. Ultimately, though, I don't think that would work very well, since - besides missing the unblocks - for users that were reblocked, it'll be the comment of the last active block, so you'll get cases both where you see the currently-active block and the reblock, but not the original block, and vice-versa. And you're still not going to get the unblock lines - the block log entries for unblocks hardly ever quote the original block. This would only be a little better than what you have now.
What you're really interested in filtering by isn't the comment field of the unblock or reblock entry in the block log; it's the comment field of the original block. There's probably a more elegant way to do that than quarry:query/76110, but this way works. What it does is add a column, original_reason, that shows comment_text when log_action = 'block' (that is, an "Admin blocked Vandal for 3 weeks" log entry, not a reblock or unblock or one of the 'currently blocked' fake entries added from the ipblocks table); otherwise, whatever was in the previous row's original_reason column. (It also shows comment_text instead of the previous row's value if the blocked user's name isn't the same as in the previous row, as a failsafe.)
I did that query for 5000+-edit-users instead of 1-100-edit-users as in your query as I wasn't interested in waiting half an hour for it to complete. I also filtered by regex instead of that list of LIKE clauses, which A) is easier, if you happen to already be familiar with regular expressions, and B) lets us make the comparison case-insensitive, so you'd see a block log entry of "Harassing [[User:Innocent]]", which LIKE '%harass%' would omit. —Cryptic 04:29, 26 August 2023 (UTC)[reply]
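The difference in filtering style looks like this (the search terms below are illustrative; quarry:query/76110 has the real list):

USE enwiki_p;
SELECT log_timestamp, log_title, comment_text
FROM logging_logindex
JOIN comment_logging ON comment_id = log_comment_id
WHERE log_type = 'block'
  AND log_action = 'block'   -- original blocks only, not reblocks or unblocks
  -- instead of: comment_text LIKE '%harass%' OR comment_text LIKE '%attack%' OR ...
  AND comment_text RLIKE '(?i)harass|attack|civil'  -- (?i) makes the match case-insensitive
LIMIT 100;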
@Cryptic: Queried, combined, and poking away now, and noticed something. Look at your query you link above and scroll to A3RO. Note an indef, never unblocked, not expired, and no "currently blocked" line. Do you know what would be causing that? — Rhododendrites talk \\ 02:14, 29 August 2023 (UTC)[reply]
It's because, when I take the value for original_reason from the previous row instead of the current one, I'm taking that row's comment_text column instead of its original_reason column. And it's nontrivial to take the original_reason column instead. Crapola. I'll try to think of a different way to do it. Selecting all the rows in the query and then filtering by the block reason after post-processing would work for user_editcount >= 5000, but it won't for user_editcount BETWEEN 1 AND 100 - there's just too many results that would get thrown away in the middle to be practical. Probably won't be tonight. —Cryptic 03:00, 29 August 2023 (UTC)[reply]
quarry:query/76110 updated. It no longer has the safety check against the previous line's log_title like before, but it seems to work. Running up hard against the limits of my experience here. —Cryptic 04:05, 29 August 2023 (UTC)[reply]

Global file links breakdown for category

I found this Quarry query which finds per-wiki file usage for a given image: https://quarry.wmcloud.org/query/62636

I was wondering if it is possible to do a query like this, but with cumulative results for a whole Commons category of images (recursively for subcategories, if possible) rather than a single image. I was going to try to use the categorylinks table for this, but have not gotten very far yet. I am mainly interested in Category:Images from the National Archives and Records Administration for now. Thanks! Dominic·t 20:45, 13 September 2023 (UTC)[reply]

There are currently 467,285 files directly in that category, and 1,549,115 in the entire tree. Even if they average global usage on a single wiki each (compared to the 95 that the Messier 87 image has), Quarry can't handle that many results. I can write the query for you, but you're going to have to arrange some other way to run it and get the results, especially if I'm to take that "for now" at face value. —Cryptic 21:40, 13 September 2023 (UTC)[reply]
@Cryptic: Thanks, I do appreciate that concern! If you have a good idea about how to do this for a hypothetical category, I can plug in smaller ones or play around with the query on my own, as well. (The "for now" was just meaning, if it's feasible, I could query other categories as well.) Dominic·t 22:28, 13 September 2023 (UTC)[reply]
quarry:query/76609 has results for the tree anchored at c:Category:Wormholes.
It's also going to be slow for files with a lot of usage. One of my tests was with c:Category:High-resolution or SVG official Wikimedia logos, since the smallish subcats I picked out in the National Archives tree kept coming up with zero global usage; it turns out there's a separate row in globalimagelinks for every usage of every file on every wiki. Which, really, there has to be, but it hadn't sunk in that files like c:File:Commons-logo.svg have a couple hundred thousand uses in the mainspaces of each of dozens of wikis. It ran for half an hour before I gave up. You're likely to get similar results when running it for categories with large numbers of files, even if they're not in widespread use, or any use at all. —Cryptic 22:43, 13 September 2023 (UTC)[reply]
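For a single category without recursion, the basic join looks something like the following (a sketch run against commonswiki_p; the category name is an example, and walking subcategories would need repeated self-joins on categorylinks or post-processing):

USE commonswiki_p;
SELECT gil_wiki, COUNT(*) AS uses
FROM categorylinks
JOIN page ON page_id = cl_from
         AND page_namespace = 6           -- File:
JOIN globalimagelinks ON gil_to = page_title
WHERE cl_to = 'Wormholes'                 -- category title, with underscores, no "Category:" prefix
GROUP BY gil_wiki
ORDER BY uses DESC;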

Fixing bare section links

I'm trying to use a regex query to help me do an AWB fixing instances of improperly formatted section links in see also sections like this. I came up with the query insource:/\*\s?\[\[[\w\s]*#[\w\s]*\]\]/, which seems in testing to properly find the strings, but it's timing out when I try to use it. Any idea how I can simplify the query enough to get it to run? {{u|Sdkb}}talk 01:03, 21 September 2023 (UTC)[reply]

That's not a query, it's a search string. You're asking in the wrong place, though I'm not aware what the right place would be. To start with, you're using entirely the wrong variant of regexes: neither \w nor \s is a supported character class there, and # is a metacharacter that needs to be escaped. Help:Searching/Regex might be useful. —Cryptic 01:17, 21 September 2023 (UTC)[reply]
Any search for just insource:/whatever/ is likely to time out, because it has to search every page in the selected namespace(s), usually several million. Search performs optimally (i.e. completes at all) with at least one "normal" search term, optionally limited further by insource:, e.g. Apple insource:/banana/. Certes (talk) 11:18, 21 September 2023 (UTC)[reply]
@Cryptic @Certes, thanks both! I didn't realize that the search flavor of RegEx was so different from the AWB flavor, which is what I was using. I got the search query to work using insource:/\* ?\[\[[A-Za-z0-9 ]+\#[A-Za-z0-9 ]+\]\]/ and am replacing \*( ?)\[\[([A-Za-z0-9 ]+)\#([A-Za-z0-9 ]+)\]\] with *$1{{section link|$2|$3}}. Cheers, {{u|Sdkb}}talk 18:16, 21 September 2023 (UTC)[reply]
Yes, it is frustrating that the insource regex language is different from the ones used by AWB, AutoEd, etc. Here's a variant search that also times out but that should keep you busy for a while. It does not search for all possible characters that can appear in titles – see WP:TITLESPECIALCHARACTERS, which might help you construct a "any character but not these characters" regex – but it's a start. – Jonesey95 (talk) 18:20, 21 September 2023 (UTC)[reply]
There are a number of especially frustrating things here.
  • The syntax looks like PCRE or POSIX regexes, unlike, say, the flavor used in lua (which is also different, but very visibly so), so it misleads people into googling "how can I make a regex to do X" and thinking it'll work; worse, frequently it does get just enough results to look like it worked, even though the syntax differences means that what was searched for wasn't what was meant to be searched for.
  • The underlying database engine natively supports PCRE - the de-facto standard variant that everyone expects - but we can't use it in queries because page text isn't copied to the replicas we have access to.
  • The combination of not implementing ^ or $, having no syntax that matches just a newline, and including newlines in . and complemented character classes makes it verbose and error-prone to express a match that has to occur all on the same line, or one that starts at the beginning of a line - and this particular search really needs to do both.
/^\*.*\[\[[^]|]*#/m would do what you want, if we could use a sane regex engine; it would give bad output for unclosed wikilinks, but those are unlikely to survive long in articlespace anyway. Here, the closest equivalent, I think, is /[^ -􏿽]\*[ -􏿽]*\[\[[ -\\^-{}-􏿽]*\#/, which has the additional drawbacks of not matching at the very start of a page (which won't matter) and treating tabs like they're newlines (which might). —Cryptic 20:10, 21 September 2023 (UTC)[reply]
Well done. That sounds better, though titles and anchors can contain characters other than alphanumerics and spaces. If I need a one-off search which times out, I follow it with prefix:A then repeat with prefix:B etc., though not all titles are alphabetic and it might be antisocial of me to do that too often. Certes (talk) 18:23, 21 September 2023 (UTC)[reply]
@Sdkb: Taking a step back, how necessary is this task? I don't see any guideline or even essay that prefers {{section link}} to [[Foo#bar]]. As a section symbol, § may be more familiar than # to some readers (though not me personally). However, it might make certain searches harder. For example, I occasionally trawl for links to [[C#something]], which link to sections about letter C but were usually meant for C♯ (musical note) or C Sharp (programming language). I never thought to check whether any have been converted to section links. Certes (talk) 20:01, 21 September 2023 (UTC)[reply]
The MOS doesn't seem to outright forbid visible hash marks, but it does condemn them as "unsightly". —Cryptic 20:25, 21 September 2023 (UTC)[reply]
Yeah, my understanding from that section (and the example of general practice at refined pages) is that § is the preferred form for display to readers (and better display for readers always takes precedence over easier searching for editors). {{u|Sdkb}}talk 21:11, 21 September 2023 (UTC)[reply]
Fair enough then (though that section also explicitly recommends using a link and doesn't mention the section link template). We should also consider linking to the section via a redirect, in case the text moves into a stand-alone article or elsewhere, but that's obviously not a task that can be automated. Certes (talk) 22:40, 21 September 2023 (UTC)[reply]
Yeah, what I noticed during the run (about 600 pages) was lots of questionable entries in see also sections (including some stuff that should be navboxes, some stuff that should be links within the article, and some stuff that just should not have been linked). The sample of pages that showed up in the query is probably lower-quality on average than a random sample of see also pages as a whole would be. Still, there's definitely lots of cleanup to do in the area. Doing a systemic review/cleanup is way beyond the scope of what I want to get into, though.
There were also occasional instances where the query didn't pick up all the section links in a see also section (e.g. here), so a smarter query could potentially pick those up. And there were also a few erroneous entries where someone tried to redlink a title that includes a hashtag, not realizing that such a title cannot exist due to technical restrictions (which I skipped or, if I saved before catching the error, reverted, as here). {{u|Sdkb}}talk 23:58, 21 September 2023 (UTC)[reply]
I suspect your "from" AWB regex begins with something other than "\[", meaning that it can include part of any previous adjacent link and prevent that from being amended if it's also to a section. So if it's "(.*)\[\[whatever" → "$1{{section link|whatever" then, as well as being inefficient, the initial .* will grab any previous section link. Certes (talk) 10:44, 22 September 2023 (UTC)[reply]

Number of articles under protection

Hi. I'm looking into historical trends in the number of articles under each type of protection. Could anyone please point me to or create a query where for a given date, it says "On this date, x articles were under semi-protection, y articles were under full protection, and z articles were under extended-confirmed protection"?

Note that this is not the number of articles for which protection was *applied* on the date; it's the number of articles for which protection was in effect. Thanks in advance and cheers, Clayoquot (talk | contribs) 17:12, 4 October 2023 (UTC)[reply]

This is prohibitively difficult to extract from the live database - we'd have to parse not only the protection logs, but deletion and move as well, support all the data formats that have ever been stored in all three of them (which has always been for the convenience of php-based mediawiki, not sql datamining; getting durations is a particular mess), and trace every page that's been protected through moves until either the protection expires, the protection changes, or the page is deleted. That's difficult enough when you're doing it manually online for just one page.
It's more feasible to pull it from the dumps, though, particularly if you don't need page namespaces or titles. You'd just need the page_restrictions file for that - it's always been relatively small, and parsing it is simple enough that you wouldn't even need to install a sql engine. You'll be limited to what dates you can get dumps with that file for, of course.
If you do need articles in a more strict sense of the word (namespace 0, not a redirect), you're going to need the page file too; that's about 2 gigabytes compressed in the latest dump, and while still possible to parse and match up with the page_restrictions table without a real sql engine, it's a lot less trivial. (If you insist on "articles" in the even stricter sense that {{NUMBEROFARTICLES}} uses - namespace 0, not a redirect, has at least one link - you need the pagelinks file too, and definitely a sql engine, and I daresay nobody's going to lift a finger to help you.) —Cryptic 01:14, 5 October 2023 (UTC)[reply]
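For comparison, the current snapshot (only) is straightforward to get from the replicas; it's the historical numbers that need the dumps. A sketch:

USE enwiki_p;
SELECT pr_level,            -- autoconfirmed / extendedconfirmed / sysop
       COUNT(*) AS articles
FROM page_restrictions
JOIN page ON page_id = pr_page
WHERE pr_type = 'edit'
  AND page_namespace = 0
  AND page_is_redirect = 0  -- "article" in the looser sense; ignores the has-a-link rule
GROUP BY pr_level;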
Thank you Cryptic for this thorough explanation! Your detailed answer is very much appreciated. Clearly this is above and beyond the call of duty for volunteers. It might be a good topic for academic research - perhaps Benjamin Mako Hill would know someone interested in the challenge. Cheers, Clayoquot (talk | contribs) 20:55, 5 October 2023 (UTC)[reply]
Don't forget that disambiguation pages are generally not regarded as articles, despite being non-redirects in namespace 0. However, it's rare that they need protection. Certes (talk) 21:06, 5 October 2023 (UTC)[reply]
As it happens, Mako wrote a paper about this very topic in 2015 together with Aaron Shaw, and they also published a dataset and code to recreate it: https://communitydata.cc/wiki-protection/ (link redirects to a current one that for some reason triggers the spam blacklist.) No guarantees that this code still works, of course. Regards, HaeB (talk) 04:40, 12 October 2023 (UTC)[reply]

external links -- el_to

This query used to work; now it does not. It would print output like:

http://www.bfi.org.uk/index.html 0 British_Film_Institute

i.e. the given URL is on the mainspace page British Film Institute

Is there a new/better way to list external links for a domain? -- GreenC 01:31, 12 October 2023 (UTC)[reply]

You now need to use el_to_domain_index and el_to_path. See phab:T312666 * Pppery * it has begun... 01:37, 12 October 2023 (UTC)[reply]

Got this working:

# adapted from https://quarry.wmcloud.org/query/77092
USE enwiki_p;
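# el_to_domain_index stores the scheme plus the host with its labels reversed
# (e.g. http://nl.kb.kranten. for kranten.kb.nl); el_to_path stores the rest of the URL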
SELECT page_title,
       page_namespace,
       el_to_domain_index,
       el_to_path
FROM externallinks
JOIN page ON page_id = el_from
WHERE el_to_domain_index LIKE 'https://nl.kb.%' OR el_to_domain_index LIKE 'http://nl.kb.%';

Which produces:

enwiki Huwen_op_Bevel 0 http://nl.kb.kranten. /view/article/id/ddd:010229261:mpeg21:p005:a0112

Is there a way to combine the last two columns into a single URL? ie. http://kranten.kb.nl/view/article/id/ddd:010229261:mpeg21:p005:a0112

-- GreenC 04:00, 12 October 2023 (UTC)[reply]
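One possible approach within SQL is to un-reverse the domain index with nested SUBSTRING_INDEX calls. This sketch assumes hosts with exactly three labels (like nl.kb.kranten); anything more general is probably easier in post-processing:

USE enwiki_p;
SELECT CONCAT(scheme, '://',
              SUBSTRING_INDEX(SUBSTRING_INDEX(host, '.', 3), '.', -1), '.',  -- kranten
              SUBSTRING_INDEX(SUBSTRING_INDEX(host, '.', 2), '.', -1), '.',  -- kb
              SUBSTRING_INDEX(host, '.', 1),                                 -- nl
              el_to_path) AS url
FROM (
    SELECT SUBSTRING_INDEX(el_to_domain_index, '://', 1) AS scheme,
           TRIM(TRAILING '.' FROM SUBSTRING_INDEX(el_to_domain_index, '://', -1)) AS host,
           el_to_path
    FROM externallinks
    WHERE el_to_domain_index LIKE 'http://nl.kb.%'
       OR el_to_domain_index LIKE 'https://nl.kb.%'
) AS t;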