User:Dcljr/Article counts

From Meta, a Wikimedia project coordination wiki
This is an archived version of this page, as edited by Dcljr (talk | contribs) at 07:21, 26 June 2012 (→‎Changes to article counts in other projects: + rest of n:, various notes; sep. sections; other minor). It may differ significantly from the current version.

Original information

On May 10, 2012, a bug report requesting that the "updateArticleCount.php" maintenance script be run on all Wiktionaries and Wikisources was acted upon, resulting in 60 of those wikis surpassing or falling below one or more of the article-count milestones tracked at Wikimedia News. Some of the changes were quite large and therefore questionable.

The tables below show a great many statistics for the Wiktionaries and Wikisources that showed the largest percent changes in article counts (up or down) on May 10.

Wiktionaries

Note: This information needs to be completely rewritten to reflect my latest understanding of how article counting is done.

The table below can be sorted by any column (initial order of the rows is by "stats articles pct Δ"). You can also "hover" over any of the column headings in the table to get an explanation of that column (these mostly match the explanations in the key below, but the percents are explained in more detail). The wiki names are linked to the wikis themselves, while the dates are linked to subpages of this one containing the "raw statistics" upon which these numbers are based.

Table key:

  • wiki – Wiki name
  • date before / date after – Dates on which dumps were made before / after the May 10th run of the "updateArticleCount.php" script (all other "before" and "after" cells are colored using the same scheme). Dumps made on May 10th itself (e.g., Pashto) are included in the table as "before" or "after" only if their "stats articles" count (see next two items) was very close to that of a dump made on one side of May 10th and very different from that of the dump made on the other side.
  • stats pages – Total page count given in "site_stats.sql" dump, which matches the on-wiki count of "Pages" in Special:Statistics and the one given by {{NUMBEROFPAGES}} on the wiki itself
  • stats articles – Article count given in "site_stats.sql" dump, which matches the on-wiki count of "Content pages" in Special:Statistics and the one given by {{NUMBEROFARTICLES}} on the wiki itself
  • (%) – Unless otherwise explained below, the percents are based on the numbers in the previous two columns (e.g., the first one is the percent of "stats pages" that were "stats articles")
  • stats articles Δ – Change in "stats articles" count from before to after May 10th
  • stats articles pct Δ – Percent change in "stats articles" count from before to after May 10th.
  • dumped pages – Total number of pages seen in "page.sql" dump
  • ns0 pages – Number of pages in "page.sql" marked as being in the main namespace (which, BTW, matches the number of titles seen in the "all-titles-in-ns0" dump)
  • ns0 non-redirs – Number of main-namespace pages in "page.sql" marked as not being redirects (in either that dump or "redirect.sql")
  • std. article count – Article (i.e., entry) count based on currently used criteria: "non-redirect with at least one [[wikilink]] of any type" (see Note 1 below); this and the other "article count" percents are all out of the "ns0 non-redirs", since all the wikis listed here consider only the main namespace as "content"
  • % stats off – Percent difference between "stats articles" count and "std. article count" (as percent of "std. article count")
  • conserv. article count – Article count based on more conservative criteria: "non-redirect linked to another page on the same wiki or placed in a category" (percent is out of "ns0 non-redirs")
  • liberal article count – Article count based on more liberal criteria: "non-redirect containing at least one of the following: link to another page on the same wiki, image/file link, category link, interlanguage link, interwiki link, or template call" [see note below for further explanation/context] (percent is out of "ns0 non-redirs")
  • altern. article count – Article count based on Yet Another set of criteria: "non-redirect linked to another page on same wiki, placed in a category, or containing an image/file or a template call" (percent is out of "ns0 non-redirs")

Note 1: The "wikilink" in the current method of counting articles can apparently be anything starting with the string "[[", including a regular [[link]] to another page on the same wiki, a [[Category:]] link, an [[Image:]] (or [[File:]]) link, an interlanguage link (e.g., [[de:]]), an interwiki link (e.g., [[species:]]), or even a "hidden" link inside of an <!-- HTML comment --> or (perhaps?) one "deactivated" by <nowiki> tags. This analysis counts only "real" wikilinks (i.e., not "hidden" or "deactivated" links). To accomplish this, the dumps "pagelinks.sql", "categorylinks.sql", "imagelinks.sql", "langlinks.sql", "iwlinks.sql", and (only for the "liberal" and "alternate" counting methods) "templatelinks.sql" were also examined.

(20 more Wiktionaries to add to table)

Wikisources

Note: This information needs to be completely rewritten to reflect my latest understanding of how article counting is done.

As with the Wiktionaries table above, this table is initially sorted by the "stats articles pct Δ" column, and the dates are linked to the full statistics collected for each wiki. The explanation of the columns is mostly the same as for the table above, with certain additions noted below. Note that some stats included in the other table have been omitted here to limit the size of the table.

Table key: (differences from above)

  • dumped articles – Article count across all content namespaces (main [ns0] and, if appropriate, "author", "page" and "index"; see other items below) using the current "non-redirect in content namespace with at least one wikilink" criteria (see Note 1 above), based on several relevant dumps (percent is out of "dumped pages" count)
  • author ns – Number of the namespace containing "Author:" pages, but only if that namespace exists and counts as content
  • author ns pages – Number of pages in "page.sql" dump marked as being in the author namespace (this and similar stats that follow are coded as "0" if the namespace is missing or not counted as content, so table sorting is not broken)
  • author ns non-redirs – Number of author-namespace pages in "page.sql" marked as not being redirects (in either that dump or "redirect.sql")
  • author ns articles – Number of author-namespace pages in "page.sql" that qualify as articles based on current criteria (percent is out of "author ns non-redirs")
  • page ns – Number of the namespace containing "Page:" pages, but only if that namespace exists and counts as content
  • page ns pages – Number of pages in "page.sql" dump marked as being in the page namespace
  • page ns non-redirs – Number of page-namespace pages in "page.sql" marked as not being redirects (in either that dump or "redirect.sql")
  • page ns articles – Number of page-namespace pages in "page.sql" that qualify as articles based on current criteria (percent is out of "page ns non-redirs")
  • index ns – Number of the namespace containing "Index:" pages (in some languages translated as "book"), but only if that namespace exists and counts as content
  • index ns pages – Number of pages in "page.sql" dump marked as being in the index namespace
  • index ns non-redirs – Number of index-namespace pages in "page.sql" marked as not being redirects (in either that dump or "redirect.sql")
  • index ns articles – Number of index-namespace pages in "page.sql" that qualify as articles based on current criteria (percent is out of "index ns non-redirs")

(11 more Wikisources to add to table)

Verifying the counts

How do I know that my script is giving correct article counts? Well, anyone who can program sufficiently well can redo the calculations themselves, based on the descriptions below of how I did it. For everyone else, I provide some "really raw" output showing the final, processed results of the "page hash" constructed by my script for the 2012-05-16 dump of the Tsonga Wiktionary:

/tswiktionary-20120516-raw

Some spot checks of this output didn't reveal any problems, as far as I could tell.

Note: Except there was a problem... I wasn't using the right definition of a "good" article. This "raw" output page will soon be updated to reflect the correct article count.


Complete rewrite...

On May 10, 2012, a bug report requesting that the "updateArticleCount.php" maintenance script be run on all Wiktionaries and Wikisources was acted upon, resulting in 60 of those wikis surpassing or falling below one or more of the article-count milestones tracked at Wikimedia News. Some of the changes were quite large and therefore questionable.

A preliminary investigation revealed only one obvious pattern in the count changes: most Wiktionaries lost articles while most Wikisources gained. The gains can be explained by the fact that most Wikisources now count more namespaces as "content" than they used to; in addition to articles in the main namespace ("ns0"), many Wikisources now count qualifying pages in 1, 2, or 3 additional namespaces (more about this later). The losses were harder to explain.

Neither the gains nor losses seemed to be related to the writing system the wiki was using (e.g., Latin script vs. Brahmic scripts, etc.), whether it was an older wiki or newer one, bigger or smaller, and so forth. Most worryingly, it wasn't at all clear whether the new or old counts were "more correct". I (User:Dcljr) tried to estimate the "true" article counts based on random samples of pages at each wiki (or as close to random as could be reasonably achieved). Sometimes the resulting count was closer to the new one, sometimes closer to the old, and sometimes it was right in the middle between them. This (incomplete) preliminary information is collected at Talk:Wikimedia News#May 10 article count updates.

To collect more in-depth and "reliable" information, I wrote a Perl script to download and parse the database dumps needed to count the articles for a given wiki. Initially it seemed that the "updateArticleCount.php" script was consistently undercounting articles, but it turned out that I was using the wrong (or, more accurately, an out-of-date) definition of what counts as an article. Once I used the right definition, I started to get the same counts as those given by "updateArticleCount.php". (For more context, see bug 37291.)

But more about all that later. First, a summary of how article counts have been determined in the past, how they are determined now, and how the article counts actually changed when the "updateArticleCount.php" script was run on May 10, 2012.

How article counting used to be done

When wiki article counting first began, it was based on whether a page contained a comma or not. This worked fine for the English Wikipedia, but once other projects in other languages started up, people realized that this method would not work for all wikis. A very quick (one week!) discussion and vote was held here at Meta in March 2003, the details of which can be found at:

Based on the results of the vote, it was decided that a page would be counted as an article if it was:

a non-redirect in the main namespace (ns0), containing at least one [[wikilink]]

Unfortunately, the implementation of this definition left a little to be desired, and it ended up counting not only 5 different types of legitimate wikilinks (1–5 below), but two types of "false" wikilinks (6 and 7), and one type of non-wikilink (8):

  1. page links: e.g., [[Babel]] or [[Talk:Babel]], etc.
  2. category links: [[Category:Software]]
  3. image/file links: [[File:Yes.png]]
  4. interlanguage links: [[de:Wikipedia:Hauptseite]] or [[:de:Wikipedia:Hauptseite]]
  5. interwiki links: [[species:]]
  6. hidden links: <!-- [[don't look at me]] -->
  7. deactivated links: <nowiki>[[look at me]]</nowiki>
  8. any text containing the string "[[": wikilinks start with "[["...

(Note that links like [[:Category:Software]] and [[:File:Yes.png]], which start with an initial colon, are regular page links of type 1.)

In fact, number 8 describes exactly what was checked for to count a page as an article (assuming it wasn't a redirect and was in the main namespace)!

Eventually, this shortcoming led some wikis to routinely place "hidden" links (of type 6) on their main-namespace pages, just to get them counted as articles.

In June 2006, the $wgContentNamespaces configuration variable was introduced (in revision 14738) to enable namespaces other than the main one (ns0) to count as "content".

At this point, the de facto definition of an article was:

a non-redirect in a content namespace, containing the string "[["
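
The de facto rule can be sketched in a few lines of Python. This is only an illustration of the behaviour described above, not MediaWiki code: the function name `counts_as_article_old`, its parameters, and the default content-namespace set are all assumptions made for the example.

```python
def counts_as_article_old(wikitext, namespace, is_redirect,
                          content_namespaces=frozenset({0})):
    """Old de facto rule: a non-redirect in a content namespace
    whose raw source contains the string '[['."""
    if is_redirect or namespace not in content_namespaces:
        return False
    # Nothing is parsed here, so comments and <nowiki> are not special.
    return "[[" in wikitext


# One sample per kind of "link" discussed above (illustrative text only).
samples = {
    "page link": "See [[Babel]].",
    "category link": "[[Category:Software]]",
    "hidden link": "<!-- [[don't look at me]] -->",
    "deactivated link": "<nowiki>[[look at me]]</nowiki>",
    "bare string": 'Wikilinks start with "[["...',
    "no brackets": "Plain text, with a comma.",
}

results = {name: counts_as_article_old(text, 0, False)
           for name, text in samples.items()}
```

Under this sketch every sample except "no brackets" is counted, which is why a hidden link was enough to get a page counted.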

In November 2007, bug 11868 was submitted requesting that links provided by templates be counted, too. In the course of the ensuing discussion, it was pointed out that links other than page links (types 2, 3, etc.) were being counted, and that in fact three different counting methods (all of which started with "non-redirect in a content namespace") were being employed at different places in the code:

  • every time a page was saved, the "[["-string criterion was used to see whether the page would count as an article
  • when the "initStats.php" maintenance script was run, it just checked to see whether the pages were non-empty
  • when the "updateArticleCount.php" maintenance script was run, it checked whether the "pagelinks.sql" table actually contained page links originating from each page in question (type 1 only, but including type 1 links provided by templates)
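
To see how these three checks can disagree, here is a toy comparison in Python. It is a simplified model of the three code paths as described in the list above, not the actual MediaWiki logic; the page data, the `{{see also}}` template name, and the set `pages_with_page_links` are made up for illustration.

```python
# Toy pages, all assumed to be non-redirects in a content namespace.
pages = [
    {"id": 1, "text": "See [[Babel]]."},
    {"id": 2, "text": "[[Category:Software]]"},
    {"id": 3, "text": "{{see also}}"},
    {"id": 4, "text": "Just text."},
    {"id": 5, "text": ""},
]

# Hypothetical IDs of pages with at least one recorded page link.
# Page 3 is included on the assumption that its template expands to a link.
pages_with_page_links = {1, 3}


def counted_on_save(page):
    # Save-time check: the raw source contains "[[".
    return "[[" in page["text"]


def counted_by_init_stats(page):
    # initStats.php-style check: the page is non-empty.
    return page["text"] != ""


def counted_by_update_article_count(page):
    # updateArticleCount.php-style check: a page link is recorded.
    return page["id"] in pages_with_page_links


totals = {
    "on save": sum(counted_on_save(p) for p in pages),
    "initStats": sum(counted_by_init_stats(p) for p in pages),
    "updateArticleCount": sum(counted_by_update_article_count(p) for p in pages),
}
```

In this toy data the save-time and "updateArticleCount.php" checks happen to give the same total, but they count different pages (1 and 2 versus 1 and 3).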

In addition, when pages were imported into a wiki, the article count was not updated correctly (see bugs 2483, 5703, and 6600).

These inconsistencies allowed the on-wiki article counts (e.g., {{NUMBEROFARTICLES}}) to diverge from the "correct" count (however that was defined!) over time.

At some point, the "meat" of the "updateArticleCount.php" script was moved elsewhere.

How article counting is done now

In May 2011, a developer finally acted to "rationalize" the way articles were counted, and in revision 88113 introduced the $wgArticleCountMethod configuration variable to specify which (non-empty) content-namespace non-redirects would count as articles: all such pages ("any"), only those containing a true page link ("link"), or only those containing a comma ("comma"). Article.php and SiteStats.php were modified to reflect this change.

So now, assuming $wgArticleCountMethod is set to "link" for a wiki (which it is for all but the English and Portuguese Wikibooks), a page counts as an article (presumably at all places in the MediaWiki code) if it is:

a non-redirect in a content namespace, containing (after parsing) at least one true [[wikilink]] to another page on the same wiki

Note how different this definition is from the one actually in effect before the change was made! Unfortunately, the extreme nature of the change wasn't apparent to most people until the article counts were recalculated on May 10, 2012.

Because of the "after parsing" part of the new definition, one can no longer tell whether a page will count as an article simply by examining its page source; if the page contains templates, it must be fully parsed first in order for any links created by those templates to be accounted for. Fortunately, this is done when pages are saved, so as long as the "pagelinks.sql" database is maintained correctly, the article count should no longer get "out of sync" as it did in the past.
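
As a rough illustration of the three $wgArticleCountMethod settings, the Python sketch below models a page as a small dictionary. The field names (`is_redirect`, `in_content_namespace`, `parsed_page_links`) and the `{{see also}}` template are assumptions made for this example; `parsed_page_links` stands for the number of page links recorded after the page has been parsed, templates included. It is not the MediaWiki implementation.

```python
def is_article(page, method="link"):
    """Model of the 'any' / 'link' / 'comma' counting methods."""
    # Common precondition: non-empty non-redirect in a content namespace.
    if page["is_redirect"] or not page["in_content_namespace"] or not page["text"]:
        return False
    if method == "any":
        return True
    if method == "link":
        # Links are counted after parsing, so template-provided links qualify.
        return page["parsed_page_links"] > 0
    if method == "comma":
        return "," in page["text"]
    raise ValueError(f"unknown article count method: {method!r}")


category_only = {
    "text": "[[Category:Software]]",
    "is_redirect": False,
    "in_content_namespace": True,
    "parsed_page_links": 0,  # a category link is not a page link
}
template_only = {
    "text": "{{see also}}",
    "is_redirect": False,
    "in_content_namespace": True,
    "parsed_page_links": 1,  # assumed: the template expands to one page link
}
```

Under the "link" method `category_only` is not an article even though its source contains "[[", while `template_only` is one even though its source contains no "[[" at all.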

Changes to article counts on May 10, 2012

Apart from isolated requests here and there (for example, bug 34184), the article counts of the various Wikimedia content wikis have not been updated to reflect all of these changes in how articles have been counted over time. The May 10 running of "updateArticleCount.php" on all the Wiktionaries and Wikisources was the first concerted effort to "fix" the article counts across an entire project. The changes in article counts seen on that day for these two projects are shown in the tables below.

Key for both tables:

  • wiki name – linked to the Main Page of the wiki
  • articles before / articles after – on-wiki article count at c. 00:30 UTC on 2012-05-10 and c. 00:30 UTC on 2012-05-11, respectively (collected via API request, equivalent to {{NUMBEROFARTICLES}} and the count seen at Special:Statistics on the given wiki)
  • change – after minus before
  • pct change – relative change in article count, as a percentage of the "before" count
  • level before / level after – which milestone level (tracked at Wikimedia News) the wiki would be at based on the article count
  • level change – whether there was a change in milestone level
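
The derived columns in this key can be computed as in the following Python sketch. The milestone thresholds in `MILESTONES` are placeholders chosen for illustration (the actual levels tracked at Wikimedia News are not listed on this page), and the sample counts are made up.

```python
from bisect import bisect_right

# Placeholder thresholds, NOT the real Wikimedia News milestone list.
MILESTONES = (1, 10, 100, 1_000, 10_000, 100_000)


def milestone_level(count):
    """Highest milestone that `count` has reached, or 0 if none."""
    index = bisect_right(MILESTONES, count)
    return MILESTONES[index - 1] if index else 0


def summarize(before, after):
    """Build one table row from the before/after article counts."""
    change = after - before
    # Percent change is relative to the "before" count.
    pct_change = 100.0 * change / before if before else None
    level_before = milestone_level(before)
    level_after = milestone_level(after)
    if level_after > level_before:
        level_change = "up"
    elif level_after < level_before:
        level_change = "down"
    else:
        level_change = "none"
    return {
        "change": change,
        "pct change": pct_change,
        "level before": level_before,
        "level after": level_after,
        "level change": level_change,
    }


# Made-up example: a wiki dropping from 1200 to 950 articles.
row = summarize(1200, 950)
```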

Note that the tables are initially shown "collapsed" (to expand one, select the "[show]" link) and are sorted by the "level after" column, then "level before", then (unfortunately) alphabetically by language code. To sort by a different column, click on the "up-down" arrows next to the column heading. For help with sorting on a "secondary sort key", see Help:Sorting#Secondary sortkey.

Note: 8 Wiktionaries rose up to new milestone levels and 24 fell to lower milestone levels.

Note: 15 Wikisources rose up to new milestone levels and 13 fell to lower milestone levels.

Changes to article counts in other projects

Eventually the article counts will need to be updated on the other Wikimedia wikis. The tables below show the changes that would have occurred if the "updateArticleCount.php" script were run on each of the other "content wikis" on the day that wiki's database was most recently dumped (as of the time the tables were filled in). The columns are as in the previous section, except for the "date dumped" column, which should be self-explanatory. Unlike the tables above, initial sorting is by "articles before" in reverse numerical order.

Wikipedias

Note: Information to come...
Changes to Wikipedia article counts if they were updated on the indicated dates
wiki name | date dumped | articles before | articles after | change | pct change | level before | level after | level change

Note: The English Wikipedia is too large to include in this analysis.

Wikibooks

Note: Information to come...
Changes to Wikibooks article counts if they were updated on the indicated dates
wiki name | date dumped | articles before | articles after | change | pct change | level before | level after | level change

Note: The English Wikibooks and Portuguese Wikibooks use the "comma" article counting method, and so are not included in this analysis.

Wikiquote

Note: Information to come...
Changes to Wikiquote article counts if they were updated on the indicated dates
wiki name | date dumped | articles before | articles after | change | pct change | level before | level after | level change

Wikinews

Note: The Alemannic Wikinews and Low German Wikinews exist as separate namespaces within their respective language Wikipedias and so are not included in this analysis.

Wikiversity

Other possible article counting criteria

Clearly there are big differences between the old and new definitions of what constitutes an article. While the new definition may be closer to the original intent of the "Article count reform" voters (although even this is not entirely clear), people have gotten used to the old way of doing things and might be disturbed by large changes in article counts. In particular, some might consider it a "bug" in the new method that, say, category links are no longer considered.

For this reason, it might be time to think about what other criteria could be used to count articles.

Below are some tables containing statistics for the Wiktionaries and Wikisources that showed the largest percent changes in article counts (up or down) on May 10. As alluded to earlier, these statistics were generated by a Perl script I (dcljr) wrote to download and parse relevant database dumps. For convenience, I repeat here the list of different types of links (now including "template links") that have been used, or could be used, to count articles, along with the associated SQL databases that currently track such links (note that "page.sql" contains the page IDs that each of these other databases refer to).

link type | examples | database
page (on same wiki) | [[Babel]], [[Talk:Babel]], [[:Category:Software]], [[:Image:Cat.jpg]], [[:File:Cat.jpg]] | pagelinks.sql
category | [[Category:Software]] | categorylinks.sql
image/file | [[Image:Cat.jpg]], [[File:Cat.jpg]] | imagelinks.sql
interlanguage | [[de:Wikipedia:Hauptseite]], [[:de:Wikipedia:Hauptseite]] | langlinks.sql
interwiki | [[species:]], [[wookieepedia:]] | iwlinks.sql
template | {{fact}}, {{fact|date=June 2012}} | templatelinks.sql
hidden× | <!-- [[don't look at me]] --> | (none)
deactivated× | <nowiki>[[look at me]]</nowiki> | (none)
any text containing "[["× | Wikilinks start with two open-brackets ([[). | (none)
Note ×: Not a real wikilink, so not contained in any "links" database.

Note that a "template" link doesn't mean a wikilink provided by a template; it simply refers to any {{template call}}, regardless of whether the template provides any wikilinks (or, indeed, any content at all, since the template may not actually exist).

Now for the various definitions of what might constitute an article (all of which should be understood to begin with the phrase "non-redirect in a content namespace, containing..."):

  • "new" definition: "at least one page link"
  • "conservative" definition: "at least one page or category link" (note: despite the name, this will count more pages as articles than the "new" definition)
  • "standard" definition: "at least one page, category, image/file, interlanguage, or interwiki link" (this is meant to stand in for the "old" way of counting articles, but as explained above it still differs: like all of these dump-based counting methods, it counts wikilinks provided via template calls, which the old method couldn't do, and it doesn't count hidden or deactivated links or text that contains "[[" without forming a link)
  • "liberal" definition: "at least one page, category, image/file, interlanguage, or interwiki link, or any template call" (regardless of whether the template provides any links; this counts the most pages as articles)
  • "alternate" definition: "at least one page, category, or image/file link, or any template call" (regardless of whether the template provides any links; the idea behind this definition is that all of these could be links to things on the same wiki, although in practice most images are hosted remotely at Commons)

If someone wants to suggest yet another definition, I can modify my Perl script to use it (as long as it uses some combination of the database-tracked link types listed above).
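
For anyone who wants to experiment, the five definitions above reduce to sets of accepted link types. The sketch below is written in Python rather than Perl and is not my actual script; the names `DEFINITIONS` and `count_articles`, the link-type labels, and the toy data are illustrative assumptions. It takes for granted that the dumps have already been reduced to a mapping from each non-redirect content page to the set of link types found on it.

```python
# Each definition is the set of link types that qualify a page.
DEFINITIONS = {
    "new": {"page"},
    "conservative": {"page", "category"},
    "standard": {"page", "category", "file", "interlanguage", "interwiki"},
    "liberal": {"page", "category", "file", "interlanguage", "interwiki",
                "template"},
    "alternate": {"page", "category", "file", "template"},
}


def count_articles(link_types_by_page, definition):
    """Count pages having at least one link type the definition accepts."""
    accepted = DEFINITIONS[definition]
    return sum(1 for types in link_types_by_page.values() if types & accepted)


# Toy data: page ID -> link types found for that page (made up).
link_types_by_page = {
    1: {"page"},
    2: {"category"},
    3: {"interlanguage"},
    4: {"template"},
    5: set(),
}

counts = {name: count_articles(link_types_by_page, name)
          for name in DEFINITIONS}
```

A further definition is then just one more entry in `DEFINITIONS`, as long as it is built from the database-tracked link types.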