Multilingual MediaWiki

From Meta, a Wikimedia project coordination wiki
This is an archived version of this page, as edited by Eloquence (talk | contribs) at 03:19, 17 January 2006 (first draft). It may differ significantly from the current version.

These are development specifications, not documentation. This feature does not exist yet.

Rationale

Support for multiple languages in MediaWiki is a component milestone of Wikidata and WiktionaryZ development. From the standpoint of Wikidata, this is needed because MediaWiki page titles play an integral role in the Wikidata model: they can be used as keys to access resources in a set of tables, and a history of transactions related to these tables. Because page titles currently have no internal awareness of the language they represent, it is not possible to have content under the same titles in different languages without resorting to hacks (such as appending the language to the title string).

From the standpoint of a regular MediaWiki user, the current situation is that the only way to run a multilingual site is to create separate databases for each language. MediaWiki does not provide any facilities to do so. Furthermore, the administrator also has to:

  • configure the entry points for the different wikis on the web server, and set them up to use the same code base (or use completely separate installations)
  • configure the wikis to use the same account database
  • set up a shared upload repository
  • set up interlanguage links
  • manage user blocking and other site policies across multiple languages

Again, MediaWiki does not provide facilities that would make any of this significantly easier. MediaWiki also does not support:

  • getting an index of all pages across languages
  • getting a list of recent changes across all or certain languages
  • maintaining a single watchlist across multiple languages
  • indeed, operating any special page across a set of languages.

Due to the setup and maintenance costs involved, only a small number of the hundreds of sites using MediaWiki support multiple languages, and those usually support only a few languages. Language communities cannot evolve naturally on a MediaWiki installation; they usually have to go through procedural steps to convince the administrator that a new language has to be "set up" -- if this is possible at all.

Beyond that, there are wikis where a split into separate databases is entirely undesirable, because the wikis are inherently multilingual and centralized, and cross-language interaction on single pages is desired. Examples of this are Meta-Wiki (for votes) and Wikimedia Commons (for file description pages).

Support for multiple content languages in a single MediaWiki installation and database will address these concerns and others.

Caveats

While there may be technical reasons to split databases, such as easier decentralization and exports and easier localization (e.g. sort order, timestamps), and possibly (depending on query efficiency) better scalability, there are no reasons on the level of application logic to do so. This is because everything that can be modeled using multiple databases can be modeled using a single one.

However, since managing multiple languages in a single database makes it, theoretically, easier to add certain features, such as language filters, great attention has to be given to the question of how such filters and other features might affect community interaction in a wiki.

Admin choices

The site administrator has to make the following choice in LocalSettings.php:

  • Support all languages (including very minor and constructed languages)
  • Support all languages, except for specific ones (blacklist)
    • It may be desirable to provide certain preset groups, such as constructed languages.
  • Support only a certain set of languages (whitelist)
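
The three policies above can be modeled as a single predicate over language codes. The following is a minimal sketch only; the class name `LanguagePolicy`, the mode strings, and the method `is_supported` are hypothetical and do not correspond to any existing MediaWiki setting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguagePolicy:
    """Hypothetical model of the administrator's choice in LocalSettings.php."""
    mode: str = "all"                 # "all", "blacklist", or "whitelist"
    codes: frozenset = frozenset()    # codes named by the blacklist or whitelist

    def is_supported(self, code: str) -> bool:
        if self.mode == "all":
            return True
        if self.mode == "blacklist":
            return code not in self.codes
        if self.mode == "whitelist":
            return code in self.codes
        raise ValueError(f"unknown mode: {self.mode!r}")


# A wiki that supports only English and German.
policy = LanguagePolicy("whitelist", frozenset({"en", "de"}))
print(policy.is_supported("de"), policy.is_supported("fr"))
```

A preset group such as "constructed languages" would simply expand to a set of codes before being passed as `codes`.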

In addition, the administrator can choose which, if any, language should be used by default for viewing content. This option, $wgDefaultLanguage, could be set to a language code, or to 'auto,<fallback language code>', meaning that the browser's preferences are evaluated. If the detected language(s) is/are not supported by the wiki, the fallback code is used.
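
How the 'auto,<fallback language code>' form might be resolved can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, the browser's preferences are assumed to arrive as an HTTP Accept-Language header, and the primary subtag of each entry (e.g. "en" from "en-US") is assumed to match the wiki's language codes directly.

```python
def parse_accept_language(header: str) -> list[str]:
    """Return primary language codes from an Accept-Language header, best first."""
    weighted = []
    for position, item in enumerate(header.split(",")):
        item = item.strip()
        if not item:
            continue
        tag, _, params = item.partition(";")
        quality = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # treat a malformed weight as "not acceptable"
        code = tag.strip().lower().split("-")[0]
        if code and code != "*" and quality > 0:
            weighted.append((-quality, position, code))
    ordered = []
    for _, _, code in sorted(weighted):  # highest quality first, then header order
        if code not in ordered:
            ordered.append(code)
    return ordered


def resolve_default_language(setting: str, accept_language: str,
                             supported: set[str]) -> str:
    """Resolve a $wgDefaultLanguage-style value to one language code."""
    if not setting.startswith("auto,"):
        return setting  # a fixed language code
    fallback = setting.split(",", 1)[1].strip()
    for code in parse_accept_language(accept_language):
        if code in supported:
            return code
    return fallback  # none of the detected languages is supported
```

For example, with `supported = {"en", "de", "fr"}`, the setting 'auto,en' and a browser header of "ja,fr;q=0.8,de;q=0.9" would resolve to "de", while a header of "ja,ko" would fall back to "en".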

Backend

MediaWiki needs to come with information about languages. For this, the following two tables are added:

Table LANGUAGE
+----------------+-------------+------+-----+---------+----------------+
| Field          | Type        | Null | Key | Default | Extra          |
+----------------+-------------+------+-----+---------+----------------+
| language_id    | int(10)     |      | PRI | NULL    | auto_increment |
| english_name   | varchar(255)|      |     |         |                |
| native_name    | varchar(255)|      |     |         |                |
| iso639_2       | varchar(10) |      |     |         |                |
| iso639_3       | varchar(10) |      |     |         |                |
| wikimedia_key  | varchar(10) |      |     |         |                |
| dialect_of_lid | int(10)     |      |     | 0       |                |
| is_enabled     | tinyint(1)  |      |     | 0       |                |
+----------------+-------------+------+-----+---------+----------------+

Table LANGUAGE_GROUPS
+----------------+-------------+------+-----+---------+----------------+
| Field          | Type        | Null | Key | Default | Extra          |
+----------------+-------------+------+-----+---------+----------------+
| language_id    | int(10)     |      | PRI | 0       |                |
| group_name     | varchar(255)|      | PRI |         |                |
+----------------+-------------+------+-----+---------+----------------+

The fields are fairly self-explanatory. The ISO keys refer to the ISO 639-3 and ISO 639-2 codes [1]. The "Wikimedia key" is the code, if any, under which this language is known in the Wikimedia projects, e.g. "en" for English. Because the languages need to be loaded into memory on each pageview if no caching is available, there should be an index on is_enabled.

The groups allow us to build certain language groups, such as all constructed languages, all languages with Latin scripts, and so forth. The user can select at setup time which languages their installation should support, or change the is_enabled flags manually later. The most common choice will probably be "all Wikimedia project languages", which can be derived from the wikimedia_key (also used for other purposes) being non-empty.
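
The "all Wikimedia project languages" choice and a group lookup can be tried out against the two tables. The sketch below uses Python's sqlite3 module with an in-memory database purely for illustration; the column types are SQLite approximations of the schema above, the helper function names are hypothetical, and the three rows are illustrative sample data (Quenya is given an empty wikimedia_key only for the purposes of this example).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE language (
    language_id    INTEGER PRIMARY KEY AUTOINCREMENT,
    english_name   TEXT NOT NULL DEFAULT '',
    native_name    TEXT NOT NULL DEFAULT '',
    iso639_2       TEXT NOT NULL DEFAULT '',
    iso639_3       TEXT NOT NULL DEFAULT '',
    wikimedia_key  TEXT NOT NULL DEFAULT '',
    dialect_of_lid INTEGER NOT NULL DEFAULT 0,
    is_enabled     INTEGER NOT NULL DEFAULT 0
);
CREATE INDEX language_is_enabled ON language (is_enabled);
CREATE TABLE language_groups (
    language_id INTEGER NOT NULL DEFAULT 0,
    group_name  TEXT NOT NULL DEFAULT '',
    PRIMARY KEY (language_id, group_name)
);
""")

# Illustrative sample rows only.
conn.executemany(
    "INSERT INTO language (english_name, native_name, iso639_2, iso639_3, wikimedia_key)"
    " VALUES (?, ?, ?, ?, ?)",
    [("English", "English", "eng", "eng", "en"),
     ("German", "Deutsch", "ger", "deu", "de"),
     ("Quenya", "Quenya", "", "qya", "")],
)
conn.execute("INSERT INTO language_groups VALUES (3, 'constructed')")


def enable_wikimedia_languages(db):
    """Enable every language that has a non-empty Wikimedia key."""
    db.execute("UPDATE language SET is_enabled = 1 WHERE wikimedia_key <> ''")


def enabled_languages(db):
    """Return (wikimedia_key, iso639_3) for all enabled languages."""
    return db.execute(
        "SELECT wikimedia_key, iso639_3 FROM language"
        " WHERE is_enabled = 1 ORDER BY language_id").fetchall()


def languages_in_group(db, group_name):
    """Return the English names of all languages in a group."""
    rows = db.execute(
        "SELECT l.english_name FROM language l"
        " JOIN language_groups g ON g.language_id = l.language_id"
        " WHERE g.group_name = ? ORDER BY l.language_id", (group_name,))
    return [name for (name,) in rows]


enable_wikimedia_languages(conn)
```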

We also want to know what users can or want to do with these languages:

Table USER_LANGUAGES
+-------------+-------------+------+-----+---------+-------+
| Field       | Type        | Null | Key | Default | Extra |
+-------------+-------------+------+-----+---------+-------+
| user_id     | int(10)     |      | PRI | 0       |       |
| language_id | int(10)     |      | PRI | 0       |       |
| attribute   | varchar(15) |      | PRI |         |       |
| level       | int(10)     |      |     | 0       |       |
+-------------+-------------+------+-----+---------+-------+

Attribute can be something like 'read', 'translate', 'communicate', 'see_ui'. Level is a numeric preference or proficiency. For now, only 'read' and 'see_ui' will be used.

For the "create this page in another language" feature, we need a set of default languages that are likely to be known by a speaker of a certain language. This is very fallible, and should really be linked to a locale rather than a language, but it will do for now, especially as it can be customized in the user's language preferences.

Table LANGUAGE_DEFAULTS
+---------------------+---------+------+-----+---------+-------+
| Field               | Type    | Null | Key | Default | Extra |
+---------------------+---------+------+-----+---------+-------+
| language_id         | int(10) |      | PRI | 0       |       |
| default_language_id | int(10) |      | PRI | 0       |       |
+---------------------+---------+------+-----+---------+-------+

Note that this is a many-to-many relationship: a language usually has multiple default languages, and any language can be part of the set of default languages for any other.

In order to connect content in different languages, we need another table:

Table LANGUAGELINKS
+-------------+---------+------+-----+---------+-------+
| Field       | Type    | Null | Key | Default | Extra |
+-------------+---------+------+-----+---------+-------+
| set_id      | int(10) |      | PRI | 0       |       |
| language_id | int(10) |      |     | 0       |       |
| page_id     | int(10) |      |     | 0       |       |
| from_id     | int(10) |      |     |         |       |
+-------------+---------+------+-----+---------+-------+

Note that the set_id is not an autoincrement. It has to be assigned when a new set is created, as a set can have multiple members.

The functionality of language links is explained below. Besides this, the following tables need to have LANGUAGE_ID keys and lookup indexes that include the language ID:

  • PAGE
  • PAGELINKS
  • TEMPLATELINKS
  • CATEGORYLINKS
  • RECENTCHANGES
  • possibly QUERYCACHE

To avoid unnecessarily complicating matters in wikis which do not use multiple languages, the existing indexes should continue to exist; however, UNIQUE or PRIMARY keys need to be modified to include the language (it can be 0 for multilanguage wikis).

Logic and frontend

Content languages in MediaWiki should essentially act like meta-namespaces that exist hierarchically above all regular namespaces. Accordingly, these should be part of the page title, so any URL in a multilingual wiki would be of the form:

http://mywiki.example.org/index.php?title=en:Main_Page
http://mywiki.example.org/index.php?title=de:Talk:Hauptseite
http://mywiki.example.org/index.php?title=mult:Babel

Note that regular namespace names, at present, cannot be automatically localized, though using the new namespace manager features in MediaWiki 1.6, synonyms could be gradually created by the site manager as language communities emerge (see below).

The prefixes would be identical to the current Wikimedia key, if any. If no Wikimedia key exists, the ISO 639-3 three-letter code would be used, prefixed with "iso_" to make it unique.
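
The prefix rule can be written down in a few lines. The function name `language_prefix` is hypothetical; its two arguments correspond to the wikimedia_key and iso639_3 columns of the LANGUAGE table.

```python
def language_prefix(wikimedia_key: str, iso639_3: str) -> str:
    """Return the title prefix for a language (without the trailing colon)."""
    if wikimedia_key:
        return wikimedia_key
    # No Wikimedia key: fall back to the ISO 639-3 code, made unique with "iso_".
    return "iso_" + iso639_3


print(language_prefix("en", "eng"))  # en
print(language_prefix("", "qya"))    # iso_qya
```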

The prefix "mult:" stands for pages which support multiple languages, such as votes or certain templates. These would have the language code 0.

Note that a monolingual wiki would continue to act exactly as it does now, and use no prefixes whatsoever.

Using the new title rendering code that is part of the namespace changes in MediaWiki 1.6, it will be possible to show the language code as part of the rendered page title, but to style it separately (e.g. smaller font size).

The linking behavior within a language meta-namespace should be similar to a namespace with the "prefix" option set to its own name, i.e., all unprefixed links should point to pages in the same language. So, if you created a link to [[Portada]] from a page in Catalan, it would point to the Catalan Main Page, and only if you linked to [[es:Portada]], you would be referring to the Spanish version.
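
The linking rule can be sketched as a small resolver. This is an assumption-laden illustration: the function name `resolve_link` is hypothetical, and the set of known language prefixes is assumed to be available from the enabled rows of the LANGUAGE table (plus "mult").

```python
def resolve_link(target: str, source_language: str,
                 known_prefixes: set[str]) -> tuple[str, str]:
    """Return (language prefix, remaining title) for a wiki link target."""
    prefix, sep, rest = target.partition(":")
    if sep and prefix.lower() in known_prefixes:
        return prefix.lower(), rest       # explicit language prefix
    return source_language, target        # unprefixed: stay in the same language


prefixes = {"ca", "es", "mult"}
print(resolve_link("Portada", "ca", prefixes))     # ('ca', 'Portada')
print(resolve_link("es:Portada", "ca", prefixes))  # ('es', 'Portada')
```

A regular namespace such as "Talk:" is not a language prefix, so a link to [[Talk:Portada]] from a Catalan page stays within the Catalan meta-namespace.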

Language preferences

One major new feature of multilingual MediaWiki should be the ability to set language preferences without creating an account (if cookies are enabled). A new special page could be created for this purpose, e.g. Special:Languagepreferences (that would also be embedded into Special:Preferences). It would include the current user interface language selector, showing only those languages for which there are, in fact, interface translations. In addition, there would be a form element like the following:

Thanks to the flexibility of the USER_LANGUAGES table, additional language preferences could be added later, which will be especially useful for WiktionaryZ.

Note that the user interface deliberately does not make use of dropdown boxes, as the number of supported languages can run into the thousands (hence the link to the list of languages). The ideal UI would be an AJAX-based autocompletion interface with a repeated form, but since the necessary libraries will be implemented as part of the Wikidata UI layer, it does not make sense to be fancy at this point in time.

Innerlanguage links

MediaWiki currently supports "interlanguage links", links to a page in other languages that are displayed in the sidebar (in the MonoBook skin). However, these links relate to separate wiki databases. In order to distinguish the same feature within multilingual MediaWiki, we use the term "innerlanguage links".

One major deficit of the way interlanguage links are currently implemented is that it is necessary for each page to maintain a list of all languages to which it is connected. If, for example, you have a page in 10 languages, each of these pages needs to have a list of interlanguage links to 9 other languages. Proposals have been made to reform this using a central database. This is a complex problem, requiring central versioning and possibly single login to be solved.

It is much easier to solve within a single database. The LANGUAGELINKS table above works differently from the current interlanguage link system. Instead of having separate lists of links for each page, there are sets of pages which are connected.

The innerlanguage link syntax therefore is as follows:

[[join:<language code>:<Page title>]]

This is an instruction to the parser. For example, if you type [[join:en:Main Page]], when the page is saved, it is connected to the existing set with the member "en:Main Page". If this set does not yet exist, it is created as a pair. If you remove the join, the page is removed from the set.

There can only be one join instruction per page; all following instructions are ignored. Replacing a join is equivalent to removing the existing one and adding a new one. All pages in the set will have to be purged when a new join is made.

The page to which you join can be any member of the set. This information is stored in the from_id column of the languagelinks table. If doing direct translations, it can be used to trace which path content has taken in the translation process.
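
The join semantics described above can be sketched with an in-memory stand-in for the LANGUAGELINKS table. The class and method names are hypothetical, and several details are assumptions rather than part of the specification: the first member of a new set is given from_id 0, and removing a page leaves the rest of the set (and any from_id values pointing at the removed page) untouched.

```python
from dataclasses import dataclass


@dataclass
class LinkRow:
    set_id: int
    language_id: int
    page_id: int
    from_id: int  # page_id of the join target; 0 for a set's first member (assumption)


class LanguageLinks:
    """In-memory stand-in for the LANGUAGELINKS table."""

    def __init__(self):
        self.rows = {}         # page_id -> LinkRow
        self.next_set_id = 1   # assigned explicitly, not auto-incremented

    def leave(self, page_id):
        """Remove a page from its set (the join was deleted from the page)."""
        self.rows.pop(page_id, None)

    def join(self, page_id, language_id, target_page_id, target_language_id):
        """Connect a page to the set containing the target page."""
        self.leave(page_id)  # replacing a join = removing it and adding a new one
        target = self.rows.get(target_page_id)
        if target is None:
            # The set does not exist yet: create it as a pair.
            target = LinkRow(self.next_set_id, target_language_id, target_page_id, 0)
            self.next_set_id += 1
            self.rows[target_page_id] = target
        self.rows[page_id] = LinkRow(target.set_id, language_id, page_id, target_page_id)
        return target.set_id

    def members(self, set_id):
        """Return the page IDs in a set; these are the pages to purge after a join."""
        return sorted(r.page_id for r in self.rows.values() if r.set_id == set_id)
```

Joining page 2 to page 1 creates a set with both pages; joining page 3 to page 2 afterwards adds it to the same set and records 2 as its from_id, which is what allows the translation path to be traced.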

Innerlanguage links are rendered the same way as interlanguage links. All innerlanguage links are shown, regardless of language preferences. Experience has shown that it is desirable to make multilingual activity visible in this manner; it also makes caching easier.

Namespace-specific behavior

Some MediaWiki namespaces have certain functionality associated with them. This functionality is affected in a multilingual wiki.

Templates

Unless you explicitly refer to a multilingual template or one in another language, templates would be looked up in the same language namespace as the page where they are used.

MediaWiki namespace

The MediaWiki namespace currently supports multilingual content using a somewhat hackish subpage syntax to disambiguate languages. It should be ported to use the new language code system. Ideally, this could also be used to deprecate $wgForceUIMsgAsContentMsg - if a multilingual page exists in the MediaWiki namespace for a message, it is used for all languages. (Some messages could be multilingual by default.) For example, it would be possible to either create a multilingual portal as the frontpage (by creating mult:MediaWiki:Mainpage), or separate portals for each language.

File descriptions

File descriptions can be multilingual, but the page title is identical across all languages (the filename).

Links to files (displayed as an image or not) will point to the description page in the language of the linking page. If no description exists in that language, viewing the description page will show the languages for which descriptions are available. As a nice-to-have feature, the user's language preferences could be evaluated to provide an automatic fallback, if possible; e.g., if the user speaks English, and an English language description is available, but a description in the language of the current context is not, the English description is shown.
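
The lookup order for file descriptions, including the optional fallback, might look like the following sketch. The function name `pick_description` is hypothetical, and the user's languages are assumed to be given in order of preference.

```python
def pick_description(available: set[str], context_language: str,
                     user_languages: list[str]):
    """Return the language of the description to show, or None if there is no match."""
    if context_language in available:
        return context_language
    for code in user_languages:   # optional fallback to a language the user reads
        if code in available:
            return code
    return None  # caller lists the languages for which descriptions exist


print(pick_description({"en", "fr"}, "de", ["en"]))  # en
print(pick_description({"fr"}, "de", ["en"]))        # None
```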

Categories

It is possible to add innerlanguage links to categories; however, doing so does not mean that the translated category name (e.g. "en:Horse" => "de:Pferd") will inherit the category hierarchy of the original. Instead, category hierarchies can evolve separately in separate languages. In a multilingual repository like Wikimedia Commons, this means that separate file description pages in separate languages have separate categories.

In the long run, we want to complement the category system with meaning tags which relate directly to an element in the WiktionaryZ thesaurus structure, allowing for a multilingual concept structure that is identical across languages and that can be automatically rendered in languages where translations are available (see a first mock-up of this concept). This is a more desirable approach than perfecting wiki-local category schemas, as it promotes the use of WiktionaryZ as a single, global, structured conceptual database of the world.

User interface language default

When no user interface language is set, the UI language should be identical to the content language, that is, when viewing pages in English, the UI language should be English; in German, the UI language should be German, and so on. This ensures that each language community can customize the interface messages according to their needs, and emulates the current behavior of multilingual Wikimedia projects with split databases.

Regular links

In a number of places where relations between pages are evaluated (e.g. "What links here"), the language codes will have to be shown alongside the page title. Otherwise, the behavior of regular links is not affected.

Language filtering

By default, all languages are visible, and filtering is disabled.
Once language communities have been recognized, special pages like Recent changes can be filtered by language.

At least three special pages, Special:Contributions, Special:Allpages and Special:Recentchanges, should offer the ability to filter pages by language. It would be useful if this ability would gradually be added to other pages as well.

However, language filtering should be disabled by default. There should be a special, multilingual system message, MediaWiki:Language communities, which would contain a comma-separated list of language codes (but be blank by default). These language codes would identify self-organizing communities within a wiki.

For example, if a new wiki is started in English, and people start adding content in German, these additions should initially be visible to everyone. Other users can then try to determine whether they are legitimate additions to the wiki. If a true community seems to be forming, the language can be added to the list of language communities. Only then can it be filtered.

Once the first language community exists, the default filter would still be "All languages". The individual language communities would become available as single-language filters. Only if the user has set their language preferences would a new option appear (and become the default filter): "Languages I speak and new communities". "New communities" in this context would refer to languages which are not yet part of the list of language communities.
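
The filter rules can be summarized in a short sketch. All names here are hypothetical, including the internal filter identifiers "all" and "mine_and_new"; the input to `parse_communities` is assumed to be the text of the MediaWiki:Language communities message.

```python
ALL = "all"            # "All languages"
MINE = "mine_and_new"  # "Languages I speak and new communities"


def parse_communities(message_text: str) -> list[str]:
    """Parse the comma-separated list of language codes (blank by default)."""
    return [code.strip() for code in message_text.split(",") if code.strip()]


def filter_options(communities: list[str], user_languages: list[str]):
    """Return (available filters, default filter)."""
    if not communities:
        return [], None            # filtering stays disabled
    options = [ALL] + list(communities)
    default = ALL
    if user_languages:             # only offered once preferences are set
        options.insert(1, MINE)
        default = MINE
    return options, default


def is_visible(language: str, selected, communities: list[str],
               user_languages: list[str]) -> bool:
    """Decide whether a change in the given language passes the selected filter."""
    if selected is None or selected == ALL:
        return True
    if selected == MINE:
        # Languages the user speaks, plus any language that is not yet a community.
        return language in user_languages or language not in communities
    return language == selected    # single-language filter
```

With communities "en, de" and a user who reads only English, a French edit remains visible under the default filter because French is not yet a recognized community, while a German edit is hidden.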

The mechanism of identifying language communities, and generally showing all languages by default, hopefully ensures that vandalism and spamming will only be hidden from the view of all users once a legitimate community exists to deal with it. It also promotes interaction between the existing community and newly forming language communities, allowing for guidance and advice in formulating initial policies and setting up pages.

Language proficiency

The language proficiency, if known, should be shown in two places:

  • user pages
  • user list, administrator list (it would be ideal if these could be filtered).

It serves a similar purpose as the current "Babel" templates (see commons:User:Eloquence as an example - the boxes at the bottom are language proficiency templates), but can be reliably accessed by MediaWiki itself.

Go button

The Go button would behave as it currently does; however, it would search for pages in the language of the currently viewed page, unless a language prefix is provided. As a nice-to-have feature, if language communities are configured (see above), these should be available as a dropdown in the Go/Search toolbox.

"Create a page in this language"

An easy interface should be provided for creating translations or versions of pages in different languages.

One unique new feature that should be part of the first implementation of multilingual MediaWiki is the ability to easily create a version of a page in another language. At the bottom of each page, a menu is shown which allows the user to select a language, enter a title (prefilled by default with the current page title), and create the page.

It is important that the word "Translate" is avoided in the user interface in this context, as the linked page is not necessarily a direct translation (and perhaps most frequently will not be).

The languages shown in the dropdown selection depend on three factors:

  • Languages where a corresponding page already exists are not listed in the selection.
  • If the user is anonymous or has not set their language preferences yet, the languages come from LANGUAGE_DEFAULTS.
  • Otherwise, they come from the USER_LANGUAGES table.
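
The three factors above can be combined as in the following sketch. The function name `creatable_languages` is hypothetical. One assumption goes beyond the text: for anonymous users, LANGUAGE_DEFAULTS is looked up by the language of the page being viewed. The set of existing languages is assumed to include the current page's own language.

```python
def creatable_languages(existing: set[str], page_language: str,
                        user_languages: list[str],
                        language_defaults: dict[str, list[str]]) -> list[str]:
    """Return the languages to offer in the 'create this page' dropdown."""
    if user_languages:
        candidates = user_languages                        # from USER_LANGUAGES
    else:
        candidates = language_defaults.get(page_language, [])  # from LANGUAGE_DEFAULTS
    # Languages where a corresponding page already exists are not listed.
    return [code for code in candidates if code not in existing]


defaults = {"en": ["de", "fr", "es"]}
print(creatable_languages({"en", "de"}, "en", [], defaults))            # ['fr', 'es']
print(creatable_languages({"en", "de"}, "en", ["de", "nl"], defaults))  # ['nl']
```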

A link with the title "Customize languages" or similar should open the language preferences dialog.

The newly created page must be prefilled with an innerlanguage join to the origin page. Beyond that, it is empty.

Future ideas: Translation assignment

One feature that would make a lot of sense in many applications is a translation assignment manager (TASMAN). A TASMAN would allow each user to define from which language(s) they are willing to translate content. Individual pages could be flagged as to be translated. Based on their language proficiency setting, translators could be notified (by e-mail or wiki messaging) exactly when a page in a language they can translate from is to be translated. In order to manage assignments, "in use" templates could be used, though a task management system such as Magnus Manske's Tasks extensions [2] might be useful for this purpose.