
InstantCommons


These are draft development specifications, not documentation. This feature does not exist yet.

(Illustration: enter the name of an image from Commons on any MediaWiki installation, and the image is fetched from Commons and embedded into the page.)

InstantCommons is a proposed feature for MediaWiki to allow the usage of any uploaded media file from the Wikimedia Commons in any MediaWiki installation world-wide. InstantCommons-enabled wikis would cache Commons content so that it would only be downloaded once, and subsequent pageviews would load the locally existing copy.

Rationale

As of February 2006, the Wikimedia Commons, a central media archive operated by the Wikimedia Foundation, contains about 450,000 files uploaded by nearly 30,000 registered users. Each of these files is available under a free content license or in the public domain; there are no restrictions on use beyond those relating to the use of official insignia. Licenses which limit commercial use are considered non-free.

As awareness of the Commons grows, so does the desire of external parties to use the content it contains and to contribute new material. It is currently technically possible to load images directly from Wikimedia's servers in the context of any webpage. This practice is undesirable for multiple reasons:

  • It does not respect the license terms of the image, and does not allow for other metadata to be reliably transported
  • It does not give credit to Wikimedia
  • It consumes Wikimedia bandwidth on every pageview (unless the image has been cached on the client side or through a proxy)
  • It does not facilitate useful image operations such as thumbnail generation and captioning and is difficult to use in the context of a wiki, particularly for standard layout operations
  • It is tied to URLs as resource identifiers, which complicates mirroring
  • It creates an untrackable external usage web, where any change on Wikimedia's side necessarily affects these external users
  • It does not permit offline viewing, which is crucial in countries with only intermittent network access.

The InstantCommons proposal seeks to address all this by providing an easy method for cached loading of images and metadata from Wikimedia's servers. The first implementation of InstantCommons will be within MediaWiki, allowing for all MediaWiki image operations (thumbnailing, captioning, galleries, etc.) to be performed transparently. However, other wiki engines can implement InstantCommons-like functionality using the API operations described below.

Basic feature set

During installation, the site administrator can choose whether to enable InstantCommons. This could be tied to the wiki being under a free content license (see #Scalability considerations). Ideally, however, the feature should be enabled by default (provided a writable upload directory is specified) to allow the largest possible number of users to use Wikimedia Commons content.

If the feature is enabled, the wiki behaves like a Wikimedia project: if a page refers to an image or other media file that exists on Commons, the file can be included like a locally uploaded file simply by specifying its name. Local filenames take precedence over Commons filenames.

While the Wikimedia Commons would be the default repository for images, the implementation would not be repository-specific. Instead, it would be an extension to the existing shared image repository functionality in MediaWiki (used by Commons), which currently only allows filesystem-based usage of an external image repository (though image description pages are already fetched via HTTP). A single Boolean parameter ($wgUseInstantCommons) should be sufficient to enable or disable access to Wikimedia Commons, while access to a different repository would require more configuration.
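
For illustration, a minimal LocalSettings.php fragment under this proposal could look as follows; $wgUseInstantCommons is the switch named above, and the other two settings are MediaWiki's standard upload options:

<?php
# LocalSettings.php fragment (sketch): enable InstantCommons as proposed above.
$wgEnableUploads     = true;          # a writable upload directory is required
$wgUploadDirectory   = "$IP/images";  # where locally cached copies will be stored
$wgUseInstantCommons = true;          # fetch missing files from Wikimedia Commons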

Implementation details

When a filename from Commons (or another repository) which does not exist locally is entered into the wiki and the page is parsed, the wiki sends an XML-RPC [1] request to the repository to ask whether a file with this exact name exists and what its size is. If the file exists, a response containing the file size and URL is returned. Multiple requests should be aggregated into one (using multiple methodCall and methodResponse elements).

Request example:

<?xml version="1.0"?>
<methodCall>
  <methodName>files.getInformation</methodName>
  <params>
    <param>
      <value><string>Karachi - Market.jpg</string></value>
    </param>
  </params>
</methodCall>

Response example:

<?xml version="1.0"?>
<methodResponse>
  <params>
    <param>
      <value>
        <struct>
          <member>
            <name>fileLastModified</name>
            <value><dateTime.iso8601>20050717T14:08:55</dateTime.iso8601></value>
          </member>
          <member>
            <name>fileSize</name>
            <value><i4>169885</i4></value>
          </member>
          <member>
            <name>fileURL</name>
            <value><string>http://upload.wikimedia.org/wikipedia/commons/c/c5/Karachi_-_Pakistan-market.jpg</string></value>
          </member>
        </struct>
      </value>
    </param>
  </params>
</methodResponse>

If the file does not exist, an XML-RPC <fault> structure should be used to describe the cause of the problem. In the first implementation, only one error code will be supported, which will indicate that the file does not exist. The request should be handled by a new special page, Special:API, which could be extended later to provide other functionality. A table should record all failed InstantCommons requests together with the page_id and the page_touched timestamp from which they were made. A failed request would only be repeated from a different page, or when the page_touched timestamp is newer than the one recorded in the table.
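
For illustration, the client side of such a request could be implemented roughly as follows. This is a sketch only: it assumes PHP's xmlrpc extension, and the endpoint URL and function name are placeholders rather than part of the specification.

<?php
# Sketch of the client side of files.getInformation (assumes the PHP xmlrpc extension).
function icGetFileInformation( $filename, $endpoint ) {
    $payload = xmlrpc_encode_request( 'files.getInformation', array( $filename ) );
    $context = stream_context_create( array( 'http' => array(
        'method'  => 'POST',
        'header'  => "Content-Type: text/xml\r\n",
        'content' => $payload,
    ) ) );
    $xml = file_get_contents( $endpoint, false, $context );
    if ( $xml === false ) {
        return null;   # network failure; treat like a missing file and record it
    }
    $response = xmlrpc_decode( $xml );
    if ( is_array( $response ) && xmlrpc_is_fault( $response ) ) {
        return null;   # only one fault is defined initially: the file does not exist
    }
    return $response;  # array with fileLastModified, fileSize and fileURL
}

# Hypothetical call against the proposed Special:API endpoint:
$info = icGetFileInformation( 'Karachi - Market.jpg',
    'http://commons.wikimedia.org/wiki/Special:API' );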

If the file exists, the wiki uses an HTTP request to download it (possibly using PHP's fopen() function with a remote URL). It would be desirable for this operation to be performed in the background, so that a slow download does not delay displaying the rest of the page. AJAX polling could be used to ask the server for the transfer status and update a progress bar.
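
A minimal sketch of the download step using PHP's stream functions (the function name and the chunking remark are illustrative):

<?php
# Sketch: copy the remote file into the local upload directory.
function icDownloadFile( $url, $destPath ) {
    $src = fopen( $url, 'rb' );        # remote file; requires allow_url_fopen
    if ( !$src ) {
        return false;
    }
    $dest = fopen( $destPath, 'wb' );  # local copy inside the upload directory
    if ( !$dest ) {
        fclose( $src );
        return false;
    }
    # Copying in chunks would allow the transfer status to be reported for AJAX polling.
    $copied = stream_copy_to_stream( $src, $dest );
    fclose( $src );
    fclose( $dest );
    return $copied !== false;
}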

If the download is successful, the file will be treated as a local upload: its metadata will be processed and added to the IMAGE table, and a copy of the file will be placed in the local directory structure. However, the file will not have an associated history or image description page. To distinguish local uploads from InstantCommons content, a new flag indicating that the file comes from a remote server may have to be added to the IMAGE table, or the img_user field could be set to 0 for these files. The img_timestamp field will record the time of the import.
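
A rough sketch of this registration step using MediaWiki's database layer; the img_instantcommons column stands in for the hypothetical new flag discussed above and is not an existing field, and further metadata columns are omitted for brevity:

<?php
# Sketch: register the fetched file as if it were a local upload.
$dbw = wfGetDB( DB_MASTER );
$dbw->insert( 'image', array(
    'img_name'           => $name,                 # e.g. "Karachi_-_Market.jpg"
    'img_size'           => $info['fileSize'],
    'img_user'           => 0,                     # 0 marks InstantCommons content
    'img_user_text'      => 'InstantCommons',
    'img_timestamp'      => $dbw->timestamp(),     # time of the import
    'img_description'    => 'Fetched from Wikimedia Commons',
    'img_instantcommons' => 1,                     # hypothetical remote-origin flag
    # (width, height and other metadata fields omitted in this sketch)
) );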

As a consequence, future pageviews of the same page will no longer have to query Commons, as the file will now exist as a local copy.

The description page will use the existing functionality to load metadata from Commons using interwiki transclusion; however, a new caching table will be created to store and retrieve the returned HTML description once it has first been downloaded. This will reduce automated queries to Commons, especially since we cannot rely on InstantCommons wikis doing proper search engine exclusion.
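
The cache could be consulted roughly as follows; the table name ic_description_cache, its columns, and the helper that performs the interwiki transclusion fetch are assumed names, not part of this specification:

<?php
# Sketch: serve the Commons description HTML from a local cache table.
function icGetDescription( $name ) {
    $dbr = wfGetDB( DB_SLAVE );
    $row = $dbr->selectRow( 'ic_description_cache',
        array( 'icd_html' ),
        array( 'icd_name' => $name ) );
    if ( $row ) {
        return $row->icd_html;                       # cached copy, no query to Commons
    }
    $html = icFetchDescriptionFromCommons( $name );  # hypothetical helper: interwiki transclusion fetch
    $dbw = wfGetDB( DB_MASTER );
    $dbw->replace( 'ic_description_cache',
        array( 'icd_name' ),
        array( 'icd_name' => $name, 'icd_html' => $html ) );
    return $html;
}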

There will be no mechanism to automatically push changes to files or description pages from Commons to InstantCommons-enabled wikis. Instead, the description page of each file that has once been loaded via InstantCommons will offer a link, available to logged-in users, to re-fetch the locally cached data (image and description page). InstantCommons files must be deletable by sysops like locally uploaded files; in addition, it should be possible to prevent a once-deleted file from being re-created (e.g. by protecting a blank image description page).

Thumbnail generation

Thumbnails and scaled-down versions could be generated on the Wikimedia Commons side and returned using a getThumbnail method. This may be necessary in particular when the requesting wiki has no thumbnail generation capability. In addition, cached thumbnails for common sizes will often already exist on the Commons servers, so it makes sense to save the computation of additional ones, especially for very large images.

Whenever a thumbnail is downloaded for a picture, MediaWiki should also download the full-size version, since it is reasonable to expect that the picture will eventually be viewed at full size.
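
Continuing the client sketch from above, a thumbnail request might look roughly like this; the method name getThumbnail comes from this proposal, but its parameters are assumed:

<?php
# Sketch: request a pre-scaled version, then fetch both thumbnail and original.
$payload = xmlrpc_encode_request( 'files.getThumbnail',
    array( 'Karachi - Market.jpg', 180 ) );   # filename and requested width in pixels
# POST $payload to the repository as in icGetFileInformation(), then download both
# the returned thumbnail URL and the full-size fileURL with icDownloadFile().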

Media links and non-embedded filetypes

In addition to images, MediaWiki also supports other file types. These can be linked to using [[Media:]] links, which result in a direct URL pointing to the file on the server, or using [[Image:]] links (in later MediaWiki versions also known as [[File:]]), where the file description page shows a link to the uploaded file instead of embedding it. These links and description pages will be treated no differently from embedded images: when a media link is found in a page, or a file description page is viewed, and the file does not exist locally, an inquiry is sent to Commons, and if the file exists there, a local copy is made.

Maintenance script for updates and copyright issues

Since automatic purging of images is not part of this implementation (as it would increase the complexity significantly), there is a risk of copyright violations spreading to external wikis. Wikimedia can be expected to exercise due diligence in order to prevent this from happening. To empower site administrators, InstantCommons should come with a maintenance script that performs files.getInformation calls for all images that have ever been loaded from Commons. The script should be runnable in three modes:

  • Update images which have been changed on Commons
  • Delete images which have been removed from Commons
  • Both

If deletions are performed, a backup copy with a non-guessable filename should be made in a subdirectory of the "upload/" path, and a report should be given to the user whose images have been deleted. Possible future improvements to this include getting the deletion reason from Commons, but this is not necessary in the first implementation.

Site administrators should be advised to run this maintenance script regularly.
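
A skeleton of such a maintenance script, reusing the helpers sketched earlier; only the three modes come from this specification, everything else (helper names, endpoint, data layout) is illustrative:

<?php
# instantCommonsUpdate.php (sketch): re-check every file ever fetched from Commons.
$mode     = isset( $argv[1] ) ? $argv[1] : 'both';   # update | delete | both
$endpoint = 'http://commons.wikimedia.org/wiki/Special:API';

foreach ( icGetInstantCommonsFiles() as $file ) {     # hypothetical helper: list all cached files
    $info = icGetFileInformation( $file['name'], $endpoint );
    if ( $info === null ) {
        if ( $mode === 'delete' || $mode === 'both' ) {
            icBackupAndDelete( $file );               # backup under upload/ with a non-guessable name, then delete
        }
    } elseif ( $info['fileLastModified'] > $file['lastModified'] ) {
        if ( $mode === 'update' || $mode === 'both' ) {
            icDownloadFile( $info['fileURL'], $file['path'] );   # refresh the changed file
        }
    }
}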

Logging

All files loaded via the InstantCommons mechanism should be registered in the upload log, along with the user name/IP and a specific remark that they have been loaded transparently from Commons. This makes it possible to detect abuses and block abusers.
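
With MediaWiki's existing logging facilities this could amount to little more than the following sketch (the log comment text is only an example, and the exact call is an assumption rather than part of the proposal):

<?php
# Sketch: record an InstantCommons fetch in the regular upload log.
$log   = new LogPage( 'upload' );
$title = Title::makeTitle( NS_IMAGE, $name );
$log->addEntry( 'upload', $title,
    'Loaded transparently from Wikimedia Commons via InstantCommons' );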

Scalability considerations

Because the InstantCommons feature would allow a wiki user to download resources from the Wikimedia servers, it is crucial that there is no possibility of a Denial of Service attack against either the using wiki, or the Wikimedia Commons, for example, by pasting 30K of links to the largest files on Wikimedia Commons into a wiki page and pressing "preview".

Therefore, every successful InstantCommons request will have to be logged by the InstantCommons-enabled wiki together with the originating user or IP address and the time of the request. If an individual user exceeds a generous internal bandwidth limit (which could be as high as 1 GB by default, but should be configurable), no further images will be downloaded for that user within a 24-hour period. This limitation should not apply to wiki administrators (if a wiki admin wants to conduct a denial of service attack against their own wiki, they do not need to be stopped from doing so; if they want to conduct an attack against Wikimedia, they cannot be stopped from doing so except on Wikimedia's end).

In addition to the per-user bandwidth limit, there could be a limit on the size of files which are downloaded transparently, primarily because files above a certain size would delay pageviews significantly and might even cause the page request to time out. It might be desirable to use an external application for downloading these files, so that the transfer can proceed in the background while the page request continues. Finally, there could be a total maximum size for the InstantCommons cache; if this size is exceeded, no further files would be downloaded.
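
Taken together, the three limits could be checked with something like the following; every identifier and default value here is a placeholder (including the two hypothetical accounting helpers), since the specification only fixes the general behaviour:

<?php
# Sketch of the proposed limits; all names and defaults are illustrative.
$wgInstantCommonsUserBandwidth = 1073741824;   # ~1 GB per user per 24 hours
$wgInstantCommonsMaxFileSize   = 20971520;     # no transparent download above ~20 MB
$wgInstantCommonsMaxCacheSize  = 10737418240;  # total cache ceiling, ~10 GB

function icMayDownload( $user, $fileSize ) {
    global $wgInstantCommonsUserBandwidth, $wgInstantCommonsMaxFileSize,
        $wgInstantCommonsMaxCacheSize;
    if ( $user->isAllowed( 'delete' ) ) {              # administrators are exempt
        return true;
    }
    if ( $fileSize > $wgInstantCommonsMaxFileSize ) {
        return false;                                  # too large for transparent download
    }
    if ( icBytesUsedLast24h( $user ) + $fileSize > $wgInstantCommonsUserBandwidth ) {
        return false;                                  # per-user 24 hour bandwidth limit
    }
    return icCacheSize() + $fileSize <= $wgInstantCommonsMaxCacheSize;
}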

While it is unlikely that individual wikis using the InstantCommons feature would cause a significant increase in cost for the Wikimedia Foundation (since every file only has to be downloaded once, and there are per-user bandwidth limitations), it would nevertheless be fair and reasonable for projects using the feature to include a notice on InstantCommons description pages such as: "This file comes from Wikimedia Commons, a media archive hosted by the Wikimedia Foundation. If you would like to support the Wikimedia Foundation, you can donate here ..."

Future potential

In the future, it may be desirable to offer a publish/subscribe model for changes, which will require wiki-to-wiki authentication and a database of images which are used in subscribing wikis. This would also open up the threat of cross-wiki vandalism, which could be addressed using a delay phase of 24 hours or more before changes take effect.

Two-way functionality is another possibility, that is, to allow uploading free media directly to Commons from any wiki installation. However, this will require federated authentication as a minimum. It may also necessitate cross-wiki communication facilities to notify users from other wikis about Commons policies, which could be part of a larger project like LiquidThreads.

Finally, the biggest challenge in making Commons content available is making it searchable across all languages; new approaches such as meaning-based tagging will be necessary to accomplish this. This functionality will hopefully be enabled by the OmegaWiki project; see a simple demonstration of the concept.

Functionality similar to InstantCommons could also be offered for extensions: if extensions like WikiTeX are run within a secure environment on the Wikimedia servers, access to them could be provided to any free content wiki. The benefit for Wikimedia would be that the generated data could be stored on Wikimedia's servers as well, and potentially useful content could be reviewed and added to the Wikimedia projects. (A subscriber database would again be useful to record the source and context of use, perhaps even allowing for a browsable library of recently generated extension-derived content on outside wikis.)