Importing a Wikipedia database dump into MediaWiki

Most of the page below is outdated; still-relevant bits should be merged to Data dumps. See Data dumps for current information.


Install MediaWiki first

The MediaWiki install script will overwrite anything in the target database, so unless you are careful and do a manual installation, install and set up MediaWiki on a blank database before importing the articles. You'll feel mighty sorry if you spend an hour importing everything and then come up blank. :)

See MediaWiki User's Guide: Installation

If you want access to the data but don't intend to run a local MediaWiki installation with it, you can just install MySQL and create an empty database.
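Creating an empty database with the MySQL client might look like the following; the database name wikidb and the root account are just placeholders, not anything the dumps require:

 mysql -u root -p -e "CREATE DATABASE wikidb;"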

Fetch a dump

http://download.wikimedia.org/

The public dumps are made about every 1-2 months, and include the cur table (current revisions of all pages) and old table (prior revisions of every page) for each Wikimedia wiki. The size will vary a bit; the English Wikipedia is huge.

Each dump is a big series of SQL statements that creates and populates a table. Importing a dump will remove and overwrite any existing table of the same name in the database you import it into.

In some cases (e.g., when downloading Wikipedia) you'll also want to download the images. The packed-up images are larger than 2^32 bytes, and such large files cause problems for some programs. In particular, commonly deployed versions of wget will not download them correctly: because only the size modulo 2^32 fits in a 32-bit value, wget sees a 16.7 GiB file (4 × 4 GiB plus roughly 700 MiB) as just 700 MiB. Try using another program, or perhaps an "unstable" version of wget. Recent versions of curl have been tested and are able to download large files without problems, but note that curl needs an explicit option ("-C -") to continue from a partially downloaded file if the transfer was interrupted (which can happen during a download lasting several hours); otherwise it will start from scratch.
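A resumable download with curl might look like this (the file name is only an example; substitute the actual image archive you want from the download site):

 curl -C - -O http://download.wikimedia.org/images/wikipedia/en/upload.tar

Running the same command again after an interruption continues from where the previous transfer stopped instead of starting over.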

Also, note the legal issues, particularly for the images, and be sure you understand them.

Import

The dumps are compressed with bzip2. Windows users probably don't have bzip2 installed, but most Linux, Unix, and Mac OS X users will have it preinstalled. A Win32 executable is available from http://sources.redhat.com/bzip2/

You probably don't want to decompress the file "in-place" as this will just waste disk space. Decompress it on-the-fly while importing it into the database:

 bzip2 -dc cur_table.sql.bz2 | mysql -u myusername -p mydatabasename
 bzip2 -dc cur_table.xml.bz2 | php importDump.php
 (You may need to prefix the commands and file names with their full paths for them to execute correctly.)

This may take a while, and it's normal to get no feedback until it's finished and exits. Just sit back and watch your free space go down!
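If you want some sign of progress, one rough trick (my own suggestion, not something the import requires) is to watch the size of the MySQL data directory grow from another terminal; the path below is a common default and may well differ on your system:

 du -sh /var/lib/mysql/mydatabasename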

Packet size

If mysql reports the error

ERROR 1153 at line 831: Got a packet bigger than 'max_allowed_packet'

(for any line), you might have better luck by (a) increasing max_allowed_packet in my.cnf, or (b) importing directly from the client. (Possibly a combination of the two.) To import from the client, you must have a decompressed copy of the table to be imported. Use the following command to read directly from the SQL file.

mysql> \. cur_table.sql
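For option (a), the setting goes in the [mysqld] section of my.cnf; the value below is an arbitrary sketch, so raise it until the error goes away, and restart MySQL afterwards:

 [mysqld]
 max_allowed_packet = 32M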

Import XML

There are two ways to import the XML dump into a MySQL database. If you are using a MediaWiki installation, run the importDump.php script from the /maintenance subdirectory:

zcat pages_full.xml.gz | php importDump.php

bzip2 -dc pages_full.xml.bz2 | php importDump.php

If you want to do a direct import into MySQL, use the xml2sql script. To convert the XML and import without touching the hard disk, run the following:

zcat pages_current.xml.gz | xml2sql | mysql -u xxx -p wikidb

The only documentation I can find is http://de.wikipedia.org/wiki/Wikipedia:Download/xml2sql, on the German Wikipedia; there desperately needs to be more documentation on this in English.

Rebuild auxiliary tables

You may want the recentchanges and links tables rebuilt. Go into the maintenance subdirectory of the MediaWiki source and run

 php rebuildall.php

NOTE: Make sure the AdminSettings.php file in your main MediaWiki directory is set up correctly; otherwise this will not work, because the script will not be able to access your MySQL database. The MediaWiki distribution ships with an AdminSettings.sample file which needs to be copied to AdminSettings.php and edited to suit your site configuration.
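In shell terms, that setup step is roughly the following (run from your main MediaWiki directory); the sample file documents the exact settings it expects, so treat this only as a sketch:

 cp AdminSettings.sample AdminSettings.php
 # edit AdminSettings.php and fill in a MySQL account that can read and write the wiki database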

This will take a while, but you'll at least get some feedback while it's working. Speed will vary depending on the size of the database and your system: the French Wikipedia database can be processed in about 20 minutes on a 2GHz Athlon; the English Wikipedia may take up to several hours.

  • Note that MediaWiki 1.5rc4 has a bug in this script, so you also have to run refreshLinks.php from the maintenance directory (as shown below).
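The workaround is simply to run that script the same way as rebuildall.php, from the maintenance subdirectory:

 php refreshLinks.php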