Jump to content

Toolserver/Database: Difference between revisions

From Meta, a Wikimedia project coordination wiki
Content deleted Content added
m →‎Status files: spelling fix
Line 106: Line 106:


== Status files ==
== Status files ==
There are status-files about each cluster (and the user-dbs) at /var/www/status_s'''''clusternumber''''' or <nowiki>http</nowiki>://tools.wikimedia.de/status_s'''''clusternumber'''''. In both cases, '''''culsternumber''''' can be a digit from 1 to 3.
There are status-files about each cluster (and the user-dbs) at /var/www/status_s'''''clusternumber''''' or <nowiki>http</nowiki>://tools.wikimedia.de/status_s'''''clusternumber'''''. In both cases, '''''clusternumber''''' can be a digit from 1 to 3.

Revision as of 08:23, 3 December 2007

This page describes how to use the replicated database on the Toolserver. For general user information, please see Toolserver/For users.

There are two database servers. You should connect to them with the following names:

  • sql - for user databases and the 'toolserver' database
  • sql-s1 - for databases in the s1 cluster
  • sql-s2 - for databases in the s2 cluster
  • sql-s3 - for databases in the s3 cluster

You can find out which cluster a particular wiki is on by inspecting the toolserver.wiki database:

SELECT server FROM toolserver.wiki WHERE dbname='enwiki_p';

At the command line, you can also use the sql command, which will do this for you:

$ sql enwiki_p

Alternatively, you can use a hostname of the form: enwiki-p.db.ts.wikimedia.org.

You have a MySQL database called "u_<username>". The password for this is in your ~/.my.cnf file. Only you have access to this database. If you would like to share some tables, please ask the roots for a second public database and put the tables in that database.

Databases for most public wikis are available. These are called <lang><project>_p, where <lang> is the language (e.g. "en", "de" or "commons"), and <project> is the project ("wiki" for Wikipedia, "wiktionary", "wikibooks" etc). Access to these databases is read-only. A list of all wiki databases along with information about the content language and size is in the wiki table in the toolserver database.

The following tables are available for use (see the database layout on mediawiki.org for details):

Some of these databases have additional indexes not present on the standard schemata; these are documented in /usr/local/etc/views. If you need another index added, ask.

Format of Text in cur and text Tables

Note: the data in these tables stops at 2005-07-22.

Text in the 'cur' or 'text' tables may be compressed, or in object format.

  • Compressed text has 'gzip' in old_flags. This data is compressed with headerless zlib compression.
  • Text with "utf8" in old_flags is in UTF-8 encoding. Text without this flag may or may not be latin-1. (??? explain further)
  • Text with "object" in old_flags is encoded as a serialised PHP object. This may refer to:
    • cur stubs. the PHP object contains the cur_id for the relevant row in the cur table containing the actual text.
    • History blob stubs. this is concatenated compressed storage. (??? is this documented?)
    • Something else?
  • Text with "external" in old_flags is stored in separate external databases. This text is not yet available on the toolserver databases.

You can use the MediaWiki function Revision::getRevisionText to extract the actual text automatically (see Revision.php for details).

At some point, the missing old text will be imported from an XML dump, and compressed data will be uncompressed. However, this needs to wait until we have more disk space available.

Warning: the databases containing the data are marked as "latin1". However, they do not contain latin1 data. Most of the data is in UTF-8. This is a holdover from MySQL 4.0's (lack of) UTF-8 support. Be very careful that the MySQL client does not try to convert the data to UTF-8 for you, or you will end up with garbage.

Text that is not available on the toolserver database can be retrieved through WikiProxy.

Timestamps

Timestamps are NOT numbers, they are strings consisting of numeric characters. So enclose timestamps in quotes (e.g. rc_timestamp > '200500000000' instead of rc_timestamp > 200500000000). If you forget this, your query will run much slower (up to 50x slower or more).

SQL-Queries

  • Zombie queries: If you accidently started a query that needs too much time you can get its thread_id with SHOW PROCESSLIST and terminate it with KILL thread_id. Typing ^C at the MySQL client prompt will also kill the currently executing query.
  • Bulk insert: Inserting tab-separated files is much faster that inserting single records: LOAD DATA LOCAL INFILE absolute-path-to-file IGNORE INTO TABLE table
  • If you run slow queries on the database, you may want to read [1] for a way to make them faster.

SSH-Tunnels

To connect to phpmyadmin on the toolserver you have to tunnel the port 8080:

ssh -L 8080:127.0.0.1:80 username@hemlock.ts.wikimedia.org

Now you can access phpmyadmin via http://localhost:8080/phpmyadmin

To connect to the mysql database on zedler from a remote host you have to tunnel the port 3306:

ssh -L 3306:sql:3306 username@hemlock.ts.wikimedia.org

You can concatenate both tunnels:

ssh -L 127.0.0.2:3306:sql:3306 -L 8080:127.0.0.1:80 username@hemlock.ts.wikimedia.org

You can now connect via mysql:

mysql --user=username --host=127.0.0.1 --socket=3306 --password
or
mysql --user=username --protocol=tcp --socket=3306 --password

If you have mysql server installed on your machine (hence port 3306 being in use), you can assign a different port for the SSH tunnel:

ssh -L 3307:sql:3306 username@hemlock.ts.wikimedia.org

in this case, you also need to use this port when trying to connect with mysql. For example:

mysql --user=username --host=127.0.0.1 --socket=3307 --password

Status files

There are status-files about each cluster (and the user-dbs) at /var/www/status_sclusternumber or http://tools.wikimedia.de/status_sclusternumber. In both cases, clusternumber can be a digit from 1 to 3.