Fulltext search engines


This is a list of full-text search engines, and of technologies that could potentially be used to build them, for MediaWiki.

JODA

(ioda, because joda was already taken by some other project)

Download http://sourceforge.net/projects/ioda/
See it live at http://wikipedia.rhein-zeitung.de/index.php/Hauptseite (this page demonstrates only the indexer and is not intended as a mirror of Wikipedia)

From a mailing list posting by Jochen Magnus: older versions of Joda have been in use since 1996 as the newspaper archive for the Rhein-Zeitung (Koblenz and Mainz, Germany). It is also used for archive and newsdesk purposes by several other European newspapers. At the moment it is going into action as the full-text index for Europe's biggest magazine. It is also in use for the public index of the state archive of Rheinland-Pfalz (Germany).

Last year I created two mirrors of Wikipedia, one using MediaWiki for demonstration purposes and another - our public one - using our own read-only web frontend. Joda is integrated into both mirrors:

http://wikipedia.rhein-zeitung.de/index.php/Hauptseite (MediaWiki)
http://lexikon.rhein-zeitung.de/ (our special Wikipedia interface)

At the suggestion of Magnus Manske (not related :-) I published Joda under the LGPL and made several improvements for the Wikipedia task. I wrote tools for indexing a whole cur table either from MySQL or from an SQL dump (which is twice as fast). Indexing the German Wikipedia cur table (>210,000 articles, 36 million words) takes approx. 45 minutes. An optional database optimization takes an additional 25 minutes. Both figures are for a dual Athlon 2800+ machine with 1 GB RAM (the indexer is a multi-threaded Perl program).

Joda can erase or update entries on the fly and can handle queries with parentheses and word-distance operators, such as http://lexikon.rhein-zeitung.de/?((Albert OR Alfred) AND.1 Einstein) NEAR Quant*) NOT Gravitation. See more features at http://ioda.sourceforge.net/
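To make the word-distance idea concrete, here is a minimal sketch of a positional inverted index with a NEAR-style check. It is not Joda's implementation and does not parse Joda's query syntax; the names `build_index` and `near`, the sample documents, and the whitespace tokenizer are all assumptions made for illustration.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased word to {doc_id: [word positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

def near(index, a, b, distance=1):
    """Return ids of docs where words a and b occur within `distance` positions."""
    postings_a = index.get(a, {})
    postings_b = index.get(b, {})
    hits = set()
    # Only documents containing both words can match.
    for doc_id in postings_a.keys() & postings_b.keys():
        if any(abs(pa - pb) <= distance
               for pa in postings_a[doc_id]
               for pb in postings_b[doc_id]):
            hits.add(doc_id)
    return hits

# Hypothetical sample documents.
docs = {
    1: "Albert Einstein worked on quantum theory",
    2: "Alfred met Albert long before Einstein arrived",
}
index = build_index(docs)
adjacent = near(index, "albert", "einstein", distance=1)
print(adjacent)
```

Storing positions rather than just document ids is what makes distance operators possible; a plain boolean AND needs only the set intersection.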

The Joda kernel is written with the Free Pascal compiler (http://sourceforge.net/projects/freepascal/). The tools are written in Perl. There are libraries for using Joda directly from C, Perl, Python and PHP, all published under the LGPL. The Joda binaries are: a command-line program, a TCP socket-driven server and a CGI program.

Lucene-search

Lucene is a text search engine written in Java, sponsored by the Apache project.

A Lucene-based search server is now up and running experimentally to cover searches on the English Wikipedia. It is compiled with GCJ, so it is free software and does not rely on Sun's Java VM.

Using a separate search server like this instead of MySQL's fulltext index lets us take some load off the main databases.

To compare our options Brion did an experimental port to C# using dotlucene; some benchmarking showed that while the C# version running on Mono outpaced the Java version on GCJ for building the index, Java+GCJ did better on actual searches (even surpassing Sun's Java in some tests). Since searches are more time-critical (as long as updates can keep up with the rate of edits), we'll probably stick with Java.

More information on this implementation can be found on the Wikitech LiveJournal and at User:Brion VIBBER/MWDaemon

At the moment the drop-down suggest-while-you-type box is disabled as GCJ and BerkeleyDB Java Edition really don't get along. Brion has said that he will either hack it to use the native library version of BDB or just rewrite the title prefix matcher to use a different backend.
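As a rough illustration of what a title prefix matcher does, the sketch below keeps titles in a sorted in-memory list and uses binary search to find the first candidate. This is only one possible backend, not the one used on the live site; the function name `prefix_matches`, the `limit` parameter, and the sample titles are assumptions, and matching here is case-sensitive.

```python
import bisect

def prefix_matches(sorted_titles, prefix, limit=10):
    """Return up to `limit` titles that start with `prefix`.

    `sorted_titles` must already be sorted; matching is case-sensitive.
    """
    # All titles sharing the prefix form one contiguous run in sorted order.
    start = bisect.bisect_left(sorted_titles, prefix)
    matches = []
    for title in sorted_titles[start:start + limit]:
        if not title.startswith(prefix):
            break
        matches.append(title)
    return matches

# Hypothetical sample titles.
titles = sorted(["Albert Einstein", "Albania", "Alberta", "Algebra", "Berlin"])
suggestions = prefix_matches(titles, "Alb")
print(suggestions)
```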

Here are some step-by-step instructions on how to install this kind of search on a wiki.

Solr

http://incubator.apache.org/solr
A Lucene-based search server with XML/HTTP interfaces, caching, replication and a web admin interface.
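Because the interface is plain HTTP, a client only needs to build a URL. The sketch below assembles a query URL using Solr's standard `q`, `start` and `rows` parameters on a `select` handler; the host, port and base path are assumptions for a local test install, and the helper name `solr_select_url` is hypothetical.

```python
from urllib.parse import urlencode

def solr_select_url(base_url, query, rows=10, start=0):
    """Build a search URL for a Solr select handler under `base_url`."""
    params = urlencode({"q": query, "start": start, "rows": rows})
    return f"{base_url}/select?{params}"

# Assumed location of a local Solr install; adjust for a real deployment.
url = solr_select_url("http://localhost:8983/solr", "red wine")
print(url)
```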

DBSight

http://www.dbsight.net/
J2EE application
Database + Lucene + Display Template, with Scheduler
Scalable, online demo http://search.dbsight.com holds 1.2G data, 1.7 million records
Works on live systems, new or old legacy systems, without changing existing code.
Customizable crawl, customizable indexing, customizable searching, customizable results templates

Pylucene

http://pylucene.osafoundation.org/
Can be GCJ-compiled, which avoids the "non-free" Java issue above.

Plucene

Perl port of Lucene
http://search.cpan.org/perldoc?Plucene

Google Search Appliance

Hardware box made by Google
http://www.google.com/enterprise/gsa/index.html
Proprietary, closed-source, etc.
- but we may be able to receive this gratis.
- but Kate says: "the current situation appears to be that non-free software is not allowed, but software contained on other embedded devices is okay (e.g. switch firmware). given this i don't think there would be an issue with using one of the google devices." (wikitech-l Wed, 30 Mar 2005 08:08:16) gmane official archives
- According to Google, the basic single-slot GSA handles only 500,000 documents (but can be licensed to search up to 1.5 million documents at a rate of 300 queries per minute). For perspective, here are the totals for each of the English projects hosted by Wikimedia:

Project name	Number of pages
Wikipedia	1,487,491
Wiktionary	81,836
Wikibooks	23,643
Wikiquote	7,397
Wikisource	26,565
Wikinews	7,396
Wikispecies	4,906
Commons	100,207
Meta	15,092
sep11	1,627
Grand total:	1,756,160

Note that that's just English! I have not gathered stats on any of the other large languages. Considering that there are several languages with six-digit article counts, the total number of pages hosted by Wikimedia could easily be triple or quadruple this number! I hope Google is willing to give you more than just hosting. --Astronouth7303 20:16, 10 Apr 2005 (UTC)
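The capacity comparison can be checked with a few lines of arithmetic. The sketch below copies the page counts from the table above and compares them with the two GSA limits quoted from Google; treating one wiki page as one GSA "document" is an assumption, as are the names `fits`, `BASIC_LIMIT` and `LICENSED_LIMIT`.

```python
# Page counts copied from the table above (English projects only).
pages = {
    "Wikipedia": 1_487_491,
    "Wiktionary": 81_836,
    "Wikibooks": 23_643,
    "Wikiquote": 7_397,
    "Wikisource": 26_565,
    "Wikinews": 7_396,
    "Wikispecies": 4_906,
    "Commons": 100_207,
    "Meta": 15_092,
    "sep11": 1_627,
}

BASIC_LIMIT = 500_000       # basic single-slot GSA, per the text above
LICENSED_LIMIT = 1_500_000  # maximum licensed capacity, per the text above

def fits(count, limit):
    """True if `count` pages fit within `limit` documents (1 page = 1 document assumed)."""
    return count <= limit

total = sum(pages.values())
too_big_for_basic = [name for name, n in pages.items() if not fits(n, BASIC_LIMIT)]

print("Projects over the basic limit:", too_big_for_basic)
print("English Wikipedia alone fits the licensed limit:",
      fits(pages["Wikipedia"], LICENSED_LIMIT))
print("All English projects together fit the licensed limit:",
      fits(total, LICENSED_LIMIT))
```

Under these assumptions the English Wikipedia alone would only just fit on a fully licensed box, and the English projects combined would not.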

Lupy

http://www.divmod.org/projects/lupy

Sphinx

Very fast
Plugs directly into MySQL and PostgreSQL if desired.
Handles some major sites such as ljseek.com (> 100 million records, 120+GB database) and rss-spider.com

http://sphinxsearch.com/

Installation guide : http://www.mediawiki.org/wiki/Extension:SphinxSearch

Zend Framework

Lucene Class of the Zend Framework (http://framework.zend.com/manual/en/zend.search.html).

100% PHP
Lucene Binary Compatible Index

swish-e

Very fast
Easy to set up
Can index almost everything
Differential indexing capabilities
http://swish-e.org

Sphider

An easy-to-install PHP web application on top of MySQL that implements a web spider for indexing and a flexible search page. It will index a complete wiki and can easily replace the built-in search functionality.

Ksana Search For Wikipedia

Ksana Search For Wikipedia (剎那搜尋維基百科) is GPL.

points to consider

efficiency is key
- we already have full text search, but it uses the databases and isn't efficient. any alternative needs to be sufficiently "cheaper" in terms of hardware to make it worthwhile

http://www.google.com/search?q=site:en.wikipedia.org+&q=search
- we can link to google for free.
- not as fresh, as google won't update as often as wikipedia does
- not 100% coverage
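Linking out to Google needs nothing more than a URL with a `site:` restriction in the query. The sketch below builds such a link; the helper name `google_site_search_url` is hypothetical, and it puts the restriction and the search terms into a single `q` parameter rather than the two `q` parameters shown in the URL above.

```python
from urllib.parse import urlencode

def google_site_search_url(query, site="en.wikipedia.org"):
    """Build a Google search URL restricted to `site`."""
    params = urlencode({"q": f"site:{site} {query}"})
    return "http://www.google.com/search?" + params

link = google_site_search_url("red wine")
print(link)
```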

do we want to be able to search across older versions / diffs?
- if yes, this content should probably not be searched by default. That is, the default should be to search only the current content

can we take the index off-line when we need to update entries?
- swish-e 2.2.0 now supports this feature, and Lucene does as well

do we want to update the index in small chunks (e.g. if only a single file has changed)?
- swish-e can do this but it's somewhat hackish (you would use multiple indexes), while Lucene is designed for this.
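To show what updating in small chunks means in practice, here is a minimal in-memory sketch in which re-indexing one changed page touches only that page's entries. It does not reflect how swish-e or Lucene store their indexes; the class `InvertedIndex`, its methods, and the sample pages are assumptions for illustration.

```python
from collections import defaultdict

class InvertedIndex:
    """Toy inverted index that supports per-document updates."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> ids of docs containing it
        self.doc_words = {}               # doc id -> words currently indexed

    def remove(self, doc_id):
        """Drop every posting that belongs to `doc_id`."""
        for word in self.doc_words.pop(doc_id, set()):
            self.postings[word].discard(doc_id)
            if not self.postings[word]:
                del self.postings[word]

    def update(self, doc_id, text):
        """Re-index a single document without touching the others."""
        self.remove(doc_id)
        words = set(text.lower().split())
        self.doc_words[doc_id] = words
        for word in words:
            self.postings[word].add(doc_id)

    def search(self, word):
        """Return ids of documents containing `word`."""
        return set(self.postings.get(word.lower(), set()))

index = InvertedIndex()
index.update("Wine", "red wine is made from grapes")
index.update("Beer", "beer is made from barley")
# The Wine page is edited; only its entries are rebuilt.
index.update("Wine", "wine is a drink")
print(index.search("red"), index.search("made"))
```

Keeping a per-document word list is the design choice that makes removal cheap: without it, deleting a page would mean scanning every posting list.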

outstanding question

if we include a summary, like Google, for each result, what should be shown?
- the google style : the section of the document that contains the search terms
- some short meta description of the article
- the first paragraph, or first N words
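The first and third options can be sketched together: show the words around the first matching term, and fall back to the opening words when nothing matches. The function name `snippet`, the `width` parameter, and the punctuation handling are assumptions; a real implementation would also need to strip wiki markup first.

```python
def snippet(text, terms, width=5):
    """Return `width` words on each side of the first word matching any term."""
    words = text.split()
    wanted = {t.lower() for t in terms}
    for i, word in enumerate(words):
        # Ignore surrounding punctuation when comparing.
        if word.lower().strip(".,;:!?\"'()") in wanted:
            start = max(0, i - width)
            end = min(len(words), i + width + 1)
            prefix = "... " if start > 0 else ""
            suffix = " ..." if end < len(words) else ""
            return prefix + " ".join(words[start:end]) + suffix
    # No term found: fall back to the first words of the text.
    return " ".join(words[:2 * width + 1])

# Hypothetical sample text.
text = ("Wine is an alcoholic drink. Most red wine is made from "
        "dark grapes grown in warm regions.")
print(snippet(text, ["red"], width=2))
```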

should titles be given more weighting?
- namely, if I search for the term "red wine", and there are two identical documents, except one contains "red wine" as a section title while another simply mentions it in the text ... should we return the first doc first, or should they be truly equal?
- is text in a title more important than other text
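One simple way to experiment with this question is a scoring function with an adjustable title boost. The sketch below counts term occurrences and multiplies title hits by a factor; the function name `score`, the default boost of 3.0, and the document layout (one `title` and one `body` field, rather than section titles) are assumptions, not a recommendation.

```python
def score(doc, terms, title_boost=3.0):
    """Count term occurrences, weighting hits in the title by `title_boost`."""
    title_words = doc["title"].lower().split()
    body_words = doc["body"].lower().split()
    total = 0.0
    for term in terms:
        term = term.lower()
        total += title_boost * title_words.count(term)
        total += body_words.count(term)
    return total

# Hypothetical documents: one has the terms in its title, one in its text.
in_title = {"title": "Red wine", "body": "a drink made from grapes"}
in_body = {"title": "Grapes", "body": "grapes are used to make red wine"}

query = ["red", "wine"]
print(score(in_title, query), score(in_body, query))
```

Setting `title_boost=1.0` treats title text like any other text, which is the "truly equal" case.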

do we want a page rank style link analysis?
- eg, a wikipedia article that is linked to more often within the context of wikipedia suggests it is more important
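For reference, a PageRank-style calculation over internal links can be sketched in a few lines. This is the textbook power-iteration form, not Google's production algorithm; the function name `pagerank`, the damping factor of 0.85, the iteration count, and the three-page link graph are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages by incoming links.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: A and B both link to C, and C links back to A.
ranks = pagerank({"A": ["C"], "B": ["C"], "C": ["A"]})
print(sorted(ranks, key=ranks.get, reverse=True))
```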

an alternative is length/edit-rank
- articles with more edits, or that are longer, get boosted in the results?

Discussion

Why not find an efficient database solution?
- Because databases aren't the best solution for high-volume free-text search. In the same way, Excel could do tax returns, but in many cases there is much better software for cracking that nut.
I don't agree with that. Keeping the searching as close to the data as possible makes sense, and there are plenty of solutions out there (e.g. tsearch2) that seem efficient enough. Most of them are basically applications that have been joined to the database already, which certainly reduces a step for us.
- tsearch2 is a PostgreSQL feature, afaik. do you have an equivalent thing that works with MySQL?
  - MySQL surprisingly does full-text search. Many PHP-based bulletin boards make use of this. It's certainly convenient, but I don't think it's as powerful or flexible as an external engine like Lucene.
    - We already support MySQL's fulltext search. Its uselessness is mainly what inspired me to write the Lucene support :)

Thunderstone makes a product similar to Google's Search Appliance but it appears to be substantially less expensive. Another option to consider. --TidyCat 14:53, 9 December 2005 (UTC)