CERSEI: Difference between revisions

From Meta, a Wikimedia project coordination wiki

Revision as of 09:30, 7 December 2023

CERSEI screenshot

CERSEI is a tool that can import or scrape third-party data sources. It uses source-specific Python code for each source, and can even use a "headless browser" to scrape complicated websites that rely on, e.g., JavaScript for navigation. It can therefore access data sources that cannot be accessed via, e.g., Mix'n'match. The data from a source can be updated regularly, either in full or only for changed entries (if the source has a "recent changes" equivalent).
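The choice between a full update and a "recent changes" update can be pictured with a minimal sketch. This is not CERSEI's actual scraper code: the function name `select_entries`, the `changed` field, and the sample data are assumptions made for illustration only.

```python
from datetime import datetime

def select_entries(entries, since=None):
    """Pick the entries to (re)import.

    With since=None every entry is returned (full update); otherwise only
    entries whose hypothetical "changed" timestamp is later than `since`.
    """
    if since is None:
        return list(entries)
    return [e for e in entries if e["changed"] > since]

# Hypothetical source listing with per-entry change timestamps.
source_entries = [
    {"id": "a1", "changed": datetime(2023, 11, 1)},
    {"id": "b2", "changed": datetime(2023, 12, 5)},
]

full_update = select_entries(source_entries)
incremental = select_entries(source_entries, since=datetime(2023, 12, 1))
print([e["id"] for e in incremental])  # prints ['b2']
```

A source without a "recent changes" equivalent would simply always be processed with `since=None`.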

CERSEI stores the scraped results in an "extended" Wikibase-compatible JSON format that can be filtered into Wikidata-compatible items, for easier comparison and import. There is an API endpoint with a MediaWiki-compatible path and output format, to allow processing by existing MediaWiki clients.
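As a rough illustration of the filtering step, the sketch below reduces an entry to the top-level keys of a standard Wikibase item in JSON form. How CERSEI actually marks its extensions is not described here, so the `cersei_source` key, the sample entry, and the function name `to_wikidata_item` are assumptions made purely for this example.

```python
# Top-level keys of a standard Wikibase item in JSON form.
WIKIBASE_ITEM_KEYS = {
    "id", "type", "labels", "descriptions", "aliases", "claims", "sitelinks",
}

def to_wikidata_item(extended_entry: dict) -> dict:
    """Return a copy of the entry restricted to standard Wikibase item keys."""
    return {k: v for k, v in extended_entry.items() if k in WIKIBASE_ITEM_KEYS}

# Hypothetical "extended" entry; "cersei_source" stands in for any extra data.
entry = {
    "type": "item",
    "labels": {"en": {"language": "en", "value": "Example person"}},
    "claims": {},
    "cersei_source": {"scraper": 1},
}

item = to_wikidata_item(entry)
print(sorted(item))  # prints ['claims', 'labels', 'type']
```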

Properties can be queried and filtered via a simple syntax to retrieve entries with specific values.

If either the data source or the import/scraper code is updated and generates more details for an entry, the old revision is kept, allowing for reconciliation of just the new entries with Wikidata, and analysis of the changes between revisions.
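The kind of revision analysis described above could look roughly like the following sketch, which lists the properties that differ between two stored revisions of one entry. It assumes each revision is a dict with a Wikibase-style `claims` mapping from property ID to a list of statements; the function name `changed_properties` and the simplified sample statements are hypothetical.

```python
def changed_properties(old_rev: dict, new_rev: dict) -> dict:
    """Map each property ID that differs between revisions to a change type."""
    old_claims = old_rev.get("claims", {})
    new_claims = new_rev.get("claims", {})
    changes = {}
    for prop in sorted(set(old_claims) | set(new_claims)):
        if prop not in old_claims:
            changes[prop] = "added"
        elif prop not in new_claims:
            changes[prop] = "removed"
        elif old_claims[prop] != new_claims[prop]:
            changes[prop] = "modified"
    return changes

# Simplified revisions: statement values are plain strings for brevity.
old = {"claims": {"P31": ["Q5"]}}
new = {"claims": {"P31": ["Q5"], "P569": ["1900-01-01"]}}

print(changed_properties(old, new))  # prints {'P569': 'added'}
```

Only the properties reported here would need to be reconciled with Wikidata again; unchanged ones can be skipped.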

At the moment, CERSEI is not intended to allow matching of entries to Wikidata items from within the tool; it is rather a repository of automatically curated data to be used by other tools. Wikidata matches are either imported from Wikidata, or from the respective data source. A Mix'n'match "bridge" is in place.

Please feel free to suggest more sources to import, or even write a new scraper yourself (example: https://github.com/magnusmanske/cersei/blob/main/src/scrapers/scraper_1.py).