User:Zache/Wikimedia Hackathon 2024: Difference between revisions



* [https://sophox.org/#PREFIX%20wd%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2F%3E%0APREFIX%20wdt%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fdirect%2F%3E%0APREFIX%20sdc%3A%20%3Chttps%3A%2F%2Fcommons.wikimedia.org%2Fentity%2F%3E%0A%23GLAM%20CSI%0ASELECT%20%2a%20WHERE%20%7B%0A%20%20%20%20%20SERVICE%20%3Chttps%3A%2F%2Fimagehash-sparql.wmcloud.org%2Fsparql%3E%20%7B%0A%20%20%20%20%20%20%20SELECT%20%3Fpage%20%3Fphash%20WHERE%20%7B%20%0A%20%20%20%20%20%20%20%20%20%3Fpage%20wdt%3AP9310%20%3Fphash%20%0A%20%20%20%20%20%20%20%7D%20GROUP%20BY%20%3Fpage%20%3Fphash%20LIMIT%20500%0A%20%20%20%20%20%7D%0A%7D%0A%0A List hashes using Federated query]
* [https://w.wiki/9yy9 Find duplicate image pair using hashes]
* [https://w.wiki/9yxf Merge imagehashes to Commons Query service query]
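For readers who do not want to follow the percent-encoded links, the federated query in the first bullet decodes to:

```sparql
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX sdc: <https://commons.wikimedia.org/entity/>
#GLAM CSI
SELECT * WHERE {
     SERVICE <https://imagehash-sparql.wmcloud.org/sparql> {
       SELECT ?page ?phash WHERE {
         ?page wdt:P9310 ?phash
       } GROUP BY ?page ?phash LIMIT 500
     }
}
```

The inner SELECT is executed by the imagehash SPARQL endpoint via a SERVICE clause, so Sophox never needs a local copy of the hash data.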



Revision as of 12:33, 5 May 2024

Improving Wikimedia Commons image hashing

The project idea is to calculate perceptual hashes for Wikimedia Commons images so that it is possible to reliably detect whether a photo is already in Wikimedia Commons and to match photos against photos in other image repositories (Finna, Europeana, Flickr, ...). This will make it possible to update image metadata and image files, and it will also help prevent uploading duplicate images.
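To illustrate the idea of perceptual hashing, here is a minimal average-hash sketch in pure Python. This is not the project's actual hashing code (which hash family it uses is not stated here); it only shows the principle: near-identical images get hashes with a small Hamming distance, so duplicates can be found by comparing hashes instead of pixels.

```python
def average_hash(pixels):
    """Hash a tiny grayscale image (2D list, e.g. an 8x8 downscale).

    Each pixel contributes one bit: 1 if brighter than the mean.
    """
    flat = [v for row in pixels for v in row]
    mean = sum(flat) / len(flat)
    return sum(1 << i for i, v in enumerate(flat) if v > mean)

def hamming_distance(h1, h2):
    """Number of differing bits between two hashes."""
    return bin(h1 ^ h2).count("1")

# Toy 2x2 "images": a, a slightly noisy copy of a, and a mirrored image.
img_a = [[10, 200], [10, 200]]
img_b = [[12, 198], [11, 201]]
img_c = [[200, 10], [200, 10]]

ha, hb, hc = (average_hash(i) for i in (img_a, img_b, img_c))
print(hamming_distance(ha, hb))  # 0 -> near-duplicate pair
print(hamming_distance(ha, hc))  # larger distance -> distinct images
```

In practice a matcher would index the hashes (as done here with P9310) and flag pairs whose distance falls under a chosen threshold.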

Speed improvement

Before the hackathon, the indexing speed was 15,000 images per hour, i.e. about 10 million per month. At that speed, indexing all 100 million Wikimedia Commons photos would take about a year. So, during this hackathon, I moved the indexing code from Toolforge to a virtual server in wmlabs, which tripled the indexing speed to 30M+ photos per month. Indexing is expected to be finished in the summer.
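The figures above can be checked with back-of-the-envelope arithmetic (assuming a 30-day month):

```python
# Old throughput: 15,000 images/hour.
per_month = 15_000 * 24 * 30            # ~10.8 million images/month
total_images = 100_000_000

months_before = total_images / per_month        # ~9.3 months, roughly a year
months_after = total_images / (3 * per_month)   # ~3.1 months after tripling

print(per_month, round(months_before, 1), round(months_after, 1))
```

Tripling the speed in early May thus puts the projected finish about three months out, consistent with the "ready in the summer" estimate.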

Ontop SPARQL

We also installed an Ontop server for querying hashes and duplicate images using SPARQL. This work is still ongoing, but we are currently able to query hashes stored in a PostgreSQL database using SPARQL.
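Ontop works by translating SPARQL into SQL through a mapping file. As a rough sketch of what such a mapping could look like here (the table and column names `image_hashes`, `page_id`, `phash` are hypothetical, not the project's actual schema), an Ontop-style `.obda` mapping exposing PostgreSQL rows as P9310 triples might read:

```
[PrefixDeclaration]
wdt:    http://www.wikidata.org/prop/direct/
sdc:    https://commons.wikimedia.org/entity/

[MappingDeclaration] @collection [[
mappingId   commons-phash
target      sdc:M{page_id} wdt:P9310 {phash} .
source      SELECT page_id, phash FROM image_hashes
]]
```

With a mapping along these lines, the federated queries listed above can be answered directly from the relational data without materializing the triples.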

STATUS: There are still missing pieces in our Ontop SPARQL-to-SQL translation configuration, and the setup is far from practical.