Wikirank/Details

This is an archived version of this page, as edited by Robert Važan (talk | contribs) at 13:53, 8 July 2021 (→‎Prior art: comparison with existing list projects). It may differ significantly from the current version.

This page considers the Wikirank proposal in more detail from various angles.

Prior art

Wikirank differs from previously proposed lists in two ways. Firstly, Wikirank assumes that lists are already easily obtained by querying Wikidata. Wikidata actually has a whole query namespace reserved for lists, just waiting for T67626 to be completed. Wikirank instead focuses on filtering out insignificant items in these lists. Wikidata lists can be sorted too, but all factual sort keys (e.g. sorting things by size) are only trying to approximate desired sorting by relevance to the reader. Wikirank aims to collect and use popularity data, which is a direct measurement of relevance to the average reader. Secondly, Wikirank encourages frequent personalization of lists while showing these personalized lists under the same searchable URL as public lists, which considerably improves usefulness and usability of lists while at the same time encouraging contribution to the public list.

The concept is not new by any measure. Just about every Web 2.0 site allows users to vote on just about everything. Toplists make up a significant portion of web content. Websites like alternativeto.net turn every list item into its own list, sidestepping pitfalls of ontology. It's about time for WMF to introduce its own voting solution that is aligned with Wikimedia movement's principles and goals and integrated with other Wikimedia projects.

Minimal version

The sections below describe Wikirank in detail, but most of the functionality is not required immediately. This section describes a minimal functional version of the project.

The main requirement is a new MediaWiki extension that adds support for vote data. Its simplest implementation saves private lists in a relational database and uses SQL queries to compute public lists. There is no voting history and no timestamp or IP address data. No special privacy measures. No spam filters either. The site can start as English-only. Pages initially contain only the page title and the list itself. List items initially show only item name and popularity. Private lists work from the beginning, but they cannot be shared. Login is required. The minimal version just ranks existing Wikidata Q-items without creating any new items. Sorting by popularity is complemented with a secondary sort key to compensate for low vote count.
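As a rough illustration of how little is needed, the following sketch stores each user's current vote in a relational table and derives a public list with one SQL query. The table layout, the column names, and the tiebreak on Q-item id are assumptions made for this example, not part of the proposal; a real secondary sort key would more likely be a factual property taken from Wikidata.

```python
import sqlite3

# Hypothetical schema: one row per (user, list topic, list item) holding the current vote.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE vote (
        username TEXT NOT NULL,
        class    TEXT NOT NULL,
        instance TEXT NOT NULL,
        value    INTEGER NOT NULL CHECK (value IN (1, -1)),
        PRIMARY KEY (username, class, instance)
    )
    """
)

def cast_vote(username, cls, instance, value):
    """Insert or overwrite the user's current vote; no history is kept."""
    conn.execute(
        "INSERT OR REPLACE INTO vote VALUES (?, ?, ?, ?)",
        (username, cls, instance, value),
    )

def public_list(cls):
    """Items of one list ordered by net votes, then by Q-item id as a placeholder tiebreak."""
    rows = conn.execute(
        """
        SELECT instance, SUM(value) AS score
        FROM vote
        WHERE class = ?
        GROUP BY instance
        ORDER BY score DESC, instance ASC
        """,
        (cls,),
    )
    return rows.fetchall()

cast_vote("alice", "Q123", "Q456", 1)
cast_vote("bob", "Q123", "Q456", 1)
cast_vote("bob", "Q123", "Q789", 1)
cast_vote("alice", "Q123", "Q789", -1)
cast_vote("alice", "Q123", "Q456", 1)  # repeating a vote replaces the existing row
print(public_list("Q123"))
```

A private list is then just the subset of rows belonging to one username, so both kinds of list come from the same table.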

Estimated cost of the minimal version is under 100 man-hours.

Relationship with Wikidata

Every Wikirank page contains a list. Every list has a topic, which is identified by Wikidata Q-item. Every list item is also identified by Wikidata Q-item. Using Wikidata Q-items is the key to Wikirank's ability to merge private lists of its contributors into popularity-sorted public lists. Presence of list item in the list is interpreted as "instance of" relationship, i.e. list topics are classes in Wikidata parlance while individual list items are their instances. This "instance of" relationship does not have to be recorded in Wikidata. Wikirank just expects list items to have properties compatible with list topic. Users are able to add any item to any private list, but public lists will usually filter out incompatible list items.

Wikidata usually does not have classes like "GMail alternative" or "2020 sci-fi movie". If such lists are created in Wikirank, corresponding class Q-items will be created in Wikidata. There is however no need to manually add "instance of" statements to their potential instances. Class "2020 sci-fi movie" would just declare that it is "subset of" "science fiction film (Q471839)" and that it has "publication date" "2020". All Q-items in Wikidata with compatible properties are automatically accepted by Wikirank as possible list items.
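The compatibility check can be pictured as matching an item's statements against the constraints declared on the class. The sketch below uses a made-up in-memory data set instead of Wikidata; the item ids, the property names, and the rule "at least one allowed value per constrained property" are assumptions, and subclass hierarchies are ignored for brevity.

```python
SCI_FI_FILM = "Q471839"  # "science fiction film", as cited in the text

# Hypothetical items; ids and statements are invented for illustration.
items = {
    "item-a": {"instance_of": {SCI_FI_FILM}, "publication_year": {2020}},
    "item-b": {"instance_of": {SCI_FI_FILM}, "publication_year": {1999}},
    "item-c": {"instance_of": {"some-other-class"}, "publication_year": {2020}},
}

# Class definition for "2020 sci-fi movie": constraints instead of explicit members.
class_2020_scifi = {"instance_of": {SCI_FI_FILM}, "publication_year": {2020}}

def is_compatible(item, constraints):
    """True if the item has at least one allowed value for every constrained property."""
    return all(item.get(prop, set()) & allowed for prop, allowed in constraints.items())

candidates = sorted(
    item_id for item_id, statements in items.items()
    if is_compatible(statements, class_2020_scifi)
)
print(candidates)  # ['item-a']
```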

Site and page structure

TODO: UI mocks

Every list is a page. Page URL references Q-item of the list topic (e.g. www.domain.org/wiki/Q12345), so that the URL remains the same regardless of selected language. Page title is a natural language label for the list. Page content is normal wikitext, but it is mostly generated by templates and Lua scripts that exist for every kind of list. The following describes default page layout. Page starts with short summary of list's topic, usually formatted as property-value table. Most space on the page is dedicated to the list. List items are sorted by votes. If votes are scarce, secondary sort key may be used. Every list item has three columns. First column contains vote count, voting buttons, and other Wikirank features. Second column contains summary of the list item, usually formatted as property-value table. Third column contains optional image.

There is no manually added free-form content, because it would make the site language-specific. All information is retrieved from Wikidata and automatically translated using Wikidata language information (lexemes and item labels/descriptions). If there is any free-form text, it must come from Wikidata, for example item title and quotations. Wikirank may synthesize short text from Wikidata statements using techniques similar to Abstract Wikipedia. Additional information may be obtained from Wikipedia and other sources linked from topic and item summaries. Some content, for example Wikipedia links, may be automatically selected to match current language.

List pages may offer several views of the same list. It should be possible to switch between at least two layouts: the above described 3-column layout and plain table view. Table view places vote count and buttons in one of the columns. Further alternative views may include temporal and geographic segments. Subtopics and related topics should generally have their own pages, but current page can link to them prominently.

Upvoting an item moves the item to the top and visually marks it as upvoted. Downvoting an item moves the item to the bottom or completely hides it. Upvoted items are sorted among themselves using public vote count, but there is a way to override that order. Search widget at the end of the list allows users to easily find new items and add them to the list. Some pages may also include a form to add new list item by creating formulaic Q-item in Wikidata. The form shows a list of similar items in order to discourage users from creating duplicate Q-items.
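The ordering described here amounts to a three-group sort. The following sketch assumes public vote counts and the user's own votes are available as plain dictionaries; the function name and data shapes are hypothetical, and the manual override of the order among upvoted items is left out.

```python
def personalized_order(public_counts, my_votes):
    """
    public_counts: dict mapping item -> public vote count
    my_votes: dict mapping item -> 1 (upvoted) or -1 (downvoted); absent means no vote
    Returns items with upvoted ones first, downvoted ones last,
    and each group sorted by public vote count (highest first).
    """
    def key(item):
        vote = my_votes.get(item, 0)
        group = {1: 0, 0: 1, -1: 2}[vote]
        return (group, -public_counts[item], item)

    return sorted(public_counts, key=key)

public_counts = {"Q1": 50, "Q2": 40, "Q3": 30, "Q4": 20}
my_votes = {"Q3": 1, "Q4": 1, "Q1": -1}
print(personalized_order(public_counts, my_votes))  # ['Q3', 'Q4', 'Q2', 'Q1']
```

Hiding downvoted items instead of moving them to the bottom would be a matter of filtering out the last group.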

Besides list pages, there are item pages and category pages. Item pages, which are linked from list items, contain a longer summary, links to lists featuring the item or related to the item (e.g. alternatives), links to sister projects, and additional useful links. Category pages are really just special list pages with list items pointing to other lists. Category pages allow voting on their list items, i.e. other list pages, but they also visually mark and bring to the top those lists in which the user has cast a vote. This makes it easier to find one's private lists. There is one all-encompassing category containing all lists. This top-level category can be used by users to find all of their private lists.

Users can publish their private lists under special URL, e.g. www.domain.org/share/6fR3tZhSHPp1tnUjY6ko. This URL is secret, which allows users to provide access only to people who know the URL. Unsharing the list and sharing it again generates new URL. This allows users to revoke access to their private list. Shared lists may be optionally frozen, so that they do not change after publishing even as user's private list continues to change.
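One possible way to implement secret share URLs is to map a random token to the shared list and discard the token on unsharing. The sketch below keeps the mapping in memory and uses Python's `secrets` module; the class name, method names, and token length are assumptions, and freezing of shared lists is not covered.

```python
import secrets

class ShareRegistry:
    """Hypothetical in-memory mapping between secret tokens and (user, list) pairs."""

    def __init__(self):
        self._by_token = {}
        self._by_list = {}

    def share(self, user, list_id):
        """Create a fresh token; any previous token for the same list stops working."""
        self.unshare(user, list_id)
        token = secrets.token_urlsafe(15)  # 15 random bytes encode to 20 URL-safe characters
        self._by_token[token] = (user, list_id)
        self._by_list[(user, list_id)] = token
        return token

    def unshare(self, user, list_id):
        """Revoke access by forgetting the current token, if there is one."""
        token = self._by_list.pop((user, list_id), None)
        if token is not None:
            del self._by_token[token]

    def resolve(self, token):
        """Return the (user, list) pair for a token, or None if it is unknown or revoked."""
        return self._by_token.get(token)

registry = ShareRegistry()
first = registry.share("WikirankUser7", "Q123")
second = registry.share("WikirankUser7", "Q123")  # unshare and share again
print(registry.resolve(first), registry.resolve(second))
```

Because the old token is deleted rather than marked inactive, a revoked URL is indistinguishable from one that never existed.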

Notability

Wikirank will generally allow creating lists of anything that already meets Wikidata notability policy. Quantitative surveys are generally useful for everything, even abstract concepts and unremarkable objects. While voting on such things as exoplanets seems useless at first, these votes are actually likely to point to exoplanets that are in some way interesting. Similar argument can be made about most other topics. There will be only a few exceptions, for example adding people as items in defamatory lists. Wikirank will tend to extend beyond current Wikidata scope, which is largely defined by what Wikipedia considers notable. This is because, in a sense, Wikirank is a way to discover notable items and to separate notable items from uninteresting ones. It is therefore natural that a large fraction of items in Wikirank lists will not be considered notable by other projects.

Data quality

Wikirank will not attempt to censor "wrong" choices. The idea is to reflect public opinion even if public at large is wrong about something. Clarifying notice may be placed above lists that have high risk of misinterpretation. List items may show additional properties pulled from Wikidata that complement Wikirank's popularity data.

All data added to Wikidata in the course of editing Wikirank pages still has to be factually correct and meet Wikidata policies.

Little attention is paid to which item is ranked #1. Wikirank only aims to separate serious contenders from uninteresting alternatives. Precise order near the top of the list is not important. Users are expected to review the highest ranking items and make their own choices.

Collaboration model

The usual collaboration model on wikis is that everyone edits the same content and disputes are resolved via discussion. Wikirank collaboration model is more subtle. Everyone has their own version of every list, but list items preferred by other contributors are shown underneath user's private list, sorted by popularity, which encourages users to add popular items to their own list. In order to keep their private list clean and useful, users are most likely to upvote unreasonably unpopular items buried near the bottom of the public list and downvote unreasonably popular items that they consider to be clutter. This creates self-regulating system that converges toward interests of the majority while still allowing minority users to adjust the list to their liking.

Wikirank should be home to several kinds of contributors, sorted here by amount of time invested:

  • Supporters do not build lists. They just drop random upvotes here and there to show support.
  • Curators build private lists, mostly by upvoting. They find or create new items to add and cast the first vote for these items.
  • Editors create new lists and make other structural changes to the site as well as non-trivial edits in Wikidata.
  • Developers create templates, Lua scripts, and special lists, which mostly define new list types.

Login is optional, because supporters don't need it. Curators who use the site only temporarily, for example to build a list for immediate use, don't need login either.

Technical details

TODO: MediaWiki extensions, querying and data downloads, integration with sister projects

Schema

Wikirank vote would have the following structure:

username class instance timestamp IP action

For example:

WikirankUser7 Q123 Q456 20210704112554 1.2.3.4 upvote

Fields:

  • username - who cast the vote, username (logged in users) or cookie (logged out users)
  • class - Wikidata Q-item identifying the whole list
  • instance - Wikidata Q-item identifying single list item
  • timestamp - time when the vote was cast
  • IP - IP address, from which the vote was cast
  • action - one of: upvote, downvote, withdraw vote

Timestamps and IP addresses allow for interesting list views (regional segments, trending) and they also help fight spam. Users can opt out of IP address storage, in which case the IP address will be anonymized (last bits zeroed). Doing so will slightly increase the chance that the vote will be filtered by spam filters.
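The record and the IP anonymization step could look roughly as follows. This is a sketch under assumptions: the `Vote` type and the `anonymize_ip` helper are hypothetical names, and the number of address bits kept (24 for IPv4, 48 for IPv6) is an illustrative choice that the proposal does not specify.

```python
import ipaddress
from dataclasses import dataclass

@dataclass(frozen=True)
class Vote:
    username: str   # username or cookie identifier
    cls: str        # Q-item of the list ("class" is a reserved word in Python)
    instance: str   # Q-item of the list item
    timestamp: str  # YYYYMMDDHHMMSS
    ip: str
    action: str     # "upvote", "downvote", or "withdraw"

def anonymize_ip(ip, keep_v4=24, keep_v6=48):
    """Zero all address bits after the kept prefix (prefix lengths are assumptions)."""
    addr = ipaddress.ip_address(ip)
    keep = keep_v4 if addr.version == 4 else keep_v6
    network = ipaddress.ip_network(f"{addr}/{keep}", strict=False)
    return str(network.network_address)

vote = Vote("WikirankUser7", "Q123", "Q456", "20210704112500",
            anonymize_ip("1.2.3.4"), "upvote")
print(vote.ip)  # 1.2.3.0
```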

Vote data is private and visible only to the user. Wikirank will expose it only in aggregate form:

class instance upvotes downvotes filtered

For example:

Q123 Q456 1070 60 145

If the user changes their mind several times, their upvote/downvote is of course counted only once. The filtered field counts users whose votes have been filtered out by spam filters. Where needed, this aggregate data will be also available segmented by time and/or region.
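Counting each user once means that only the latest action per user and list item contributes to the totals. The sketch below derives the aggregate row from a log of actions; the tuple layout, the action name "withdraw", and the decision to count filtered users separately from upvotes and downvotes are assumptions about details the text leaves open.

```python
def aggregate(votes, filtered_users=frozenset()):
    """
    votes: iterable of (username, cls, instance, timestamp, action) tuples,
           with fixed-width YYYYMMDDHHMMSS timestamps so that they sort as strings
    Returns a dict mapping (cls, instance) to upvote/downvote/filtered counts.
    """
    latest = {}
    for username, cls, instance, _timestamp, action in sorted(votes, key=lambda v: v[3]):
        latest[(username, cls, instance)] = action  # later actions overwrite earlier ones

    totals = {}
    for (username, cls, instance), action in latest.items():
        row = totals.setdefault((cls, instance), {"upvotes": 0, "downvotes": 0, "filtered": 0})
        if action == "withdraw":
            continue  # a withdrawn vote counts nowhere
        if username in filtered_users:
            row["filtered"] += 1
        elif action == "upvote":
            row["upvotes"] += 1
        elif action == "downvote":
            row["downvotes"] += 1
    return totals

votes = [
    ("a", "Q123", "Q456", "20210704110000", "upvote"),
    ("a", "Q123", "Q456", "20210704120000", "downvote"),  # user a changed their mind
    ("b", "Q123", "Q456", "20210704110500", "upvote"),
    ("c", "Q123", "Q456", "20210704110600", "upvote"),    # c is rejected by spam filters
    ("d", "Q123", "Q456", "20210704110700", "upvote"),
    ("d", "Q123", "Q456", "20210704130000", "withdraw"),
]
print(aggregate(votes, filtered_users={"c"}))
```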

Performance

Assuming the site grows very popular and amasses one trillion votes, votes will take up a few dozen terabytes. With spam and indexing overhead, up to a petabyte of storage might be consumed. Most data will be historical and inert. Active data will be much smaller. An ingestion rate of up to 100,000 votes/second can be handled by current hardware with a suitable data layout. Query cost should be negligible given suitable data layout and caching.
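The estimate can be checked with back-of-the-envelope arithmetic. The record size of 40 bytes per vote is an assumption chosen for this sketch (a compact binary encoding of the six fields); the vote total, the petabyte ceiling, and the ingestion rate come from the paragraph above.

```python
VOTES = 10**12            # one trillion votes
BYTES_PER_VOTE = 40       # assumption: compact binary record
TB = 10**12
PB = 10**15

raw_tb = VOTES * BYTES_PER_VOTE / TB
overhead_factor = PB / (VOTES * BYTES_PER_VOTE)

PEAK_RATE = 100_000       # votes per second
days_at_peak = VOTES / (PEAK_RATE * 86_400)

print(raw_tb)                  # 40.0 terabytes of raw votes
print(overhead_factor)         # 25.0 times headroom before reaching a petabyte
print(round(days_at_peak, 1))  # 115.7 days of sustained peak ingestion per trillion votes
```

Under these assumptions the petabyte ceiling leaves a factor of 25 for spam, indexes, and replication, so the raw votes are a small part of the budget.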

Privacy

Existing voting systems usually do not offer any guarantees of privacy. Social networks either publish the whole list of voters or at least reveal which friends (connected users) voted for the item. Votes are used to target advertising and they are thus indirectly revealed to advertisers. Wikirank sets a higher standard of privacy by never showing private lists to anyone other than the user.

Wikimedia servers already store some private data in the form of password hashes and IP addresses. Wikirank is different in that most of its data will be private. This is essential for the project to function, because private lists will end up containing sensitive information, e.g. drug lists, medical condition lists, lists related to intimate life. Leaks of this data would erode trust and undermine the project. High standards of privacy protection and data security are therefore essential to the success of the project.

But before considering data protection, let's first consider why the data is needed in the first place. Why can't Wikirank perform spam filtering at the moment of voting and then just store vote statistics? If preventing duplicate votes is a concern, why can't Wikirank use one of the anonymizing counters that perform lossy hashing of IP addresses and/or user names? The reason Wikirank needs to keep detailed vote data is that Wikirank is a curated list service, not just the usual "like" or "thumbs up" service. You cannot curate a list if you cannot see it. You cannot see the list if it is not recorded in the database in full detail. Wikirank might offer anonymized vote casting in the future, but its core function will always be to collect full curated lists.

Users would of course be given control over their personal data. Wikirank would allow users to examine, download, and delete their private data or any part of it. A secure, caution-encouraging procedure would allow migration of private data to another service upon the user's request. An option to minimize data collection would also be provided. This would cause Wikirank to anonymize IP addresses (by zeroing the least significant bits) and timestamps (by rounding) and to keep only the current state of private lists. This option would come with a few downsides: loss of undo and versioning functionality, removal of one's historical votes from older time segments, and a slightly higher chance that one's votes would be flagged by spam filters. Some aspects of this opt-out functionality would also be activated automatically where required by privacy laws.
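Timestamp anonymization could be as simple as dropping the fine-grained components. The sketch below rounds down to the start of the hour or day; the helper name and the choice of granularities are assumptions, since the text only says that timestamps are rounded.

```python
from datetime import datetime

FORMAT = "%Y%m%d%H%M%S"  # the YYYYMMDDHHMMSS layout used in the vote schema

def round_timestamp(ts, granularity="day"):
    """Round a timestamp down to the start of its hour or day (granularity is an assumption)."""
    moment = datetime.strptime(ts, FORMAT)
    if granularity == "hour":
        moment = moment.replace(minute=0, second=0)
    elif granularity == "day":
        moment = moment.replace(hour=0, minute=0, second=0)
    else:
        raise ValueError(f"unsupported granularity: {granularity}")
    return moment.strftime(FORMAT)

print(round_timestamp("20210704112554"))          # 20210704000000
print(round_timestamp("20210704112554", "hour"))  # 20210704110000
```

Coarser rounding gives stronger protection but makes trending views and time segments less precise for the users who opt in.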

Protecting private data for several decades is challenging. Servers and processes touching raw Wikirank data would have to meet security standards that are not needed in other Wikimedia projects. Developers and administrators would not have access to private lists. Site administrators would only be able to examine summary statistics when flagging suspicious votes. Spam filters would have full access to raw data, but they would run in an isolated environment, unable to communicate with the outside world.

Private lists of logged out users are tied to browser cookie rather than IP address. Besides being a good match for user's expectations of site behavior, this prevents accidental leaks of private lists when user's IP address is rotated by their ISP.

Since the first votes on a list tend to be correlated with public activities like creating the list page, its Q-item, and Q-items for all list items, the first vote for most items can be deanonymized. This is particularly problematic on fringe lists with naturally low vote count. Wikirank will use two mechanisms to protect users against such deanonymization. Firstly, counting of new votes will be randomly delayed for long enough to make it unclear who cast the vote. Secondly, Wikirank will by default hide the vote count of list items with too few votes (say, fewer than 3). Users will be asked for permission before their vote is shown on such items. It is expected that people interested in curation more than privacy will grant the permission. The minimum vote count will be randomized a bit to prevent active deanonymization by casting fake votes.
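The second mechanism might be sketched as a per-item threshold that varies unpredictably between items but stays stable for each item. Here the threshold is derived from a keyed hash of the item id; the secret, the base value of 3, the jitter range, and the function names are all assumptions, and the permission prompt and delayed counting are not modelled.

```python
import hashlib
import hmac

SERVER_SECRET = b"hypothetical-server-side-secret"  # placeholder, never a real key

def hide_threshold(item_id, base=3, jitter=2):
    """Per-item minimum vote count between base and base + jitter, stable for a given item."""
    digest = hmac.new(SERVER_SECRET, item_id.encode(), hashlib.sha256).digest()
    return base + digest[0] % (jitter + 1)

def displayed_count(item_id, count):
    """Return the count to show, or None when the count is hidden."""
    return count if count >= hide_threshold(item_id) else None

print(displayed_count("Q456", 2))   # None: below the lowest possible threshold
print(displayed_count("Q456", 50))  # 50
```

Deriving the threshold from a secret rather than drawing a fresh random number on every page view matters: a threshold that changed between requests could be averaged out by repeated observation.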

Spam filters

Early versions of Wikirank will run without spam filter, because the initial site will not be popular enough to attract spammers. Spam filters should be developed in response to actual spamming strategies observed in practice. This section provides some hints on how such spam filters might work.

Spam filters work on the user level, not the vote level, because a single user is either a spammer or a regular user. Decisions about individual users, however, cannot be made manually, because privacy protection means nobody can look at a user's private lists. Users are instead automatically scored using a number of trust signals, and votes from users below some trust threshold are ignored. Trust signals, positive and negative, include:

  • casting a suspicious vote (see below)
  • activities other than voting (editing, activities on sister projects)
  • passive technical signals: User-Agent, presence of cookies, etc.
  • passing or failing an occasional CAPTCHA (only when trust is low)
  • voting patterns similar to other users with high/low trust

Votes can be flagged as suspicious, but this is not done individually per vote. Administrators can instead review suspicious activity, especially temporal and spatial spikes, and flag entire sets of votes as suspicious.

Since no signal is reliable, only total trust score matters. Scoring model is trained on hard signals (CAPTCHA, non-trivial edits, bot User-Agent, known fraud cases).
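To make the scoring idea concrete, the sketch below sums weighted signals into one trust score per user and ignores votes from users below a threshold. The signal names, weights, and threshold are invented for illustration; the text envisions a model trained on hard signals, which this hand-weighted sum only stands in for.

```python
# Hypothetical weights; a trained model would replace these hand-picked numbers.
WEIGHTS = {
    "suspicious_vote": -2.0,
    "non_voting_activity": 1.5,
    "bot_user_agent": -3.0,
    "captcha_passed": 2.0,
    "captcha_failed": -2.0,
}
TRUST_THRESHOLD = 0.0  # assumption: users scoring below zero are ignored

def trust_score(signals):
    """signals: dict mapping signal name -> number of times it was observed."""
    return sum(WEIGHTS[name] * count for name, count in signals.items())

def counted_users(signals_by_user):
    """Users whose votes are counted, i.e. whose total score reaches the threshold."""
    return sorted(
        user for user, signals in signals_by_user.items()
        if trust_score(signals) >= TRUST_THRESHOLD
    )

signals_by_user = {
    "editor": {"non_voting_activity": 3},
    "newcomer": {},
    "bot": {"bot_user_agent": 1, "suspicious_vote": 4},
    "flagged_but_human": {"suspicious_vote": 1, "captcha_passed": 1},
}
print(counted_users(signals_by_user))  # ['editor', 'flagged_but_human', 'newcomer']
```

Note how a single negative signal does not exclude a user by itself: the flagged user who passes a CAPTCHA ends up at the threshold and is still counted.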

Bugs and flaws in spam filters do not cause permanent damage. All filtering scores are periodically recalculated in order to reflect improvements in spam filter logic. Some information may be discarded for privacy reasons. Such information loss events should leave behind a flag, so that spam filters know that some data is missing.