Wikidata:Dataset Imports/Lichess player IDs

Guidelines for using this page

Documenting the import

  • Guidelines on how to import a dataset into Wikidata are available at Wikidata:Data Import Guide.
  • Please include notes on all steps of the process.
  • Once a dataset has been imported into Wikidata, please edit the page to change the progress status from in progress to complete.
  • It is strongly recommended to use Visual Editor when making changes to this page, particularly for editing any of the tables.

Creating a Wikidata item for the dataset

  • Please create a Wikidata item for the dataset. This will allow us to improve the coverage of datasets on Wikidata and to understand which datasets are available on that topic and which of them have been added to Wikidata.
  • If you are working with a very large dataset, you can break it into smaller Mix'n'match catalogues, but only create one Wikidata item.
  • Link the dataset Wikidata item to this page using Wikidata Dataset Imports URL (P5195).

Getting help

  • If your dataset import runs into issues, please edit the page to change the progress status from in progress to help needed.
  • You can ask for help on Wikidata:Project chat.

Overview

Dataset name

Lichess player IDs

Source

lichess.org

Dataset description

Usernames of Lichess players with a rating of at least 2200, together with the full names they provide on their profiles

Additional information

I used a dump of the games that included only players rated at least 2200 (link). Using the Unix tools sed, sort and uniq, I extracted the usernames and removed all duplicates. I wrote a crawler in C (link, thanks to user:madmaurice for help fixing memory leaks) that queries the Lichess API for a given set of usernames. I extracted the full names of all users from the API results. I used OpenRefine for reconciliation and QuickStatements for upload.
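The username extraction step can be illustrated with a small Python sketch. It is not the sed, sort and uniq pipeline that was actually run; it only assumes that the dump is in PGN format, where each game carries `[White "..."]` and `[Black "..."]` header lines. The function name and the sample usernames are made up for illustration.

```python
import re

# PGN header tags that carry the two player names of a game.
# The space after White/Black keeps tags such as WhiteElo from matching.
PLAYER_TAG = re.compile(r'^\[(?:White|Black) "([^"]+)"\]\s*$')


def extract_usernames(pgn_lines):
    """Return the sorted, de-duplicated usernames found in PGN header lines."""
    usernames = set()
    for line in pgn_lines:
        match = PLAYER_TAG.match(line)
        if match:
            usernames.add(match.group(1))
    return sorted(usernames)


# Hypothetical sample data, two games with one shared player.
sample = [
    '[Event "Rated Blitz game"]',
    '[White "alice_example"]',
    '[Black "bob_example"]',
    '[WhiteElo "2210"]',
    '',
    '[White "bob_example"]',
    '[Black "carol_example"]',
]
print(extract_usernames(sample))
```

Collecting the names in a set plays the role of `sort | uniq`: each username is kept once, no matter how many games it appears in.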

Progress of import

before reconciliation

I have successfully run the crawler but have not imported anything to Wikidata yet. The dataset contains 4036 players who provided their full names on their profiles. Some of these names are in another script, such as Cyrillic or Arabic, which I am not checking for. Some of the players don't have a Wikidata item yet, and no new items will be created from this dataset. I'd expect the share of reconcilable items to be between 10% and 20%, so at least 400 items should have a Lichess player ID after this import is finished.

reconciliation

813 items have been reconciled automatically (at least 95% match of label, sport (P641) chess (Q718)). Some manual matching has been done on arbitrarily chosen items to scan for systematic biases that were introduced by the reconciliation method.

biases, inaccuracies

A systematic bias is introduced by this reconciliation method that disadvantages Spanish names and names in non-Latin scripts. For Spanish names, this is because people tend to specify only the paternal name, while the Wikidata label usually includes both the paternal and the maternal name. I don't see a way to fix this issue without reducing the confidence in the matches or doing a lot of manual matching.

Wrong data might be imported if players erroneously provide the name of another chess player on Lichess. I cannot see how this could be avoided. The imported data only contains accounts that are not closed, that had a rating of at least 2200, and that have therefore played a decent number of games on the site. Lichess bans accounts that fraudulently impersonate other players. It can therefore be expected that the number of mismatches is very low.
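The effect of a strict label threshold on names with a missing maternal surname can be shown with a short Python sketch. It uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure; this is an assumption for illustration and not necessarily how OpenRefine scores candidates. The names are hypothetical.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.95  # mirrors the 95% label-match cut-off used for automatic matches


def label_similarity(profile_name, wikidata_label):
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, profile_name.lower(), wikidata_label.lower()).ratio()


def is_auto_match(profile_name, wikidata_label):
    """True if the two names are similar enough to be matched automatically."""
    return label_similarity(profile_name, wikidata_label) >= THRESHOLD


# Hypothetical example: paternal surname only vs. paternal + maternal surname.
print(label_similarity("Juan Garcia", "Juan Garcia Lopez"))
print(is_auto_match("Juan Garcia", "Juan Garcia Lopez"))
print(is_auto_match("Juan Garcia", "Juan Garcia"))
```

With this measure the shortened name falls clearly below the threshold even though it is a plausible match, which is the kind of systematic miss described above. Lowering the threshold would catch it, but at the cost of more false positives.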

Edit history

The edits have been made with QuickStatements, temporary batch 1609068825745.
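For illustration, statements for such a batch could be generated along these lines. This is a sketch, not the contents of the actual batch: it assumes the tab-separated QuickStatements V1 syntax (item, property, value, with string values in double quotes) and only adds Lichess username (P8976) without references. The function name and the username are placeholders, and Q4115189 is the Wikidata sandbox item.

```python
def quickstatements_rows(matches):
    """Yield one tab-separated QuickStatements V1 command per (QID, username) pair."""
    for qid, username in matches:
        # String values are wrapped in double quotes in the V1 syntax.
        yield f'{qid}\tP8976\t"{username}"'


# Placeholder pair: the Wikidata sandbox item and a made-up username.
matches = [("Q4115189", "example_user")]
print("\n".join(quickstatements_rows(matches)))
```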

Discussion of import

Queries and expected results

Query link: https://query.wikidata.org/#SELECT%20DISTINCT%20%3Fitem%20%3FitemLabel%20%3Flichess_url%20%3FtitleLabel%20%3Ffide_url%20WHERE%20%7B%0A%20%20%23%20Lichess%20URL%0A%20%20%20%20%3Fitem%20wdt%3AP8976%20%3Flichess.%0A%20%20%20%20wd%3AP8976%20wdt%3AP1630%20%3Flichess_format.%20%0A%20%20%20%20BIND%28IRI%28REPLACE%28%3Flichess_format%2C%20%27%5C%5C%241%27%2C%20%3Flichess%29%29%20AS%20%3Flichess_url%29%20.%20%0A%20%20%23%20Fide%20URL%0A%20%20%20%20%3Fitem%20wdt%3AP1440%20%3Ffide.%0A%20%20%20%20wd%3AP1440%20wdt%3AP1630%20%3Ffide_format.%20%0A%20%20%20%20BIND%28IRI%28REPLACE%28%3Ffide_format%2C%20%27%5C%5C%241%27%2C%20%3Ffide%29%29%20AS%20%3Ffide_url%29%20.%20%0A%20%20%23%20Title%0A%20%20%20%20OPTIONAL%20%7B%0A%20%20%20%20%20%20%3Fitem%20wdt%3AP2962%20%3Ftitle.%0A%20%20%20%20%7D%0A%20%20%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cde%22.%20%7D%0A%7D%20ORDER%20BY%20%3Ftitle
Description: Lichess username (P8976)
Expected results: This query only shows players that already have a FIDE ID. No new items should be created by this batch, so most items will probably stem from FIDE imports. In the end there should be ~830 results.
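The query link above stores the SPARQL text percent-encoded after the `#`. A minimal Python sketch for reading such a link and for building a request URL for the query service's SPARQL endpoint follows. The helper names are made up, the example link is a shortened stand-in rather than the full query, and no request is actually sent.

```python
from urllib.parse import quote, unquote

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def sparql_from_link(link):
    """Return the decoded SPARQL text stored after '#' in a query-service link."""
    _, _, fragment = link.partition("#")
    return unquote(fragment)


def endpoint_url(sparql):
    """Build a GET URL that asks the SPARQL endpoint for JSON results."""
    return f"{SPARQL_ENDPOINT}?format=json&query={quote(sparql, safe='')}"


# Shortened stand-in for the link above (assumption, not the real query).
link = (
    "https://query.wikidata.org/#SELECT%20%3Fitem%20WHERE%20%7B%20"
    "%3Fitem%20wdt%3AP8976%20%3Flichess.%20%7D"
)
query = sparql_from_link(link)
print(query)
print(endpoint_url(query))
```

Decoding the link this way makes it easier to review the query text, for example to confirm that it requires both Lichess username (P8976) and a FIDE ID (P1440) on each item.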

Schedule of new data released