User:Magnus Manske/Large Datasets

This page is intended as a status overview for my Large Datasets project, as alluded to in my blog entry.

To help fix problems that were found when synchronizing these datasets with Wikidata, check out the reports.

Rationale

Mix'n'match holds many third-party datasets (referred to as "catalogs"), but it is not well suited to truly large datasets with millions of entries. Wikidata should still make use of such data where it is available. I am therefore creating tables in a database on ToolForge and importing such datasets from various sources, so that tools and bots have an easy-to-access resource.

Datasets

  • BNF
    • Persons only, for now
    • 1,637,195 entries (2017-10-24)
    • 308,533 of those matched to Wikidata (2017-10-24)
  • VIAF
    • 30,676,161 entries (2017-10-24)
    • 1,436,343 of those matched to Wikidata (2017-10-24)
  • GND
    • 14,168,197 entries (2017-10-24)
    • 591,037 of those matched to Wikidata (2017-10-24)
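
To put the counts above in proportion, the matched share of each dataset can be computed directly. This is only a small illustrative sketch: the figures are copied from the list above (as of 2017-10-24), while the `DATASETS` and `match_rate` names are made up for this example.

```python
# (total entries, entries matched to Wikidata), as listed above for 2017-10-24.
# The dictionary and function names are illustrative, not part of the project.
DATASETS = {
    "BNF": (1_637_195, 308_533),
    "VIAF": (30_676_161, 1_436_343),
    "GND": (14_168_197, 591_037),
}


def match_rate(entries, matched):
    """Return the fraction of dataset entries that are matched to Wikidata."""
    return matched / entries


for name, (entries, matched) in DATASETS.items():
    unmatched = entries - matched
    print(f"{name}: {match_rate(entries, matched):.2%} matched, {unmatched:,} unmatched")
```

In each of the three datasets, the large majority of entries are still unmatched.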

Actions performed

  • Associated BNF person entries with Wikidata items, via Wikidata BNF (secondary: VIAF) statements
  • Associated VIAF entries with Wikidata items, via Wikidata VIAF (and several secondary) statements
  • Associated GND entries with Wikidata items, via Wikidata GND (secondary: VIAF) statements
  • Used VIAF entries to add missing VIAF and third-party IDs to associated Wikidata items
  • Used GND entries to add missing GND and VIAF IDs to associated Wikidata items

Tech

  • Each dataset table is dropped and re-created on import by a script. This allows clean re-imports when the import scripts improve or updated data dumps become available.
  • Any ToolForge tool should be able to access the database as s51434__mixnmatch_large_catalogs_p
    • catalog table contains meta-data about the datasets
    • thing table contains secondary data (e.g., place names) for primary datasets that can be matched to Wikidata
    • report table contains mismatches and other events that occurred during Wikidata sync operations. The plan is to write an interface for them at some point.
    • bad_gnd table contains GND IDs that were marked (in Wikidata or during matching) as unsuitable
    • In the primary data tables,
      • ext_id is the primary identifier in that dataset (e.g. VIAF ID in the viaf table)
      • q is the known Wikidata item (integer, without 'Q'), or null
  • The scripts I use for import and Wikidata syncing are on BitBucket (under manual_lists/large_catalogs)
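
As a rough illustration of the table layout described above, the following sketch uses an in-memory SQLite database as a stand-in for the ToolForge database. Only the `viaf` table name and the `ext_id` and `q` columns come from the description; the column types, the helper function names, and the sample rows are assumptions made for this example, and the real import scripts may work quite differently.

```python
import sqlite3


def recreate_table(conn, table):
    """Drop and re-create a dataset table, so that a re-import starts clean."""
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    conn.execute(f"DROP TABLE IF EXISTS {table}")
    # Assumed schema: only ext_id and q are described; the types are guesses.
    conn.execute(f"CREATE TABLE {table} (ext_id TEXT PRIMARY KEY, q INTEGER NULL)")


def item_id(q):
    """Turn the stored integer (without 'Q') into a Wikidata item ID, or None."""
    return None if q is None else f"Q{q}"


def matched_items(conn, table):
    """Return (ext_id, item ID) pairs for entries with a known Wikidata item."""
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    rows = conn.execute(
        f"SELECT ext_id, q FROM {table} WHERE q IS NOT NULL ORDER BY ext_id"
    )
    return [(ext_id, item_id(q)) for ext_id, q in rows]


def unmatched_ids(conn, table):
    """Return the ext_id values that have no Wikidata item yet (q is null)."""
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    rows = conn.execute(
        f"SELECT ext_id FROM {table} WHERE q IS NULL ORDER BY ext_id"
    )
    return [ext_id for (ext_id,) in rows]


# Stand-in database with placeholder rows (not real VIAF data).
conn = sqlite3.connect(":memory:")
recreate_table(conn, "viaf")
conn.executemany(
    "INSERT INTO viaf (ext_id, q) VALUES (?, ?)",
    [("1001", 42), ("1002", None)],
)

print(matched_items(conn, "viaf"))
print(unmatched_ids(conn, "viaf"))
```

A tool running on ToolForge would issue comparable queries against s51434__mixnmatch_large_catalogs_p instead of the SQLite stand-in; the connection details are left out here because they are not given above.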