Wikidata:Property proposal/OpenSanctions ID


OpenSanctions ID

Originally proposed at Wikidata:Property proposal/Organization

Description: identifier of persons, companies, luxury vessels of political, criminal, or economic interest at www.opensanctions.org
Represents: OpenSanctions (Q110087116)
Data type: External identifier
Domain: persons, organizations, companies, political parties, luxury vessels
Example 1: Birgit Honé (Q20606538) → eu-cor-2030915
Example 2: André Viola (Q2848794) → eu-cor-2032180
Example 3: Semion Mogilevich (Q471862) → Q471862
Example 4: Irbis Air Company (Q3396960) → ch-seco-4058
Example 5: Илья Романович Абрамович (new) → rupep-person-16554
Example 6: Pavel Evgenyevich PRIGOZHIN (new) → NK-knteb2sJu9NwaJ79aK6D9T
Source: https://www.opensanctions.org/datasets
Number of IDs in source: 167,028 on 24.01.2022; over 204,469 on 14.04.2022 (ref https://opensanctions.org/datasets/default/)
Expected completeness: eventually complete (Q21873974)
Formatter URL: https://www.opensanctions.org/entities/$1
Robot and gadget jobs: import from OpenSanctions; 53% already have WD ID
Distinct-values constraint: yes
Wikidata project: Wikidata:WikiProject_Organizations, Wikidata:WikiProject_Companies
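
To illustrate how the formatter URL and the example identifiers fit together, here is a minimal Python sketch. The function name `entity_url` is hypothetical, and percent-encoding the identifier is an assumption added as a precaution; the proposal itself only specifies the `$1` placeholder.

```python
from urllib.parse import quote

# Formatter URL as given in the proposal; $1 is replaced by the identifier.
FORMATTER_URL = "https://www.opensanctions.org/entities/$1"


def entity_url(identifier: str) -> str:
    """Build an entity link by substituting the identifier for $1.

    Percent-encoding is an assumption made here as a precaution; the
    example identifiers contain only letters, digits, and hyphens, so
    they pass through unchanged.
    """
    return FORMATTER_URL.replace("$1", quote(identifier, safe=""))


# Example identifiers taken from the proposal.
for ident in ["eu-cor-2030915", "Q471862", "ch-seco-4058", "NK-knteb2sJu9NwaJ79aK6D9T"]:
    print(entity_url(ident))
```

Note that Example 3 uses the Wikidata QID itself as the OpenSanctions identifier, which is the case the discussion below turns on.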

Motivation

OpenSanctions is a database of persons and companies of political, criminal, or economic interest. The project has received financial support from the German Federal Ministry for Education and Research (Bundesministerium für Bildung und Forschung, BMBF). The data sources are listed at https://www.opensanctions.org/datasets. RShigapov (talk) 15:48, 24 January 2022 (UTC)

Discussion

Hi! This is Friedrich, I'm the principal maintainer of OpenSanctions.org. Since I'm still somewhat new to WD (and the social processes around it), please bear with me. The weakness (to me) of adding OpenSanctions IDs to WD is that we're at the same time trying to converge on QIDs for entities where available, so the "OpenSanctions ID" would only ever make sense in scenarios where our database is incomplete/unreconciled. Maybe the better thing for us (at OpenSanctions) to do would be to add a form on each entity that's not yet identified with a QID, letting people submit whatever WD item they think the entity corresponds to, so that we can then rewrite the ID on OpenSanctions to match. --OpenSanctions (talk) 14:26, 26 January 2022 (UTC)

  • I like the idea with a form! RShigapov (talk) 08:25, 27 January 2022 (UTC)
  • I also like that idea. But @OpenSanctions: there are 2 reasons to also have this external-ID: 1. Do you believe all your entities should be created in WD (I do), and will be created in a short period of time (I don't)? 2. Even if you use WD for ALL of your entities, this external-ID will say which WD entities are in OpenSanctions, and provide a link from WD to OpenSanctions. We have a precedent on WD: Altmetric DOI (P5530) is a subset of DOI, but was created as a separate property (despite the fact that "DOI" already has over 10 formatters to various external sites) to indicate which papers have altmetrics, and because altmetrics are a new, distinct piece of info, so somehow "more valuable" than the other alternative formatters.
  • You've done an amazingly good job with OpenSanctions, and your reuse of external identifiers and your efforts to link to WD are much appreciated. But a link from WD to OpenSanctions is also valuable, and will increase the positive exposure of OpenSanctions significantly. WD has maybe 6-7k external-ID properties (links to external datasets), so it is the world's most significant coreferencing hub. --Vladimir Alexiev (talk) 11:28, 14 April 2022 (UTC)
  • If you agree with these arguments (and my "support" arguments below), please vote "support". --Vladimir Alexiev (talk) 11:28, 14 April 2022 (UTC)

As an alternative, what I would like to suggest is an "OpenSanctions Dataset" property. This could be used to show which datasets in the OpenSanctions corpus a particular Wikidata item is part of (e.g. `us_ofac`, `ch_seco_sanctions`, and `sanctions`). One item (e.g. "Saddam Hussein") could be part of many OpenSanctions datasets. We could upload such claims as part of our ETL pipeline in the future for all OpenSanctions entities using Wikidata QIDs. It would allow users to see a) that there are additional details about an entity on OpenSanctions and b) that they are, according to us, sanctioned or in some other way a person/entity of interest. --OpenSanctions (talk) 14:26, 26 January 2022 (UTC)

  • In principle, that can be modelled without an extra property. Items representing the OpenSanctions datasets can be created (or maybe they already exist). Then, to any entity from a sanctions list, you could add a part of (P361) statement pointing to those items. You could additionally add a reference to OpenSanctions. RShigapov (talk) 08:38, 27 January 2022 (UTC)
  • @OpenSanctions: I don't think "OpenSanctions Dataset" is an alternative to "OpenSanctions ID". If you want, make a separate proposal for "OpenSanctions Dataset", but I think I agree with Shigapov that it's not necessary. --Vladimir Alexiev (talk) 11:28, 14 April 2022 (UTC)

Vladimir Alexiev (talk) 11:28, 14 April 2022 (UTC):

OpenSanctions is a truly excellent resource:

  • Incorporates 44 datasets, including
  • Tracks a variety of identifiers, from passport numbers to IBANs.
    • Reuses WD IDs whenever available
  • Offers download in 4 formats (e.g. see EveryPolitician):
    • just names
    • simple CSV
    • FollowTheMoney JSON (application/json+ftm)
    • enriched "Targets as nested" JSON (application/json)
  • Has provenance info for every statement.
  • Has excellent search, e.g.
    • Slobodan+Milošević finds not just him, but also 6 members of his family
    • "Bi Sidi Souleymane" finds "Bi Sidi SOULEMAN"
  • Has an OpenRefine reconciliation API

I ran some counts on the biggest collection, https://opensanctions.org/datasets/default/, using its CSV export (https://data.opensanctions.org/datasets/latest/default/targets.simple.csv):

$ wc -l targets.simple.csv
196347 
$ grep -cP '\bQ\d+\b' targets.simple.csv
112513
$ grep -cP '^"Q\d+"' openSanctions-default.csv
112503
  • About 109,024 of 190,258 entities (57%) have a WD ID
  • Almost all of them use the WD ID as their main identifier
  • 81,234 (43%) don't have a WD identifier
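
The count of identifiers that are themselves QIDs can also be sketched in Python. This is a minimal sketch, assuming the CSV has a header row with a column named `id` (the column name is an assumption; the grep pattern above only shows that the first field is a quoted identifier). The sample rows are made up from the example identifiers in the proposal and are not taken from the real export.

```python
import csv
import io
import re

# An identifier counts as a QID only if the whole value is "Q" plus digits.
QID = re.compile(r"Q\d+")


def count_qid_ids(csv_file):
    """Return (total rows, rows whose id is a Wikidata QID).

    Assumes a header row with an "id" column (an assumption about the
    export layout). The header is not counted, unlike with `wc -l`.
    """
    total = with_qid = 0
    for row in csv.DictReader(csv_file):
        total += 1
        if QID.fullmatch(row["id"]):
            with_qid += 1
    return total, with_qid


# Illustrative rows only; the schema values are assumed.
sample = io.StringIO(
    '"id","schema"\n'
    '"Q471862","Person"\n'
    '"ch-seco-4058","Company"\n'
    '"eu-cor-2030915","Person"\n'
)
total, with_qid = count_qid_ids(sample)
print(f"{with_qid} of {total} ({with_qid / total:.0%}) use a QID as id")
```

To run it on a real download, replace `sample` with `open("targets.simple.csv", newline="", encoding="utf-8")`. Unlike the first grep above, this checks only the `id` column, so QIDs mentioned in other fields are not counted.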

Of the 204,469 total targets in OpenSanctions, this collection has 190,258 (93%). Two datasets are missing from this collection (https://github.com/opensanctions/opensanctions/issues/199 asks for an "all" pseudo-collection):

  • Russian terrorists & extremists list. "it's nutty": @OpenSanctions: does that mean useless?
  • INTERPOL missing children (Yellow Notices): ok, maybe this doesn't belong in WD (for privacy reasons)

Voting

@RShigapov, MasterRus21thCentury, So9q, OpenSanctions, Nikola Tulechki, Borko1990: I'm currently importing the 112504 OS IDs that are identical to the WD ID: https://quickstatements.toolforge.org/#/batch/82849

$ csvtk freq -f schema openSanctions-no-WD.csv
schema,frequency
Airplane,269
Vessel,415
Company,2455
CryptoWallet,7457
Organization,4335
LegalEntity,4581
Person,57754

--Vladimir Alexiev (talk) 09:44, 15 April 2022 (UTC)
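
For readers without csvtk, a per-schema frequency table like the one above can be approximated with the Python standard library. This is a minimal sketch, assuming a CSV with a header row and a `schema` column (the column name comes from the `-f schema` flag in the command). The function name `schema_frequency` is hypothetical, and the sample rows, including their schema values, are illustrative only.

```python
import csv
import io
from collections import Counter


def schema_frequency(csv_file):
    """Count rows per value of the "schema" column."""
    return Counter(row["schema"] for row in csv.DictReader(csv_file))


# Illustrative rows; the identifiers come from the proposal's examples,
# the schema values are assumed.
sample = io.StringIO(
    "id,schema\n"
    "rupep-person-16554,Person\n"
    "NK-knteb2sJu9NwaJ79aK6D9T,Person\n"
    "ch-seco-4058,Company\n"
)
freq = schema_frequency(sample)
print("schema,frequency")
for schema, n in freq.most_common():
    print(f"{schema},{n}")
```

`most_common()` sorts by descending count, so the row order can differ from the csvtk output shown above.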