Wikidata:Recoin


Recoin ("Relative Completeness Indicator") is a script that extends Wikidata entity pages with information about their relative completeness. Relative completeness refers to the extent of information found on an item in comparison with other similar items.

Recoin adds a status indicator (top right) and two expandable lists of important absent properties and IDs to Wikidata (center). Here shown for Abbey Road, for which data is very detailed.

The indicator aggregates the extent of information into a colored progress bar, showing 5 possible color-coded levels of completeness that range from very detailed information to very basic information.

Recoin is intended both to help authors decide where to focus their attention and to make data consumers aware of how much information a specific item contains.

New: As of December 2017, the status indicator is enabled for virtually all Wikidata items.

Arno Kompatscher: Basic information


Motivation

Recoin is intended to assist both authors and consumers of Wikidata.

For users (consumers), it provides a handy summary of the degree of completeness of information in Wikidata, which may help them decide whether to rely on Wikidata to satisfy their information need. Judging purely by article length is not always a good idea: for instance, the chess player Jeff Sarwer (Q3494327) has a long article due to many statements about his Elo rating, but until recently was missing even very basic information such as citizenship or family name.

For authors, it similarly shows which persons have more complete information than others, allowing them to focus their attention on the less complete ones. For an individual person, it shows the most important missing properties, which authors might then focus on completing or, if no values for these properties exist, mark with a novalue assertion.

What it shows

Recoin can add two kinds of information to Wikidata pages:

  • A 5-level status indicator icon, ranging from very detailed to very basic, summarizing the extent of information compared with other, similar entities;
  • Two expandable lists of the most relevant absent properties and external IDs, added to the top of entity pages.

How it works

Architecture

Architecture of Recoin as of December 2017

The architecture depicted in the figure to the right shows the two JavaScript modules, recoin-core.js and recoin-explanations.js, which send requests to getmissingattributes.php on the Toolforge server. This PHP script does the computation by making requests, first to the Wikidata SPARQL endpoint to get the occupations of the given entity, and then to databases on ToolsDB to retrieve the attribute frequencies for the (previously computed) occupations (humans) or class (all non-humans). The results (completeness and the missing properties) are returned as JSON and used by the JavaScript modules to render the page.
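The first step of this pipeline, looking up an entity's occupations, could look roughly like the sketch below. The actual query issued by getmissingattributes.php is not shown on this page, so the helper names buildOccupationQuery and buildSparqlUrl and the exact query text are assumptions made for illustration; only the property occupation (P106) and the public endpoint URL are standard Wikidata facts.

```javascript
// Hypothetical helper (not part of Recoin): builds a SPARQL query that
// lists the occupations (P106) of one item.
function buildOccupationQuery(qid) {
  if (!/^Q[1-9]\d*$/.test(qid)) {
    throw new Error("Not a valid Q-identifier: " + qid);
  }
  return "SELECT ?occupation WHERE { wd:" + qid + " wdt:P106 ?occupation . }";
}

// Hypothetical helper: wraps a query into a GET URL for the public
// Wikidata SPARQL endpoint, asking for JSON results.
function buildSparqlUrl(query) {
  return "https://query.wikidata.org/sparql?format=json&query=" +
    encodeURIComponent(query);
}

console.log(buildSparqlUrl(buildOccupationQuery("Q15074414")));
```

The resulting URL can be opened in a browser or passed to an HTTP client; the occupations returned would then be used to look up the per-occupation property frequencies.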

Computation

So far, the script performs the computation for all classes contained in the table wikidatawiki_p.wbs_propertypairs [1]. Furthermore, it gives more refined results based on the 1000 most frequent professions of humans, by treating professions like classes.

Determination of absent properties and IDs

We first describe the case of an entity belonging to a single class/profession, and discuss multi-class membership below.

Given an entity that belongs to a certain class, we compute the properties most frequently occurring in that class, and check how many of those are absent for the entity. The top-10 missing properties are shown by the core script (a second script also shows external IDs). For classes contained in wikidatawiki_p.wbs_propertypairs, we use all properties available there. For professions of humans, we use the 100 most frequent properties per profession.

For instance, Jimmy Wales (Q181) misses, among other things, the properties languages spoken, written or signed (P1412), member of political party (P102) and position held (P39), which are specified for 13.435%, 9.347% and 8.376%, respectively, of people with the same occupation.
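The selection of missing properties can be sketched as follows. This is a minimal illustration, not Recoin's actual code: the function name topMissingProperties and the input shapes (an object of per-class frequencies in percent and an array of property IDs present on the entity) are assumptions. The three frequencies for P1412, P102 and P39 are taken from the Jimmy Wales example on this page, while the value for P569 is made up.

```javascript
// Hypothetical helper: returns the most frequent class properties that the
// entity does not have, ordered by descending frequency.
// classFrequencies: { propertyId: percentage of class members having it }
// entityProperties: array of property IDs present on the entity
function topMissingProperties(classFrequencies, entityProperties, limit = 10) {
  const present = new Set(entityProperties);
  return Object.entries(classFrequencies)
    .filter(([pid]) => !present.has(pid))   // keep only absent properties
    .sort((a, b) => b[1] - a[1])            // most frequent first
    .slice(0, limit)
    .map(([pid, frequency]) => ({ pid, frequency }));
}

// Illustrative input; the P569 frequency is invented for the example.
const exampleFrequencies = { P569: 90.0, P39: 8.376, P1412: 13.435, P102: 9.347 };
console.log(topMissingProperties(exampleFrequencies, ["P569"]));
```

With this input, P569 is skipped because the entity already has it, and the remaining properties are listed from most to least frequent.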

Status indicator computation

To determine the relative completeness on the 5-level scale, we compute the average frequency of the top 5 missing properties (if there are fewer than 5 missing properties, the frequency of each unfilled slot is assumed to be zero). We then set the level as follows:

  • Level 5 (most complete) 0%-5% average frequency @ top 5 missing properties
  • Level 4 (quite complete) 5%-10% average frequency @ top 5 missing properties
  • Level 3 (medium complete) 10%-25% average frequency @ top 5 missing properties
  • Level 2 (low completeness) 25%-50% average frequency @ top 5 missing properties
  • Level 1 (least complete) 50%+ average frequency @ top 5 missing properties

For example, Arno Kompatscher (Q15074414) is missing

  • P39 (position held) - 54.33%
  • P1412 (languages spoken, written or signed) - 49.93%
  • P102 (member of political party) - 46.62%
  • P1559 (name in native language) - 31.14%
  • P937 (work location) - 30.67%

Thus, the average frequency of the top 5 missing properties is 42.54%, so his level of completeness is 2 (low).
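The level computation can be reproduced with a few lines of code. The sketch below is an illustration under stated assumptions, not Recoin's implementation: the function names are hypothetical, and since the listed ranges share their boundary values (5%, 10%, 25%, 50%), the code assumes that a value exactly on a boundary falls into the less complete level.

```javascript
// Hypothetical helper: average frequency (in percent) of the top N missing
// properties. Slots beyond the available missing properties count as zero,
// which is why the sum is always divided by topN.
function averageTopMissing(missingFrequencies, topN = 5) {
  const top = [...missingFrequencies].sort((a, b) => b - a).slice(0, topN);
  const sum = top.reduce((acc, f) => acc + f, 0);
  return sum / topN;
}

// Hypothetical helper: maps the average to the 5-level scale.
// Assumption: boundary values belong to the less complete level.
function completenessLevel(average) {
  if (average < 5) return 5;   // most complete
  if (average < 10) return 4;  // quite complete
  if (average < 25) return 3;  // medium complete
  if (average < 50) return 2;  // low completeness
  return 1;                    // least complete
}

// Frequencies of the top 5 missing properties of Arno Kompatscher (Q15074414).
const kompatscherMissing = [54.33, 49.93, 46.62, 31.14, 30.67];
const kompatscherAverage = averageTopMissing(kompatscherMissing);
console.log(kompatscherAverage, completenessLevel(kompatscherAverage));
```

For the Arno Kompatscher figures, the sum is 212.69, the average is 212.69 / 5 = 42.538, and the resulting level is 2.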


Treatment of multi-class-membership

For entities belonging to multiple classes (see e.g. Dresden (Q1731)) or persons with multiple occupations (e.g. Arno Kompatscher (Q15074414)), Recoin does the computation based on the weighted frequency of each class/profession.

For instance, Arno Kompatscher (Q15074414) is both a politician and a jurist. There are 297,370 politicians and 12,635 jurists in Wikidata. If 40% of politicians have the property position held (P39) set, while 20% of jurists do, the final computed frequency is the weighted average of about 39%.[2]
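The weighting amounts to a size-weighted mean of the per-class frequencies. A minimal sketch, assuming a hypothetical function weightedFrequency that takes one { size, frequency } record per class or profession:

```javascript
// Hypothetical helper: size-weighted average of per-class frequencies.
// classes: array of { size: number of entities in the class,
//                     frequency: percentage of them having the property }
function weightedFrequency(classes) {
  const totalSize = classes.reduce((acc, c) => acc + c.size, 0);
  if (totalSize === 0) return 0; // no class data at all
  const weightedSum = classes.reduce((acc, c) => acc + c.size * c.frequency, 0);
  return weightedSum / totalSize;
}

// Politician/jurist example from the text.
const p39Frequency = weightedFrequency([
  { size: 297370, frequency: 40 }, // politicians
  { size: 12635, frequency: 20 },  // jurists
]);
console.log(p39Frequency.toFixed(1)); // prints "39.2"
```

Written out, this is (297,370 × 40 + 12,635 × 20) / (297,370 + 12,635) = 12,147,500 / 310,005, which is roughly 39.2%. Because politicians vastly outnumber jurists, the result stays close to the politicians' 40%.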

Special cases

  • For humans, the properties place of death (P20) and date of death (P570) are always filtered out, as they are frequent yet usually undesired for living humans;
  • If an entity belongs to a single class that has no data in wikidatawiki_p.wbs_propertypairs, nothing is shown;
  • If an entity belongs to multiple classes or professions, one of which has no data, the frequency of properties in that class is assumed to be zero;
  • Properties with a frequency of less than 0.01% in a class are assumed to have frequency zero;
  • For entities whose profession is not among the 1000 most frequent ones, missing properties are computed based on humans in general.
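Two of these rules, the death-property filter for humans and the 0.01% cut-off, lend themselves to a short sketch. The function applySpecialCases and its input shape are assumptions for illustration, and P9999 is a made-up property ID used only to show the cut-off.

```javascript
// Properties always dropped for humans: place of death, date of death.
const EXCLUDED_FOR_HUMANS = new Set(["P20", "P570"]);
// Frequencies below this percentage are treated as zero.
const MIN_FREQUENCY = 0.01;

// Hypothetical helper: applies both rules to a map of
// { propertyId: frequency in percent } and returns a new map.
function applySpecialCases(classFrequencies, isHuman) {
  const result = {};
  for (const [pid, frequency] of Object.entries(classFrequencies)) {
    if (isHuman && EXCLUDED_FOR_HUMANS.has(pid)) continue;
    result[pid] = frequency < MIN_FREQUENCY ? 0 : frequency;
  }
  return result;
}

// Illustrative input; all frequencies except P39 are invented.
console.log(applySpecialCases({ P20: 30, P570: 35, P39: 54.33, P9999: 0.005 }, true));
```

For a human, the call above drops P20 and P570, keeps P39 unchanged and sets the rare P9999 to zero.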



Installation

Recoin can be enabled by adding the following line to your Special:MyPage/common.js:

 importScript( 'User:Vvekbv/recoin.js' );

A version showing only ID-properties is available as a user script:

 importScript('User:Vvekbv/recoin_id.js');

We hope Recoin will become available as a gadget, to be enabled at Special:Preferences under the section "Gadgets".

If you maintain a global script file, use the following code in m:Special:MyPage/global.js:

 mw.loader.load('//www.wikidata.org/w/index.php?title=User:Vvekbv/recoin.js&action=raw&ctype=text/javascript');

API

Recoin can also be accessed via an API available at

 https://tools.wmflabs.org/recoin/getmissingattributes.php?subject=Q15074414

and

 https://tools.wmflabs.org/recoin/getmissingattributes_id.php?subject=Q15074414

(substituting the desired entity Q-code).
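A small client for these two endpoints might look like the following sketch. The helper names recoinApiUrl and fetchRecoin are hypothetical, and because the layout of the returned JSON is not described on this page, the code simply returns the parsed response without assuming any field names. It relies on the global fetch function available in browsers and in recent Node.js versions.

```javascript
const RECOIN_BASE = "https://tools.wmflabs.org/recoin/";

// Hypothetical helper: builds the API URL for one entity.
// idsOnly = true selects the variant that reports only external IDs.
function recoinApiUrl(qid, idsOnly = false) {
  if (!/^Q[1-9]\d*$/.test(qid)) {
    throw new Error("Not a valid Q-identifier: " + qid);
  }
  const script = idsOnly ? "getmissingattributes_id.php" : "getmissingattributes.php";
  return RECOIN_BASE + script + "?subject=" + encodeURIComponent(qid);
}

// Hypothetical helper: requests the data and returns the parsed JSON as-is.
async function fetchRecoin(qid, idsOnly = false) {
  const response = await fetch(recoinApiUrl(qid, idsOnly));
  if (!response.ok) {
    throw new Error("Recoin request failed with HTTP " + response.status);
  }
  return response.json();
}
```

A call such as fetchRecoin("Q15074414").then(console.log) would then print whatever the service returns for Arno Kompatscher.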

Further information

Contact:

  • Simon Razniewski - srazniew@mpi-inf.mpg.de (Conceptual lead)
  • Vevake Balaraman - balaraman@fbk.eu (Technical lead)

Further reading:

  • Talk at WikidataCon 2017 "How to know what Wikidata knows"
  • Scientific paper "Assessing the Completeness of Entities in Knowledge Bases" by Albin Ahmeti, Simon Razniewski, Axel Polleres, ESWC P&D 2017

Related projects:

  • Wikipedia article quality assessment using ORES
  • Wikidata property suggester, a tool that uses aggregated association rules for the suggestion of properties to add
  • COOL-WD, a tool that allows asserting the completeness of individual properties directly inside Wikidata.

Acknowledgment: This work is partially supported by the project TaDaQua, funded by the Free University of Bozen-Bolzano.

  1. 42078 as of November 15, 2017; query
  2. This is not the most precise way, as entities that are both politicians and jurists are thereby given twice the weight of other entities, but precomputing all combinations of professions/classes is infeasible both on the fly and a priori, and this weighting is a reasonable approximation.