Wikidata:WikiProject Source MetaData/ToDo


This page lists open tasks within the scope of Wikidata:WikiProject Source MetaData. Tasks here should be relatively self-contained and ready to be claimed and executed; this page is not meant to host complex data-modeling discussions, which should go to the project's talk page or to a dedicated subpage.

Open tasks

Create missing items for scholarly journals

(via wikicite-discuss) I am looking at reducing the number of items without published in (P1433) statements, and the main inhibitor here is that in many cases, the corresponding journal does not have an item yet. There is a nice list of ~46k journals at ftp://ftp.ncbi.nih.gov/pubmed/J_Entrez.txt, which is about twice the number of journals currently in Wikidata. Most of the existing Wikidata journal items have an ISSN (about half seem to have two, some oddly have more, e.g. Q72225#P236), and the Entrez list also has ISSNs (separate fields for print and online, though either may be empty). --Daniel_Mietchen
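As a sketch of what such an import could build on, here is a minimal parser for J_Entrez.txt-style records. The exact field names used below ("JournalTitle", "ISSN (Print)", "ISSN (Online)", "NlmId") are assumptions based on the file's colon-separated, dash-delimited record layout and should be checked against the actual download:

```python
import re

def parse_entrez_journals(text):
    """Yield one dict per journal record; records are separated by long dashed lines."""
    journals = []
    for record in re.split(r"-{20,}", text):
        fields = {}
        for line in record.strip().splitlines():
            if ":" in line:
                key, _, value = line.partition(":")
                fields[key.strip()] = value.strip()
        # Keep only records that actually carry a journal title.
        if fields.get("JournalTitle"):
            journals.append(fields)
    return journals

# Illustrative sample record in the assumed format (not real data).
sample = """
--------------------------------------------------------
JrId: 1
JournalTitle: AADE editors' journal
MedAbbr: AADE Ed J
ISSN (Print): 0160-6999
ISSN (Online):
NlmId: 7708172
--------------------------------------------------------
"""
print(parse_entrez_journals(sample))
```

The print/online ISSN fields parsed this way could then be matched against existing ISSN (P236) statements before creating new items.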

Possible data sources:

Add missing titles to scholarly journals

About 6K items that are instances of scientific journal (Q5633421) are still using the obsolete property P357 (since deleted, so it displays as "no label") instead of the correct title (P1476) for "title". Looks like the perfect job for a bot. --DarTar (talk) 17:11, 6 November 2016 (UTC)
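A worklist query along these lines could feed such a bot. This is a sketch for the Wikidata Query Service; since P357 has been deleted, whether its statements remain reachable through `wdt:` paths is an assumption, and a dump-based scan may be needed instead:

```python
# SPARQL for https://query.wikidata.org: journal items that still carry the
# obsolete P357 but have no title (P1476) yet.
QUERY = """
SELECT ?item ?oldTitle WHERE {
  ?item wdt:P31 wd:Q5633421 ;
        wdt:P357 ?oldTitle .
  FILTER NOT EXISTS { ?item wdt:P1476 ?newTitle . }
}
"""
print(QUERY)
```

The bot would then copy each `?oldTitle` value into a monolingual-text P1476 statement and remove the P357 claim.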

Add missing DOI prefixes to publisher items

Currently, DOI Prefix (P1662) is used in 40 statements. As of November 6, 2016, there are over 5K prefixes that have been active in the last 12 months. @Magnus Manske: do you think we could plug the list into Mix'n'Match to allow semi-automatic matching and importing via the publisher name?--DarTar (talk) 00:40, 7 November 2016 (UTC)

@Daniel_Mietchen:^
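Whichever route the matching takes, it will need the registrant prefix itself, which is simply the part of a DOI before the first slash. A small helper; the URL and "doi:" forms stripped below are the common ones:

```python
def doi_prefix(doi):
    """Return the registrant prefix of a DOI, e.g. '10.1371' for a PLOS DOI."""
    doi = doi.strip()
    for scheme in ("https://doi.org/", "http://doi.org/",
                   "http://dx.doi.org/", "doi:"):
        if doi.startswith(scheme):
            doi = doi[len(scheme):]
    # Everything before the first slash is the prefix assigned to the publisher.
    return doi.split("/", 1)[0]

print(doi_prefix("10.1371/journal.pone.0011111"))
```

Grouping a corpus of DOIs by this key would give the publisher-name-to-prefix pairs that a Mix'n'Match catalog could present for semi-automatic matching.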

Add funder data for papers in the Zika corpus

I've been looking at coverage of funder information for papers on Zika virus (Q202864) and Zika fever (Q8071861). Coverage is very limited, and in many cases, the Crossref API doesn't return a value for Crossref funder ID (P3153) even when the paper has an associated funder with a known ID, e.g.:

Aedes hensilli as a potential vector of Chikungunya and Zika viruses (Q22330738) → sponsor (P859) → Centers for Disease Control and Prevention (Q583725)

@Daniel_Mietchen, Andrawaag, I9606: I was wondering if you guys had any thoughts on heuristics and semi-automated processes to retrieve/scrape funder information for papers in this corpus, while making sure that the funder itself is correctly linked to its Crossref funder ID (P3153).--DarTar (talk) 00:53, 7 November 2016 (UTC)

A fully crowd-sourced effort like the one we ran earlier this year on biological database licenses might work. Within three weeks, approx. 300 licenses were collected and added by the community (approx. 10-15 Wikidatans). This would however need some design and modeling to be effective: getting from the paper to its Wikidata entry should be one click away for this to work. I don't mind reading through 20-30 papers to identify funders and add those to Wikidata, as long as it can be done within a minute. --Andrawaag (talk) 15:55, 7 November 2016 (UTC)
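For the semi-automated route discussed above, a first pass could read funder data straight out of the Crossref /works/{doi} response. The helper below assumes the public Crossref REST API response shape (funder objects with "name" and "DOI" keys under "message"); the sample record is illustrative, not real data:

```python
def extract_funders(works_response):
    """Pull funder names and Crossref funder IDs (P3153 values) from a
    Crossref /works/{doi} API response dict."""
    funders = []
    for f in works_response.get("message", {}).get("funder", []):
        entry = {"name": f.get("name")}
        if "DOI" in f:
            # Funder IDs are registered under the 10.13039 prefix;
            # P3153 stores only the trailing numeric part.
            entry["funder_id"] = f["DOI"].rsplit("/", 1)[-1]
        funders.append(entry)
    return funders

sample = {"message": {"funder": [
    {"name": "Centers for Disease Control and Prevention",
     "DOI": "10.13039/100000000"}  # placeholder ID, for illustration only
]}}
print(extract_funders(sample))
```

Papers where Crossref returns a funder name but no "DOI" key are exactly the cases that would fall through to manual lookup against the Funder Registry.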

Creating a cancer variant corpus

In preparation for the upcoming hackathons (1. CIViC and 2. SWAT4LS), where we will explore Wikidata in the context of cancer gene variants, which provide links between genes, diseases and drugs, we have a corpus of about 1000 PubMed entries. What would be the best workflow to add these entries to Wikidata? I can start adding them as stubs like I did with Q27777801, but what would be the procedure to get them completely annotated? --Andrawaag (talk) 16:10, 7 November 2016 (UTC)
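One low-tech option for the stub-creation step is a QuickStatements batch. The sketch below emits v1 tab-separated commands that create one item per PubMed ID with instance of (P31) scholarly article (Q13442814) and a PubMed ID (P698) statement; full annotation (title, journal, authors) would still need a metadata lookup afterwards:

```python
def pubmed_stub_commands(pmids):
    """Build a QuickStatements v1 batch creating one stub item per PubMed ID."""
    lines = []
    for pmid in pmids:
        lines.append("CREATE")
        lines.append("LAST\tP31\tQ13442814")     # instance of: scholarly article
        lines.append('LAST\tP698\t"%s"' % pmid)  # PubMed ID (string, quoted)
    return "\n".join(lines)

print(pubmed_stub_commands(["27777777"]))
```

Running the 1000-entry corpus through this first, then annotating via the P698 links, would keep the two steps independent.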

Import the Wikimedia Research corpus

The editors of the Wikimedia Research newsletter maintain a Zotero library of all references cited in individual issues. Importing this corpus will make this metadata more widely available to the community at large, help us explore Wikidata ↔ Meta data-reuse strategies and help build a link corpus annotated with topics, venues etc. @HaeB, Fnielsen: cc'ing you for info.--DarTar (talk) 23:55, 15 November 2016 (UTC)

As a subtask, import the research output of the Wikimedia Research team; see phab:T144575.
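A first step for such an import could be collecting DOIs from the Zotero library via its web API. The helper below assumes the api.zotero.org v3 item shape (bibliographic fields under a "data" key) and uses an illustrative sample rather than the actual newsletter library:

```python
def dois_from_zotero_items(items):
    """Collect non-empty DOI fields from a list of Zotero web-API item dicts."""
    dois = []
    for item in items:
        doi = item.get("data", {}).get("DOI", "").strip()
        if doi:
            dois.append(doi)
    return dois

# Illustrative items: one journal article with a DOI, one webpage without.
sample = [{"data": {"itemType": "journalArticle", "DOI": "10.1000/example.1"}},
          {"data": {"itemType": "webpage"}}]
print(dois_from_zotero_items(sample))
```

Items without a DOI (web pages, reports) would need a separate matching strategy, e.g. by title and URL.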

Fix authorship inconsistencies

Although ORCID adoption is growing, most sources still provide most authorship information simply as text strings rather than by way of identifiers. To cope with this situation, we are using two properties for authorship on items about a publication: author name string (P2093) is used if only the text string is known, and author (P50) is used if the author's Wikidata ID has been identified. In both cases, authorship order is indicated by way of series ordinal (P1545). The switch from P2093 to P50 can be done through a Wikidata game or a dedicated tool, both of which move the P1545 information automatically. Of course, the switch can also be done by hand or by way of bots (at least in principle; I don't know of any that are actually doing this), in which case P1545 information should be taken into account.

On a related note, we currently do not have a good way of tracking those P2093-to-P50 conversions in terms of what the original text string was. For this, stated as (P1932) could be used, but while this is reasonably common for books, it is basically absent for journal articles.
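To make the intended conversion concrete, here is a sketch on a simplified claim model (plain dicts, not real Wikibase JSON): the series ordinal (P1545) qualifier is carried over, and the original name string is recorded as a stated as (P1932) qualifier on the new author (P50) statement. The QID in the example is a placeholder:

```python
def convert_author_claim(claim, author_qid):
    """Turn a P2093 (author name string) claim into a P50 (author) claim,
    keeping existing qualifiers and recording the string as P1932."""
    assert claim["property"] == "P2093"
    new_claim = {
        "property": "P50",
        "value": author_qid,
        "qualifiers": dict(claim.get("qualifiers", {})),  # carries P1545 over
    }
    new_claim["qualifiers"]["P1932"] = claim["value"]  # stated as: original string
    return new_claim

claim = {"property": "P2093", "value": "D. Mietchen",
         "qualifiers": {"P1545": "3"}}
print(convert_author_claim(claim, "Q12345"))  # Q12345: placeholder author item
```

Doing the P1932 step as part of every conversion would give exactly the provenance trail the paragraph above says is currently missing for journal articles.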

For such P50 and P2093 statements as well as their P1545 and P1932 qualifiers, the following issues appear frequently, often as a result of a merge:

We need some mechanisms to clean up such cases on a regular basis. One way to deal with that could be to use queries like the above in conjunction with {{Complex constraint}}, so as to introduce these issues into the normal constraint violation workflows. However, this is likely not going to be sufficient, and we will probably need some bots that can take this on. --Daniel Mietchen (talk) 01:45, 18 March 2017 (UTC)
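As one example of a check that such a bot or a {{Complex constraint}} query could run regularly (the specific issue chosen here, duplicated ordinals left over after a merge, is an assumed instance of the inconsistencies meant above):

```python
from collections import Counter

def duplicate_ordinals(authorship_claims):
    """Return series ordinal (P1545) values that occur on more than one
    P50/P2093 statement of the same item, a typical merge artifact."""
    counts = Counter(c["qualifiers"]["P1545"]
                     for c in authorship_claims
                     if "P1545" in c.get("qualifiers", {}))
    return sorted(o for o, n in counts.items() if n > 1)

claims = [{"property": "P50", "value": "Q1", "qualifiers": {"P1545": "1"}},
          {"property": "P2093", "value": "A. Author", "qualifiers": {"P1545": "1"}},
          {"property": "P2093", "value": "B. Author", "qualifiers": {"P1545": "2"}}]
print(duplicate_ordinals(claims))
```

A P50 and a P2093 statement sharing an ordinal, as in the sample, is often the same person listed twice and a candidate for the conversion-plus-removal treatment described above.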

Use machine learning to help with author disambiguation

There are a number of machine learning efforts going on within the Wikimedia community (e.g. Wikidata:ORES), and there are non-Wikimedia initiatives around author disambiguation that use machine learning techniques (e.g. the Academic Family Tree). We should think about how these could be brought together and leveraged to improve author disambiguation here at Wikidata, either directly or by way of tools like Mix'n'Match. --Daniel Mietchen (talk) 14:12, 21 March 2017 (UTC)
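As a minimal illustration of the preprocessing step such efforts typically share, here is a blocking key that buckets author-name variants together before any classifier or human review sees them. Real disambiguation would layer co-author, topic and affiliation features on top of such blocking:

```python
import re
import unicodedata

def name_key(name):
    """Reduce a name to 'surname + first initial', lowercased and
    accent-stripped, so variants land in the same candidate bucket."""
    name = unicodedata.normalize("NFKD", name)
    name = "".join(ch for ch in name if not unicodedata.combining(ch))
    if "," in name:  # "Smith, John" -> "John Smith"
        last, _, first = name.partition(",")
        name = first.strip() + " " + last.strip()
    parts = re.findall(r"[A-Za-z]+", name)
    if not parts:
        return ""
    return (parts[-1] + " " + parts[0][0]).lower()

print(name_key("Smith, John"))  # same key as "J. Smith" and "John Smith"
```

Grouping the millions of P2093 strings by such a key would shrink the candidate space that a learned model, or a Mix'n'Match-style review queue, has to score.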

New property proposals

Pending discussions

  • Possible properties linking academic conferences with their proceedings, and conference series with their conference proceedings series (read more). — Finn Årup Nielsen (fnielsen) (talk) 10:56, 12 November 2016 (UTC)
  • n/a

Created