Wikidata:WikiProject Source MetaData/ToDo

This page lists open tasks within the scope of Wikidata:WikiProject Source MetaData. Tasks on this page should be relatively self-contained and ready to be claimed and executed. This page is not meant to host complex data modeling discussions, which should go to the project's talk page or to a dedicated subpage.

Open tasks

Create missing items for scholarly journals

(via wikicite-discuss) I am looking at reducing the number of items without published in (P1433) statements, and the main inhibitor here is that in many cases, the corresponding journal does not have an item yet. There is a nice list of ~46k journals at , which is about twice the number of journals currently in Wikidata. Most of the latter have an ISSN (about half seem to have two, some oddly have more, e.g. Q72225#P236), and the list from Entrez also has ISSN (separate for print and online, though both fields may be empty). --Daniel_Mietchen
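Since both lists carry ISSNs, one way to find journals that still lack an item is to compare normalized ISSNs. The following is a minimal sketch, not an existing tool: the field names `issn_print` and `issn_online` are assumptions about how the external list might be loaded, and the check-digit test follows the standard ISSN scheme (weights 8 down to 2, modulo 11, with `X` standing for 10).

```python
import re

# Accepts "0378-5955" as well as "03785955"; the last character may be X.
ISSN_PATTERN = re.compile(r"^(\d{4})-?(\d{3})([\dXx])$")


def normalize_issn(raw):
    """Return the ISSN as 'NNNN-NNNC' if its check digit is valid, else None."""
    match = ISSN_PATTERN.match(raw.strip())
    if match is None:
        return None
    digits = match.group(1) + match.group(2)
    check = match.group(3).upper()
    # Weighted sum of the first seven digits with weights 8, 7, ..., 2.
    total = sum(int(d) * w for d, w in zip(digits, range(8, 1, -1)))
    expected = (11 - total % 11) % 11
    expected_char = "X" if expected == 10 else str(expected)
    if check != expected_char:
        return None
    return f"{digits[:4]}-{digits[4:]}{check}"


def missing_journals(candidates, known_issns):
    """Return candidate journals none of whose ISSNs is already known.

    candidates: list of dicts with the hypothetical keys 'issn_print' and
    'issn_online' (either may be empty or None).
    known_issns: iterable of ISSN strings already present in Wikidata.
    Journals without any valid ISSN are skipped, since they cannot be
    matched this way and need a different check (e.g. by title).
    """
    known = {normalize_issn(i) for i in known_issns} - {None}
    missing = []
    for journal in candidates:
        issns = {
            normalize_issn(journal.get(key) or "")
            for key in ("issn_print", "issn_online")
        } - {None}
        if issns and not (issns & known):
            missing.append(journal)
    return missing
```

Matching on either the print or the online ISSN is a deliberate choice here, because existing items may carry only one of the two.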

Possible data sources:

Add missing titles to scholarly journals

About 6K items that are instances of scientific journal (Q5633421) are still using obsolete property P357 (P357) instead of the correct title (P1476) for "title". Looks like the perfect job for a bot. --DarTar (talk) 17:11, 6 November 2016 (UTC)
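A bot for this migration has to deal with one difference between the two properties: P357 held a plain string, while title (P1476) is a monolingual text and therefore needs a language code. The sketch below is only an illustration under that assumption. The query constant and the function name are hypothetical, and how the language is determined for each journal is left open.

```python
# SPARQL sketch for listing affected items (runs only while P357 data exists).
JOURNALS_WITH_OLD_TITLE_QUERY = """
SELECT ?item ?oldTitle WHERE {
  ?item wdt:P31 wd:Q5633421 ;
        wdt:P357 ?oldTitle .
}
"""


def title_statement(old_title, language):
    """Build a Wikibase-JSON statement for title (P1476) from a P357 string.

    old_title: the plain string previously stored in P357.
    language: language code for the monolingual text; this must be supplied
    by the caller, since P357 did not record it.
    """
    text = " ".join(old_title.split())  # collapse stray whitespace
    if not text:
        raise ValueError("empty title")
    if not language:
        raise ValueError("a language code is required for P1476")
    return {
        "type": "statement",
        "rank": "normal",
        "mainsnak": {
            "snaktype": "value",
            "property": "P1476",
            "datavalue": {
                "type": "monolingualtext",
                "value": {"text": text, "language": language},
            },
        },
    }
```

For example, `title_statement("Journal of Examples", "en")` yields a statement that could be passed to an editing library of choice. The old P357 statement would be removed in the same edit.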

Add missing DOI prefixes to publisher items

Currently, DOI prefix (P1662) is used in 40 statements. As of November 6, 2016, there are over 5K prefixes that have been active in the last 12 months. @Magnus Manske: do you think we could plug the list into Mix'n'Match to allow semi-automatic matching and importing via the publisher name? --DarTar (talk) 00:40, 7 November 2016 (UTC)
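As a rough illustration of what name-based matching could look like before human review, the sketch below extracts the prefix from a DOI and proposes publisher items whose normalized label equals the normalized registrant name. It does not describe how Mix'n'Match works internally. The function names and input shapes are assumptions, and the prefix pattern (`10.` followed by four to nine digits) covers the common case only.

```python
import re

DOI_RE = re.compile(r"^(10\.\d{4,9})/\S+$")
DOI_LEADS = ("https://doi.org/", "http://doi.org/",
             "https://dx.doi.org/", "http://dx.doi.org/", "doi:")


def doi_prefix(doi):
    """Return the registrant prefix of a DOI (e.g. '10.1371'), or None."""
    cleaned = doi.strip()
    for lead in DOI_LEADS:
        if cleaned.lower().startswith(lead):
            cleaned = cleaned[len(lead):]
            break
    match = DOI_RE.match(cleaned)
    return match.group(1) if match else None


def normalize_name(name):
    """Lowercase a name and reduce punctuation runs to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def suggest_matches(prefix_rows, publisher_labels):
    """Suggest (prefix, item_id) pairs for review; exact normalized matches only.

    prefix_rows: iterable of (prefix, registrant_name) pairs (assumed input).
    publisher_labels: dict mapping item IDs to labels (assumed input).
    """
    by_name = {}
    for item_id, label in publisher_labels.items():
        by_name.setdefault(normalize_name(label), []).append(item_id)
    suggestions = []
    for prefix, name in prefix_rows:
        for item_id in by_name.get(normalize_name(name), []):
            suggestions.append((prefix, item_id))
    return suggestions
```

Exact matching on normalized names is deliberately conservative. Anything it misses, such as registrant names with added abbreviations, would still need fuzzy or manual matching.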


Add funder data for papers in the Zika corpus

I've been looking at coverage of funder information for papers on Zika virus (Q202864) and Zika fever (Q8071861). Coverage is very limited, and in many cases, the Crossref API doesn't return a value for Crossref funder ID (P3153) even when the paper has an associated funder with a known ID, e.g.:

Aedes hensilli as a potential vector of Chikungunya and Zika viruses (Q22330738) → sponsor (P859) → Centers for Disease Control and Prevention (Q583725)

@Daniel_Mietchen, Andrawaag, I9606: I was wondering if you guys had any thoughts on heuristics and semi-automated processes to retrieve/scrape funder information for papers in this corpus, while making sure that the funder itself is correctly linked to its Crossref funder ID (P3153). --DarTar (talk) 00:53, 7 November 2016 (UTC)

A fully crowd-sourced effort, like the one we did earlier this year on biological database licenses, might work. Within three weeks, approximately 300 licenses were collected and added by the community (approximately 10-15 Wikidatians). This would, however, need some design and modeling to be effective. To work, the paper should be no more than one click away from its Wikidata entry. I don't mind reading through 20-30 papers to identify funders and add those to Wikidata, as long as it can be done within a minute. --Andrawaag (talk) 15:55, 7 November 2016 (UTC)
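To separate papers where Crossref already supplies a funder ID from those that need the manual pass discussed in this thread, a small triage step could be run first. The sketch below assumes the work record has already been fetched and parsed into a dict, and that funders appear under a `funder` list with `name`, `DOI` and `award` keys, where the funder ID is the part of the funder DOI after `10.13039/`. Treat these field names as assumptions to verify against the current Crossref API documentation.

```python
FUNDER_DOI_PREFIX = "10.13039/"


def funders_from_crossref(work):
    """Extract funder name, funder ID and awards from a parsed work record.

    work: the 'message' object of a Crossref works response, as a dict
    (assumed shape; not fetched here).
    """
    results = []
    for funder in work.get("funder", []):
        doi = funder.get("DOI") or ""
        if doi.startswith(FUNDER_DOI_PREFIX):
            funder_id = doi[len(FUNDER_DOI_PREFIX):]
        else:
            funder_id = None  # funder named, but no ID supplied
        results.append({
            "name": funder.get("name"),
            "funder_id": funder_id,
            "awards": list(funder.get("award", [])),
        })
    return results


def needs_manual_review(work):
    """True if the record has no funders, or any funder lacks an ID."""
    funders = funders_from_crossref(work)
    return not funders or any(f["funder_id"] is None for f in funders)
```

Papers flagged by `needs_manual_review` would form the worklist for a crowd-sourced pass. The rest could be matched to items via Crossref funder ID (P3153).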

Creating a cancer variant corpus

In preparation for the upcoming hackathons (1. CIViC and 2. SWAT4LS), where we will explore Wikidata in the context of cancer gene variants, which provide links between genes, diseases and drugs, we have a corpus of about 1000 PubMed entries. What would be the best workflow to add these entries to Wikidata? I can start adding them as stubs like I did with Q27777801, but what would be the procedure to get them completely annotated? --Andrawaag (talk) 16:10, 7 November 2016 (UTC)
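One possible way to create the stubs in bulk is to generate QuickStatements commands from the list of PubMed IDs. This is a sketch of that single step only, not a full annotation workflow. It assumes titles are already available, uses scholarly article (Q13442814) and PubMed ID (P698), and does not check whether an item for the paper already exists, which would have to happen beforehand.

```python
def quickstatements_for_stub(pmid, title, language="en"):
    """Return tab-separated QuickStatements (V1) commands for one stub.

    pmid: PubMed ID as string or int.
    title: article title, used as the label in the given language.
    """
    pmid = str(pmid).strip()
    if not pmid.isdigit():
        raise ValueError(f"not a PubMed ID: {pmid!r}")
    clean_title = " ".join(title.split())  # no tabs or newlines in a command
    if not clean_title:
        raise ValueError("empty title")
    if '"' in clean_title:
        # Embedded double quotes would break the quoted string; handle by hand.
        raise ValueError("title contains a double quote")
    lines = [
        "CREATE",
        f'LAST\tL{language}\t"{clean_title}"',
        "LAST\tP31\tQ13442814",
        f'LAST\tP698\t"{pmid}"',
    ]
    return "\n".join(lines)
```

The output for all entries can be concatenated and pasted into QuickStatements. Authors, journal and other metadata would then be added in a second pass.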

Import the Wikimedia Research corpus

The editors of the Wikimedia Research newsletter maintain a Zotero library of all references cited in individual issues. Importing this corpus will make this metadata more widely available to the community at large, help us explore Wikidata ↔ Meta data reuse strategies and help build a link corpus annotated with topics, venues etc. @HaeB, Fnielsen: cc'ing you for info. --DarTar (talk) 23:55, 15 November 2016 (UTC)

As a subtask, import the research output of the Wikimedia Research team, see phab:T144575.
See also WD:Zotero; and comments on Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 16:43, 3 April 2018 (UTC)

Fix authorship inconsistencies

Although ORCID adoption is growing, most sources still provide most authorship information simply as text strings, rather than by way of identifiers. To cope with this situation, we are using two properties for authorship on items about a publication: author name string (P2093) is used if only the text string is known, and author (P50) is used if the author's Wikidata ID has been identified. In both cases, authorship order is indicated by way of series ordinal (P1545). The switch from P2093 to P50 can be done through a Wikidata game or a dedicated tool, both of which move the P1545 information automatically. Of course, the switch can also be done by hand or by way of bots (at least in principle; I don't know of any that are actually doing this), in which case P1545 information should be taken into account.

On a related note, we currently do not have a good way of tracking those P2093-to-P50 conversions in terms of what the original text string was. For this, object named as (P1932) could be used, but while this is reasonably common for books, it is basically absent for journal articles.
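For bots or scripts doing this conversion, the two points above (carrying over P1545, and recording the original string via P1932) could be combined in one step. The following sketch works on a simplified, hypothetical claim representation (a dict with `property`, `value` and `qualifiers`), not on the full Wikibase JSON. The function name is likewise an assumption.

```python
import copy


def convert_author_string(claim, author_qid):
    """Turn an author name string (P2093) claim into an author (P50) claim.

    claim: simplified dict, e.g.
        {"property": "P2093", "value": "J. Doe",
         "qualifiers": {"P1545": ["3"]}}
    author_qid: the item ID identified for this author.
    All existing qualifiers (including series ordinal, P1545) are kept,
    and the original text string is stored as object named as (P1932)
    unless such a qualifier is already present.
    """
    if claim.get("property") != "P2093":
        raise ValueError("expected an author name string (P2093) claim")
    # Deep copy so the input claim is left untouched.
    qualifiers = copy.deepcopy(claim.get("qualifiers", {}))
    if "P1932" not in qualifiers:
        qualifiers["P1932"] = [claim["value"]]
    return {"property": "P50", "value": author_qid, "qualifiers": qualifiers}
```

The old P2093 claim would be removed once the new P50 claim has been saved, so that the author is not listed twice with the same ordinal.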

For such P50 and P2093 statements as well as their P1545 and P1932 qualifiers, the following issues appear frequently, often as a result of a merge:

We need some mechanisms to clean up such cases on a regular basis. One way to deal with that could be to use queries like the above in conjunction with {{Complex constraint}}, so as to introduce these issues into the normal constraint violation workflows. However, this is likely not going to be sufficient, and we will probably need some bots that can take this on. --Daniel Mietchen (talk) 01:45, 18 March 2017 (UTC)

Another tool to switch from P2093 to P50 is Orcidator. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 16:42, 3 April 2018 (UTC)

Use machine learning to help with author disambiguation

There are a number of machine learning efforts going on within the Wikimedia community (e.g. Wikidata:ORES), and there are non-Wikimedia initiatives around author disambiguation that use machine learning techniques (e.g. The Academic Family Tree). We should think about how these could be brought together and leveraged for improving author disambiguation here at Wikidata, either directly or by way of tools like Mix'n'Match. --Daniel Mietchen (talk) 14:12, 21 March 2017 (UTC)

Unsorted tasks

  • Clean up communication media (manifestation) and work trees
    • Restructure as appropriate
    • Fix items which should not appear in the tree because they are actually instances, not subclasses, of the work.
      • For instance, almost all of the items in the typeface tree are instances, not subclasses. Remove the "subclass of" typeface statement and replace it with "instance of" typeface. There are similar issues in religious texts and other trees.
  • Annotate this table of standard citation source types (i.e. CSL, BibTeX) with the appropriate Wikidata item.
    • If no item corresponds, create a new one.
    • Add a "subclass of" property pointing to an item from either the work or manifestation (communication media) trees.
  • Discuss and collaborate with the community to bring in more participation and input to relevant source metadata
  • Improve quality, coverage, and analysis of existing tools used to manage source metadata, identifiers, references, citations, etc. Especially of interest are ways users and communities currently manage information that may ultimately be moved to tools and workflows that leverage Wikidata. Analyses of tools and protocols may thus imply potential equivalent replacements, or new tools, which should strive to improve the overall experience for these users and communities.

New property proposals

Pending discussions