Wikidata talk:Automated finding references input

Questions

Here are some suggested questions to help you give us feedback.

  • Did you encounter any issue with the suggested references?
  • Would you take them over into Wikidata or is there anything in them that you would not consider appropriate or correct?

Discussions

General idea (not specific to this batch)

Ideally, Wikidata should have an integrated system to:

  • Store a list of external IDs (not limited to those linked to Wikidata)
  • Fetch useful data from external websites
  • Compare it with Wikidata and make a report

Users would then be able to:

  • Add references to existing statements from this system
  • Add new statements Wikidata does not have
  • Fix errors in Wikidata
  • Find errors in external database and report them
  • Match external IDs to existing items
  • Create new items from external IDs with data provided

This will be a combination of:

  • Automated finding of references based on semantic markup
  • Checks against 3rd party databases
  • Feeding back data quality issues to data providers
  • Adding more statements with sources from external datasets (what the Primary Sources Tool did)
  • Pulling data from existing databases
  • Matching data between external databases and datasets and Wikidata (currently a task of Mix'n'Match and OpenRefine)
  • And many others, since the set of data from 3rd party databases may be reused by others

Of course, such a system may run into various implementation issues, e.g. copyright.
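
As a rough illustration of the "compare it with Wikidata and make a report" step listed above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function compare, the Verdict and ReportRow types, and the representation of claims as plain dicts of string values are hypothetical and not part of any existing Wikidata tool.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    MATCH = "match"        # the external value equals a value already in Wikidata
    MISMATCH = "mismatch"  # Wikidata has values for the property, but none agree
    MISSING = "missing"    # Wikidata has no value for the property at all


@dataclass(frozen=True)
class ReportRow:
    item_id: str
    property_id: str
    external_value: str
    verdict: Verdict


def compare(item_id, wikidata_claims, external_claims):
    """Compare external values with Wikidata values, property by property.

    Both claim arguments are hypothetical dicts mapping a property ID to a
    set of string values; a real system would need typed values.
    """
    rows = []
    for prop, ext_values in sorted(external_claims.items()):
        wd_values = wikidata_claims.get(prop, set())
        for value in sorted(ext_values):
            if not wd_values:
                verdict = Verdict.MISSING
            elif value in wd_values:
                verdict = Verdict.MATCH
            else:
                verdict = Verdict.MISMATCH
            rows.append(ReportRow(item_id, prop, value, verdict))
    return rows


if __name__ == "__main__":
    # Made-up item and values, for illustration only.
    wikidata = {"P21": {"Q6581097"}, "P569": {"1950"}}
    external = {"P21": {"Q6581097"}, "P569": {"1951"}, "P19": {"Q64"}}
    for row in compare("Q_EXAMPLE", wikidata, external):
        print(row.property_id, row.external_value, row.verdict.value)
```

In terms of the user actions listed above, a MATCH row is a candidate for adding a reference to an existing statement, a MISSING row is a candidate for a new statement, and a MISMATCH row needs a human to decide whether Wikidata or the external database is wrong.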

Comments on this specific batch

Not all external databases are reliable sources. Some are user-generated, and some just pull data from Wikidata or Wikipedia. Wikidata should develop a mechanism to indicate the quality of sources.

--GZWDer (talk) 13:23, 8 April 2020 (UTC)


Concrete feedback: some of the sources contain information in their metadata that they do not show in a user-visible way. Most often this happens when the site managers decide that a value is trivial to know or deduce. An example is tennistemple.com, from which you extract the metadata sex or gender (P21): male (Q6581097). But this information is not displayed anywhere on the site, so it should not be used as a source for this fact. Detecting these cases automatically will probably be tricky.
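
One rough heuristic for the case described above is to check whether any human-readable label of the extracted value occurs in the page's visible text. The sketch below uses only Python's standard html.parser module. The class and function names, the list of hidden tags and the sample page are assumptions for illustration. It would also miss text rendered by JavaScript, text hidden via CSS, and values displayed under a label in another language.

```python
import re
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text that is not inside <head>, <script>, <style> and similar."""

    HIDDEN_TAGS = {"head", "script", "style", "template", "noscript"}

    def __init__(self):
        super().__init__()
        self._hidden_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN_TAGS:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HIDDEN_TAGS and self._hidden_depth > 0:
            self._hidden_depth -= 1

    def handle_data(self, data):
        if self._hidden_depth == 0:
            self.chunks.append(data)


def is_value_visible(html, labels):
    """Return True if any label occurs as a whole word in the visible text."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split()).casefold()
    return any(
        re.search(r"\b" + re.escape(label.casefold()) + r"\b", text)
        for label in labels
    )


if __name__ == "__main__":
    # Made-up page: the value is present in embedded metadata only.
    page = (
        '<html><head><script type="application/ld+json">'
        '{"gender": "male"}</script></head>'
        "<body><h1>Some Player</h1><p>Ranking: 12</p></body></html>"
    )
    print(is_value_visible(page, ["male"]))
```

Matching on whole words matters here: a plain substring test would find "male" inside "female".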

More general feedback: I do like this project and think that Wikidata will benefit a lot from it. But for several years I have seen a general issue: both parts of the Wikidata community and WMDE/WMF focus very strongly on bot edits, web crawling and data imports and, in my opinion, neglect the importance of manual, human edits. This is very understandable, since imports can add vast amounts of facts in a short time, more than any human would ever be able to. But we should all be aware that imports by definition only add information that is already available in machine-readable form; otherwise you couldn't parse, crawl or import it. Only manual human edits can add facts that did not already exist in machine-readable form. Imports like this can make data more machine-readable and easier for users to access, but the vast majority of human knowledge, which so far does not exist in any machine-readable version, can only be included and made publicly available through human edits. Therefore we have to make editing easier, quicker and more fun! Adding sources by hand in particular is currently very hard, tedious and annoying. Please give these aspects more attention than in recent years. Let's keep Wikidata from becoming a botpedia! -- MichaelSchoenitzer (talk) 14:59, 1 May 2020 (UTC)

Only use a strict, community-curated whitelist of sources

Generally a very useful feature. But as mentioned above, not all sources are equal, and some are much more reliable than others. I would recommend implementing this feature only with a limited set of whitelisted (or non-blacklisted) sources that are approved through a meticulous community process.

The example batch uses three sources from a domain that I know very well: the visual arts, and this is how I'd assess these sources for a potential whitelist or blacklist:

  • RKDartists ID (P650) - OK as a source, quite reliable, as this database is managed by a (non-commercial) Dutch government agency, with mainly scholars as employees. Individual references would still need vetting, as they are not infallible of course.
  • Artnet artist ID (P3782) - NOT OK as a source, not reliable enough IMO, as it's a commercial initiative and I've seen a lot of erroneous data coming from them via Mix'n'match for instance.
  • MutualArt artist ID (P6578) - NOT OK either, exact same reasoning as for the example above.

Curious what others think, especially people who are subject experts in very different domains. Spinster 💬 15:51, 8 April 2020 (UTC)

Thanks, Spinster :) I agree we need a blacklist. (I fear a whitelist is a bit much.) I've created phabricator:T249862. --Lydia Pintscher (WMDE) (talk) 19:21, 9 April 2020 (UTC)

ArthurPSmith comments

(Feel free to rearrange my comments into an appropriate heading here, I didn't see quite where to put this given the preceding comment sections).

I assume the "Value" column is the current value in Wikidata, as opposed to the "Values given in the external identifier". Was there any attempt made to match the two columns to see if they agree or disagree? It might be nice to have another column showing whether your tool thought things matched or not (assuming there were cases where it thought not). There appear to be several types of disagreements in this sample set:

  1. Nouveau théâtre de Montreuil (Q57988) the geographic coordinates differ in the last 2 digits; I don't think we should import such a reference where the disagreement is much larger than the claimed precision - however the claimed precision is about 1 meter which seems fairly ridiculous, so maybe depending on the type of object involved we could allow it. The differences seem to be less than 10 meters so certainly they are very close.
  2. Kto rasskazhet nebylitsu? (Q62386) the years are completely different - 1982 and 1997. This should be clearly flagged and certainly not used as a reference for that information!
  3. The Picture of Dorian Gray (Q82464) the languages are clearly different - 'English' vs 'heb'. There are other examples like this in the table. Maybe the id should be on a translation instead of on the main item for this work? Also, why is the same entry for The Picture of Dorian Gray (Q82464) listed 9 times in your table?

But overall I was impressed that there were many cases where things did seem to match, and it would certainly have been useful to add those as references (if there were none for the stated information up to that point). ArthurPSmith (talk) 18:11, 8 April 2020 (UTC)

Thanks for spotting these. We'll analyze them. --Lydia Pintscher (WMDE) (talk) 19:24, 9 April 2020 (UTC)
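
The kinds of disagreement listed above call for different comparisons: coordinates can be compared by distance with a tolerance, while language values need normalising before they can be compared at all (years can simply be tested for equality). Below is a minimal Python sketch of the first two. The function names, the 25-metre default tolerance, the tiny alias table and the sample coordinates are assumptions for illustration, not values taken from the batch.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

# Tiny illustrative alias table; a real one would cover all language codes.
LANGUAGE_ALIASES = {
    "english": "en", "eng": "en", "en": "en",
    "hebrew": "he", "heb": "he", "he": "he",
}


def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres, using the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def coordinates_agree(wikidata, external, tolerance_m=25.0):
    """Treat two (lat, lon) pairs as agreeing if they lie within tolerance_m."""
    return distance_m(*wikidata, *external) <= tolerance_m


def normalize_language(label):
    """Map a language name or code to a two-letter code, or None if unknown."""
    return LANGUAGE_ALIASES.get(label.strip().casefold())


def languages_agree(a, b):
    """Two language labels agree only if both are known and map to one code."""
    code_a, code_b = normalize_language(a), normalize_language(b)
    return code_a is not None and code_a == code_b


if __name__ == "__main__":
    # Made-up coordinates that differ only in the last digits.
    print(coordinates_agree((48.86, 2.44), (48.86005, 2.44003)))
    print(languages_agree("English", "heb"))
```

The tolerance could be chosen per type of object, as suggested above: a few tens of metres may be fine for a building but not for a statue.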

Source blacklist

While I totally agree with Spinster on building a curated whitelist of reliable sources, here's an available blacklist that can be useful to filter out bad sources by domain: Wikidata:Primary_sources_tool/URL_blacklist. Hope this helps!

Cheers,

Hjfocs (talk) 10:29, 9 April 2020 (UTC)

Thanks :) That was also on my radar, though I am not sure how best to integrate it into the current workflow centered around external ID Properties. Will discuss it with the team. --Lydia Pintscher (WMDE) (talk) 19:25, 9 April 2020 (UTC)
  • Not sure if it should be a "source" blacklist. It seems odd that Primary Sources blacklists IMDb and newspapers. Maybe it's just an enwiki issue that should be handled by selecting preferred references on retrieval. --- Jura 12:31, 15 April 2020 (UTC)
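
One conceivable way to combine the two kinds of blacklist discussed above (by external ID Property and by URL domain) is a filter applied to each candidate reference. The Python sketch below is only an illustration: the candidate dict layout, the function names and the example domains are hypothetical, and the two Property IDs are taken from Spinster's personal assessment above, not from any agreed list.

```python
from urllib.parse import urlsplit


def domain_of(url):
    """Return the host of a URL, lowercased and without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def is_blocked(candidate, blocked_properties, blocked_domains):
    """Check a candidate reference against both blacklists.

    candidate is a hypothetical dict with the keys 'id_property' (the
    external ID Property the reference was derived from) and 'url'.
    """
    if candidate["id_property"] in blocked_properties:
        return True
    host = domain_of(candidate["url"])
    # Block the listed domain itself and any of its subdomains.
    return any(host == d or host.endswith("." + d) for d in blocked_domains)


if __name__ == "__main__":
    blocked_properties = {"P3782", "P6578"}  # illustrative only
    blocked_domains = {"example.net"}        # placeholder domain
    candidates = [
        {"id_property": "P650", "url": "https://www.example.org/artist/1"},
        {"id_property": "P3782", "url": "https://www.example.org/artist/2"},
        {"id_property": "P650", "url": "https://data.example.net/artist/3"},
    ]
    kept = [c for c in candidates
            if not is_blocked(c, blocked_properties, blocked_domains)]
    print(len(kept))
```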

Insert statement with parameters

Moin moin, User:Lea Lacroix (WMDE), as we say in Germany. I would like to make a few remarks about the citations. It would be nice if the sources/literature found could also be inserted directly, in such a way that they could then be reused in the Wikipedias. Example:

today maybe tomorrow ;)

or

If the second column is created as a reference, the values can be passed on to Wikipedia using Module:Wikidata. The corresponding templates would then display a correct citation with all the minimum information. See Template Cite Web/News or Vorlage Internetquelle. Just food for thought ;) Regards --Crazy1880 (talk) 17:00, 14 April 2020 (UTC)

Hello, and thanks for your comment. We generated our first batch based on the structure that is most commonly used in similar existing references on Wikidata. Of course, we could discuss changing it with the community. Lea Lacroix (WMDE) (talk) 13:40, 17 April 2020 (UTC)

Feedback on first batch by Jura1

Interesting (re-)start. Reminds me of the useful additions we had a while back for films cross-referencing cast/director to various other databases. It probably went a step further than this batch, as it might also have checked the identifiers on the items being used.

About the sample batch: maybe it could be worth looking at it by property referenced:

Source/P31:

  • I'm not entirely convinced of the usefulness of sourcing items for films from library catalogues, e.g. National Library of Spain ID (P950) in the list. In the past, I experimented with adding VIAF to film items, but that was rather inconclusive.
  • Similarly, MusicBrainz and place name info might not be the ideal combination, even if we have MusicBrainz place ID (P1004)

Some sources:

It might be worth making it simpler to set "deprecated rank" for incorrect statements or statements where the reference might not be sufficient. Also, if we improve the visual presentation of such statements, the actual input might be easier to digest for users. --- Jura 12:31, 15 April 2020 (UTC)

Thanks for checking and giving feedback!
About number of episodes (P1113), number of seasons (P2437): I guess the reference should still be there and as new episodes are released the value gets updated and the reference too? It seems wrong to me to not have a reference simply because it's ongoing.
title (P1476): I'm not sure what you mean there. Could you elaborate?
MusicBrainz: We should probably just add it to the blacklist.
NNL item ID (P3959): I'm not sure what you mean there. Could you elaborate?
Cheers --Lydia Pintscher (WMDE) (talk) 14:04, 17 April 2020 (UTC)

Good start — keep going

I like the general idea, the first execution and also the comments received so far. Things I was missing were comments about archiving those external statements systematically (e.g. via Internet Archive), and examples from scientific sources. --Daniel Mietchen (talk) 21:43, 11 May 2020 (UTC)

User:InternetArchiveBot seems to be handling this. Is there anything special we need to do still on top of that? --Lydia Pintscher (WMDE) (talk) 08:35, 25 May 2020 (UTC)

Looks great!

I looked through the data, and it is really, really promising. This is no trivial data, and it is really good to compare these values with values from external sites. Thank you! I am very excited about where this will go. --Denny (talk) 23:11, 19 May 2020 (UTC)

Automated finding references: new data and a distributed game

Hello all,

As previously announced, the Wikidata development team is working on ways to automatically extract references from external websites, so editors can check them and add them on Items. On this feedback loop page, we presented our first batch of references and collected many useful comments.

Following up on the topic, we improved our data batch based on your feedback. On top of that, we created a distributed Wikidata game called Reference hunt!. With this game, you get a suggestion of an Item and a reference based on structured data from an external website. You can accept the reference if it is good (it will then be added on Wikidata), reject the reference if it is not fitting, or pass if you are not sure.

Feel free to try the game and give us feedback on this talk page! We will also track the edits made by the game with the tag reference-game and monitor the results (how many times people click on accept or reject on a suggested reference) to analyze the overall quality of the data batch.

Cheers, Mohammed Sadat (WMDE) (talk) 11:44, 25 May 2020 (UTC)

We are working on this, but we first want to get a somewhat better understanding of how good or bad the current batch is, where its flaws are, and which parts of it are potentially perfect and can be imported right away. This information is pretty crucial for us to understand how to further improve the system. It will come in the next few days. Mohammed Sadat (WMDE) (talk) 09:14, 28 May 2020 (UTC)
The game is good. Could it also add what the "property" was on the external website, and not only "extracted data"? That would make me feel more secure in making a call whether the data is a good fit or not. Ainali (talk) 15:30, 25 May 2020 (UTC)
Thanks for your feedback Ainali, we're taking that into account. Mohammed Sadat (WMDE) (talk) 10:48, 28 May 2020 (UTC)
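
To make the monitoring idea from the announcement above concrete (counting how often people accept or reject a suggested reference), here is a minimal Python sketch that computes an acceptance rate per source. The input format, the decision strings and the function name are assumptions for illustration. The actual game data may be structured differently.

```python
from collections import Counter, defaultdict


def acceptance_by_source(judgements):
    """Compute the share of accepted suggestions per source.

    judgements is a hypothetical iterable of (source, decision) pairs,
    where decision is "accept", "reject" or "pass".
    """
    counts = defaultdict(Counter)
    for source, decision in judgements:
        counts[source][decision] += 1

    rates = {}
    for source, tally in counts.items():
        # "pass" means the player was unsure, so it is left out of the rate.
        decided = tally["accept"] + tally["reject"]
        rates[source] = tally["accept"] / decided if decided else None
    return rates


if __name__ == "__main__":
    # Made-up judgements, for illustration only.
    sample = [
        ("source_a", "accept"),
        ("source_a", "accept"),
        ("source_a", "reject"),
        ("source_b", "pass"),
    ]
    print(acceptance_by_source(sample))
```

A source with a rate of None has no accept or reject decisions yet, which is different from a source that was always rejected (rate 0.0).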

Automated adding references

Really, we need a bot to mass-add (confident) references. A Wikidata game may not scale to the size of Wikidata.--GZWDer (talk) 19:35, 25 May 2020 (UTC)

  • @GZWDer: Collecting data on what people think is good or not in this way sounds like an important way to evaluate the confidence of such references. Maybe at some point some sources and data types will be so clearly good that they can be automated but I don't think we're close to that yet. ArthurPSmith (talk) 21:38, 25 May 2020 (UTC)
Yes absolutely. It will come. See reasoning above. --- Mohammed Sadat (WMDE) (talk)

Idea: Primary records wiki

A continuation of Wikidata_talk:Automated_finding_references_input#Discussions, added here just not to forget:

A new Wikibase instance for "records", federated with Wikidata (at least for properties). Each record is a set of statements with the same reference, corresponding to an external entry (identified by an external ID). Records may be matched to items, and it would be possible to scrape all records automatically. The instance would also allow defining a statement status (right or wrong) that is independent of what the records say.--GZWDer (talk) 08:14, 17 June 2020 (UTC)
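
To make the proposed data model a little more tangible, here is a minimal Python sketch of a "record" as described above. All class, field and method names are hypothetical, as are the placeholder IDs, and the sketch says nothing about how such records would actually be stored in a Wikibase instance.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Status(Enum):
    UNREVIEWED = "unreviewed"
    RIGHT = "right"
    WRONG = "wrong"


@dataclass
class RecordStatement:
    property_id: str
    value: str
    # Editorial judgement; changing it never alters what the record says.
    status: Status = Status.UNREVIEWED


@dataclass
class Record:
    """One external entry: a set of statements sharing the same reference."""
    id_property: str   # external ID Property identifying the database
    external_id: str   # identifier of the entry in that database
    statements: List[RecordStatement] = field(default_factory=list)
    matched_item: Optional[str] = None  # item ID once matched, else None

    def reference_key(self):
        """The shared reference: which database and which entry in it."""
        return (self.id_property, self.external_id)


if __name__ == "__main__":
    # Placeholder IDs and values, for illustration only.
    record = Record("P_EXAMPLE", "entry-123")
    record.statements.append(RecordStatement("P569", "1950"))
    record.matched_item = "Q_EXAMPLE"
    record.statements[0].status = Status.WRONG  # the recorded value stays "1950"
    print(record.reference_key(), record.statements[0])
```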

Bug when adding new references

Error message: generic_action failed: {"action":"wbsetreference","statement":"Q206249$29BBD9E0-31D1-4DC3-9ACC-64C2C0E750AC","tags":"reference-game","snaks":"{\"P7859\":[{\"snaktype\":\"value\",\"property\":\"P7859\",\"datavalue\":{\"type\":\"string\",\"value\":\"viaf-144773750\"}}],\"P248\":[{\"snaktype\":\"value\",\"property\":\"P248\",\"datavalue\":{\"value\":{\"entity-type\":\"item\",\"numeric-id\":76630151,\"id\":\"Q76630151\"},\"type\":\"wikibase-entityid\"}}],\"P813\":[{\"snaktype\":\"value\",\"property\":\"P813\",\"datavalue\":{\"value\":{\"time\":\"+00000002020-05-13T00:00:00Z\",\"timezone\":0,\"before\":0,\"after\":0,\"precision\":11,\"calendarmodel\":\"http:\\/\\/www.wikidata.org\\/entity\\/Q1985727\"},\"type\":\"time\"}}]}","summary":"The Distributed Game (73): Reference hunt! #distributed-game"}/

Browser: Firefox 78.0.2 (64Bit) OS: Win 10 10.0.19041

This has happened multiple times in a row.

(Is this the right place for a bug report?) --Ferdinand0101 (talk) 13:56, 15 July 2020 (UTC)

Hello @Ferdinand0101:, thanks for reporting this issue! We created a ticket on our bug tracker and we will investigate it as soon as possible.
For the future, you can use the page Wikidata:Contact the development team for bug reports :) Lea Lacroix (WMDE) (talk) 16:24, 15 July 2020 (UTC)
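
For anyone trying to read the error message above: the request it quotes is JSON in which the "snaks" field is itself a JSON-encoded string, so it has to be decoded twice. The Python sketch below shows this on a shortened payload rebuilt from values in the message. The function names are hypothetical, and the sketch only makes the request readable. It does not identify the cause of the failure.

```python
import json


def plain_value(datavalue):
    """Reduce a datavalue to something short enough to read at a glance."""
    value = datavalue["value"]
    if isinstance(value, dict):
        # Entity values carry an "id", time values carry a "time" string.
        return value.get("id") or value.get("time") or value
    return value


def summarize_request(raw):
    """Decode the quoted request and list the reference values per property."""
    request = json.loads(raw)
    snaks = json.loads(request["snaks"])  # second decode: snaks is a string
    values = {
        prop: [plain_value(snak["datavalue"]) for snak in snak_list]
        for prop, snak_list in snaks.items()
    }
    return request["action"], request["statement"], values


if __name__ == "__main__":
    # Shortened payload rebuilt from values quoted in the error message.
    snaks = {
        "P7859": [{"snaktype": "value", "property": "P7859",
                   "datavalue": {"type": "string", "value": "viaf-144773750"}}],
        "P248": [{"snaktype": "value", "property": "P248",
                  "datavalue": {"type": "wikibase-entityid",
                                "value": {"entity-type": "item",
                                          "id": "Q76630151"}}}],
    }
    raw = json.dumps({
        "action": "wbsetreference",
        "statement": "Q206249$29BBD9E0-31D1-4DC3-9ACC-64C2C0E750AC",
        "snaks": json.dumps(snaks),
    })
    print(summarize_request(raw))
```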

Automated finding references: dump and dashboard

Hello all,

As you may know, in May 2020 we released new data for automated references, as well as a game that you can use to associate references with statements. We released this game containing 4200 potential references (see statistics). In the meantime, we parsed many more websites and collected 529K potential new references.

These new references will not be added to the game, because there are too many of them for their relevance to be checked by hand. As requested by some of you after the previous announcement, we published the list of all references in a dump available here.

Subsets of this dump can be reused by bots and tools. However, we advise you to be careful when using it and not to mass-import them into Wikidata without careful review: the data is quite raw, and some references may be wrong or irrelevant. To help you analyze these references and filter the most useful ones, we are also providing a dashboard with an overview of the judgements made in the game, so you can see which parts are likely to be of higher or lower quality.

We’re happy to release the dumps and the dashboard just in time for the Wikidata birthday :)

If you have any questions or encounter issues with the dump or the dashboard, please let us know. Cheers, Lea Lacroix (WMDE) (talk) 08:55, 29 October 2020 (UTC)