Wikidata talk:Primary sources tool/Archive/2017

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.

Suggestions at Q11634

Thousands of false positives in P170. d1g (talk) 13:28, 22 February 2017 (UTC)

ID gathering

It would be nice if the tool could import Crunchbase person ID (P2087) and Crunchbase organization ID (P2088). Both of them could also provide additional data.

WikiTree person ID (P2949) and Find a Grave memorial ID (P535) would be nice to have but I wouldn't automatically import additional data due to UGC concerns. ChristianKl (talk) 13:56, 28 September 2016 (UTC)

What are UGC concerns? Thanks, GerardM (talk) 09:39, 7 May 2017 (UTC)

MusicBrainz redirects

Hello! Where does the tool draw the MusicBrainz IDs from? It seems like it often suggests IDs which are redirects to the actual MusicBrainz entity and people will approve the claims without thinking further, which leads to items having redundant IDs. See for example Muscle Museum (Q786253) where a whopping nine MusicBrainz release group ID (P436)’s are suggested, all of which are redirects to 8c99a0a1-bfaf-3c21-b871-b56af6ac7b01. You can tell they’re redirects to said ID by clicking the title or one of the tabs on the MusicBrainz page. According to MusicBrainz docs: ”An entity can have more than one MBID. When an entity is merged into another, its MBIDs redirect to the other entity.” –Kooma (talk) 19:06, 8 January 2017 (UTC)
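Since the MusicBrainz web service resolves merged MBIDs to their target entity, a suggestion pipeline could filter out redirected IDs before proposing them. The sketch below is a minimal illustration, not part of the tool: the function names are my own, and the only assumption about the web service is that a ws/2 JSON lookup returns the canonical entity with its canonical "id" field.

```python
import json
import urllib.request

def canonical_mbid(entity_json):
    # MusicBrainz resolves merged (redirected) MBIDs to the target entity,
    # so the "id" field of the response is always the canonical MBID.
    return entity_json["id"]

def is_redirect(queried_mbid, entity_json):
    # The queried MBID is a redirect if the entity that comes back
    # carries a different canonical id.
    return canonical_mbid(entity_json).lower() != queried_mbid.lower()

def fetch_release_group(mbid):
    # ws/2 JSON lookup; MusicBrainz etiquette asks for a User-Agent header.
    req = urllib.request.Request(
        f"https://musicbrainz.org/ws/2/release-group/{mbid}?fmt=json",
        headers={"User-Agent": "redirect-check-example/0.1 (example@example.org)"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

A tool could then suggest only the canonical MBID, collapsing the nine redundant suggestions on Muscle Museum (Q786253) into one.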

The data we have is dated and the fact that MusicBrainz moves on makes it even more infuriating that we did not import this data in the first place. There are bots that can fix redirects so including them even as a redirect is an improvement. Thanks, GerardM (talk) 09:39, 7 May 2017 (UTC)

Primary Sources gadget tweaking required

Would whoever shunts data and match terms into the Primary Sources gadget please look to remove "England (Q21)" and "Scotland (Q22)" from the country of citizenship suggestions, and replace them with "Kingdom of England (Q179876)" and "Kingdom of Scotland (Q230791)". The items for England and Scotland are not countries that offered citizenship at that time, and are incorrect for people of the 17thC and prior. It would also be really useful if 19thC and 18thC Brits were offered United Kingdom of Great Britain and Ireland (Q174193) and Great Britain (Q161885) respectively, rather than the modern United Kingdom (Q145). Thanks.  — billinghurst sDrewth 10:39, 6 May 2017 (UTC)

There is no intelligence in what is proposed. Do remember that there is sufficient crap in Wikidata itself to appreciate the standard improvement it provides. With Scotland and a date we can query and fix things. With no data even this is not in the cards. Thanks, GerardM (talk) 09:37, 7 May 2017 (UTC)

I am not sure what is happening in this particular case; however, some of the proposals are way out there.  — billinghurst sDrewth 03:39, 7 May 2017 (UTC)

Yes there are items where there is a lot of crap. Delete and move on. They are the result of a faulty import not of an underlying problem. Thanks, GerardM (talk) 09:35, 7 May 2017 (UTC)

Adding new datasets?

The section about data donors mentions that there were once plans to open up the primary sources tool to other data sources. Has any progress been made on this? It would be good to update this section in this regard. Thanks a lot! − Pintoch (talk) 00:34, 11 March 2017 (UTC)

Indeed. @Hjfocs: any progress with this tool ? --I9606 (talk) 17:54, 8 May 2017 (UTC)
@I9606: the IEG renewal for the primary sources tool uplift will be kicked off at m:WikiCite_2017. Stay tuned. --Hjfocs (talk) 14:18, 20 May 2017 (UTC)

I prefer to have 4 shortcuts to resolve all "PS" requests in bulk

  • A/ENTER = approve
  • D/DELETE = reject
  • W/UP = select item above
  • S/DOWN = select item below
  • ESC = exit the PS tool and stop using these shortcuts if they conflict with other Wikidata shortcuts

d1g (talk) 10:57, 5 July 2017 (UTC)

Quality of sources

While the PST is fantastic overall, I and others have noticed some very poor sources offered as potential references, including the widely lambasted Conservapedia. True, we have a blacklist, and Fuzheado has been adding some bad sources people have recently found, but that requires people to find them in the first place. There should be more transparency about what is going into the tool in the first place so editors can evaluate these sources. I don't advocate pre-approval, but there should be an opportunity to either object or place sources on that blacklist. Gamaliel (talk) 01:27, 2 August 2017 (UTC)

Thanks Gamaliel for bringing this up. @Tpt, Hjfocs: This should be relevant to your interests. Let's talk here and at WikidataCon about a meaningful and quick review process? --LydiaPintscher (talk) 11:36, 3 September 2017 (UTC)
@Gamaliel: the bad sources you mentioned probably come from Freebase: it seems to contain a lot of them, due to the Google Knowledge Vault, which was merged into the dataset.
I think that's the main reason behind the need for an a posteriori URL blacklist, but Tpt certainly knows more.
After Freebase, efforts like StrepHit did it the opposite way: a whitelist was outsourced before generating the datasets.
@LydiaPintscher, Tpt: what are your thoughts about making the dataset provider responsible for proposing such whitelist to the community? --Hjfocs (talk) 09:04, 4 September 2017 (UTC)
That seems sensible to me. --LydiaPintscher (talk) 09:22, 4 September 2017 (UTC)
I confirm that the "bad" sources come from the Freebase datasets, and it was to filter out the "worst" ones that the blacklist was created. I am not sure that the whitelist idea is the most sensible if a lot of different sources, each one used for only a few claims, are provided, but for specific data extraction tasks like StrepHit it definitely makes sense. Tpt (talk) 09:35, 4 September 2017 (UTC)
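The a posteriori blacklist discussed above amounts to dropping any suggested statement whose reference URL points at a rejected domain. The sketch below illustrates that filtering step only; the domain list, field name, and function names are hypothetical, not the tool's actual implementation.

```python
from urllib.parse import urlparse

# Hypothetical community-maintained blacklist of reference domains.
BLACKLISTED_DOMAINS = {"conservapedia.com"}

def domain_of(url):
    # Normalize the hostname so "www.conservapedia.com" matches the entry.
    host = urlparse(url).hostname or ""
    return host.lower().removeprefix("www.")

def filter_suggestions(suggestions):
    # Keep only suggested statements whose reference URL is not blacklisted;
    # everything else passes through for human curation.
    return [
        s for s in suggestions
        if domain_of(s.get("reference_url", "")) not in BLACKLISTED_DOMAINS
    ]
```

A whitelist, as used by StrepHit, is the same check with the membership test inverted: keep a suggestion only when its domain appears in an approved list.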

Open Library identifiers

Hoi, it would be really helpful if we could get an extraction of all the Open Library identifiers. What is needed is the Wikidata / OL / VIAF identifiers and a label. This extract will be reviewed at Open Library; the identifiers will be checked for redirects. Any duplicates will be removed and their current identifier will be provided. This file can be included in the existing items at Wikidata. The Wikidata and VIAF identifiers will be included in Open Library. Thanks, GerardM (talk) 10:36, 4 September 2017 (UTC)

@GerardM: this seems like the main goal of the soweego project proposal: m:Grants:Project/Hjfocs/soweego
--Hjfocs (talk) 09:33, 28 September 2017 (UTC)
P.S.: please do ping me when you write here, as I'm not getting notifications otherwise.
@Hjfocs: I am looking for existing data in the Primary Sources Tool. Nobody can provide it. Currently it is a black box. Thanks, GerardM (talk) 11:20, 28 September 2017 (UTC)
@GerardM: you can use the current API to get existing statements about what you are looking for.
--Hjfocs (talk) 13:21, 28 September 2017 (UTC)

Rejecting primary sources suggestions based on constraint violations?

I came across Sir Weldon Dalrymple-Champneys, 2nd Baronet (Q23308798) where the PST suggests some nonsensical claims that should be rejected just based on the constraint violations that they would generate if accepted. Now that we have more infrastructure to deal with constraint violations, I wonder if it would be possible to detect in advance that these claims violate constraints, and therefore to filter them out automatically.

@Lucas Werkmeister (WMDE): How hard would it be to evaluate in advance the constraint violations generated by a set of candidate statements, without adding these statements to Wikibase? I have no idea how the backend for constraint violations currently works, so I have no idea how hard it would be.

Pintoch (talk) 14:34, 15 October 2017 (UTC)

@Pintoch: I think the technology behind it would be the same as for phab:T168626 – add an additional API parameter that takes a serialized snak or statement instead of a statement ID. I don’t think it’s very difficult (in terms of the code required, it’s mostly a new Context subclass, plus some code in the API to instantiate it), but the task isn’t high priority for now. --Lucas Werkmeister (WMDE) (talk) 13:41, 16 October 2017 (UTC)
@Lucas Werkmeister (WMDE): Great! Thanks for the link, I'll watch that. − Pintoch (talk) 15:21, 16 October 2017 (UTC)
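For statements that are already saved on an item, the `wbcheckconstraints` action API can report constraint check results today; checking an unsaved candidate snak is exactly what the phab:T168626 follow-up above would add. The sketch below is a rough illustration: the response layout walked by `count_violations` reflects my reading of the wbcheckconstraints JSON and should be verified against a live response, and the function names are my own.

```python
import json
import urllib.request
from urllib.parse import urlencode

API = "https://www.wikidata.org/w/api.php"

def check_constraints(item_id):
    # action=wbcheckconstraints reports constraint check results for the
    # statements already saved on an item (not for candidate statements).
    params = urlencode({"action": "wbcheckconstraints",
                        "format": "json", "id": item_id})
    with urllib.request.urlopen(f"{API}?{params}") as resp:
        return json.load(resp)

def count_violations(report, item_id):
    # Walk the per-statement constraint reports and count "violation"
    # statuses on main snaks. Assumed layout:
    #   wbcheckconstraints -> Qid -> claims -> Pid -> [statement ->
    #   mainsnak -> results -> [{"status": ...}, ...]]
    n = 0
    claims = report.get("wbcheckconstraints", {}) \
                   .get(item_id, {}).get("claims", {})
    for statements in claims.values():
        for statement in statements:
            for result in statement.get("mainsnak", {}).get("results", []):
                if result.get("status") == "violation":
                    n += 1
    return n
```

A PST-side filter could then run such a check per candidate (once the snak-serialization parameter exists) and suppress suggestions that would introduce new violations.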


Religion suggestions: Mormonism

Hi, I've noticed that sometimes this tool will suggest adding "religion>Mormonism" to items. However, Mormonism isn't a religion but a set of religious traditions, kind of like Protestantism. I think it's more useful to state which denomination of Mormonism a person belongs to; for pages I work on it's usually "religion>The Church of Jesus Christ of Latter-day Saints". I don't know how the tool decides to offer the religion field, maybe when it sees "Mormon" in a source? I can completely understand the confusion. Do you think it's better to have "religion>Mormonism" on a page so that someone can come and make it more specific later? Thanks. Rachel Helps (BYU) (talk) 19:27, 6 December 2017 (UTC)

Suggestions with no references

I think this tool should not suggest any statement without a reference that can be verified by an editor. - PKM (talk) 20:12, 7 December 2017 (UTC)

Suggesting new datasets?

Hello! From the information pages about this tool, it is a bit unclear to me whether it's possible to suggest additional datasets for inclusion. In any case, I manually add quite a few statements that are sourced to the following (all pretty reliable) datasets: RKDartists (Q17299517), RKDimages (Q17299580) and Online Dictionary of Dutch Women (Q13135279). I was wondering whether these datasets can be added to the corpus from which the Primary sources tool draws newly suggested (referenced) statements? By the way, I agree with PKM above that suggested statements should preferably be referenced. Many thanks! Spinster 💬 18:15, 10 December 2017 (UTC)