User talk:Pintoch

About this board

Previous discussion was archived at User talk:Pintoch/Archive 1 on 2017-06-21.

Rothamsted Research identifiers

Richardostler (talkcontribs)

Hi Pintoch, I've removed the Ringgold ID for this item, as 15552 points to a different organisation (Escola de Policia de Catalunya). The Zenodo source you cited gives a different Ringgold ID for Rothamsted Research, but it also lists an incorrect alt_name and ISNI, so I'm not confident in the reliability of that ID. I've also corrected the ISNI property based on the VIAF value (URL cited).

cheers

Pintoch (talkcontribs)

Awesome, thanks a lot!

Reply to "Rothamsted Research identifiers"
Government seat of the Netherlands

Paulbe (talkcontribs)

Hello Pintoch, I just saw that you claimed on Wikidata in Q4310949 that Amsterdam is the seat of the government of the Netherlands. That is not true and was never the case. Since around 1580, The Hague (in Dutch: Den Haag or 's-Gravenhage) has been the seat of the government and parliament of the Netherlands. ... My dear city, Amsterdam, has never been the capital of a county or province. In 1814 the Dutch constitution named Amsterdam the nominal capital, but that meant almost nothing. The only legal and actual consequence is that new kings (including reigning queens) must be inaugurated in Amsterdam. However, the king or queen, though nominally part of the government, has no direct role in governing. And no king or queen has ever really resided in Amsterdam. Most people in Amsterdam favour a republic, while outside Amsterdam most people in the Netherlands support the moderate nominal monarchy.

Yours, Paulbe

Pintoch (talkcontribs)

Thanks a lot for the fix! That was a mistake indeed!

Reply to "Government seat of the Netherlands"
The Georgian President …

Wurgl (talkcontribs)

It is not your fault; VIAF mixes the two up. The SUDOC and BNF records seem to point to the correct person. Anyway, thanks for your thanks!

Reply to "The georgian President …"

How confident are you in the Ringgold IDs?

Summary by Pintoch

The Ringgold IDs imported from ORCID were wrong; this is now fixed.

ArthurPSmith (talkcontribs)

Hi - I've been happily reconciling some of our institution records with OpenRefine and using the extension feature to fetch data. Anyway, out of curiosity, I did a comparison between the Ringgold IDs in Wikidata and the ones we had from Ringgold several years ago. The surprise: almost NONE of them agreed - just 1 common value out of almost 2000. For example, for Argonne National Laboratory (Q649120) Wikidata has the ID 1251, while the one I had is 57856. For Arizona State University (Q670897) the Wikidata Ringgold ID is 7206, while the number I had is 3357, which turns out to be in Wikidata as the Ringgold ID for McMaster University (Q632891). The numbers seem to be almost completely uncorrelated... any explanation??
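
In case it's useful, this is roughly how I ran the comparison - a minimal sketch, with placeholder file names and columns standing in for our internal export and the OpenRefine output:

import csv

def load_map(path, key_col, val_col):
    # Load a QID -> Ringgold ID mapping from a CSV export.
    with open(path, newline='') as f:
        return {row[key_col]: row[val_col] for row in csv.DictReader(f)}

wikidata = load_map('wikidata_ringgold.csv', 'qid', 'ringgold')  # OpenRefine extension output
ours = load_map('our_ringgold.csv', 'qid', 'ringgold')           # our old Ringgold data

common = set(wikidata) & set(ours)
agree = sum(1 for qid in common if wikidata[qid] == ours[qid])
print(len(common), 'institutions compared,', agree, 'matching Ringgold IDs')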

ArthurPSmith (talkcontribs)

and the number we had for McMaster University (3710) seems to be in Wikidata as the ID for Wesley Biblical Seminary (Q7983850). I suppose this could be a fun game, but I really don't understand why these numbers are so different. While we had access, we did some cross-checking with Ringgold's online "Identify" lookup, and I thought our numbers were good. Did they change the whole database at some point?

Pintoch (talkcontribs)

hmmm, that's a very interesting issue! I'll investigate :-)

Pintoch (talkcontribs)

It's very curious. I tried to determine the correct ID for Arizona State University (Q670897) with the official Ringgold lookup interface (as a guest).

According to that, the correct ID is 7864 (different from both yours and mine). The number I had was 7206, which resolves to "Allan Hancock Joint Community College District", whereas your 3357 resolves to "Arizona Board of Regents" (which is probably related to Arizona State University (Q670897), but still incorrect).

Same issue for Argonne National Laboratory (Q649120): 1251 on Wikidata, 1291 on Ringgold, 57856 for you…

Do they just periodically reallocate their IDs?!

ArthurPSmith (talkcontribs)

I have no idea what the reason is - I was concerned that something had gone wrong with your import from the OpenISNI/ORCID dump, but it sounds like there's a more serious issue. Maybe they assign completely different numbers depending on which customer is using or querying? Weird.

Pintoch (talkcontribs)

I'll re-check the consistency between ORCID's Ringgold IDs and the resolver's with fresh data.

Pintoch (talkcontribs)

I've emailed ORCID about the issue:

Sorry to bother you again about institution identifiers. I have observed a bug that I cannot explain.

The Ringgold identifiers that ORCID uses seem to be completely uncorrelated with the ones available at http://ido.ringgold.com/identify_new/cfm/si_pd.cfm?PID=1 .

For instance, consider "Arizona State University". When looked up inside ORCID, we get 7190. On ringgold.com, we obtain 7864. I have tried a bunch of institutions, and for each of them the numbers were different.

Why is this a problem? Consider ORCID integrations, such as the one we have in Oxford. They add an Oxford affiliation to the ORCID profiles of their users. To do so, they use the Ringgold identifier 6396, which is correct according to ringgold.com. However, when manually adding the same institution to a profile (via ORCID's web interface), we get a different identifier: 5818.

So the use of institution identifiers is inconsistent even within ORCID itself. This seems to be a rather serious issue.

This issue seems to compound an existing one that I reported earlier:

https://github.com/ORCID/ORCID-Source/issues/3297#issuecomment-274579956

I assume it is not your fault, as Ringgold IDs seem to be very unstable, but I find it quite concerning.

ArthurPSmith (talkcontribs)

Thanks for contacting them - I hadn't looked at that GitHub issue previously, but I'm assuming, as explained there, it's caused by ORCID's use of an autocomplete UI and is unrelated to Ringgold issues. Looking forward to hearing what ORCID has to say about this!

Pintoch (talkcontribs)

I was aware of the GitHub issue before doing the import, so I did avoid that particular bug. However, I have now understood what I have been doing wrong: ORCID stores institutions under its own internal IDs, which look a lot like Ringgold's because they were allocated roughly in sequence when ORCID imported the whole Ringgold database. I think I have figured out a way to extract the correct Ringgold IDs from these internal ORCID IDs.

I should have checked the IDs I was adding against the official lookup service long ago… it was quite dumb of me to assume these numbers were the same thing! Now I'll take care of the cleanup.
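
For what it's worth, here is the kind of cross-check I am doing now. This is only a sketch: the endpoint and field names are my recollection of ORCID's v2.1 public API, so treat them as assumptions to verify against real responses:

import requests

def ringgold_affiliations(orcid):
    # Fetch the employments of a public ORCID record and return the
    # (organization name, Ringgold ID) pairs it declares.
    url = 'https://pub.orcid.org/v2.1/%s/employments' % orcid
    r = requests.get(url, headers={'Accept': 'application/json'})
    r.raise_for_status()
    pairs = []
    for summary in r.json().get('employment-summary', []):
        org = summary.get('organization', {})
        dis = org.get('disambiguated-organization') or {}
        if dis.get('disambiguation-source') == 'RINGGOLD':
            pairs.append((org.get('name'),
                          dis.get('disambiguated-organization-identifier')))
    return pairs

print(ringgold_affiliations('0000-0002-1825-0097'))  # ORCID's public test record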

ArthurPSmith (talkcontribs)

Ah! OK, that will be interesting to cross-check when you've done it. You'll update the Zenodo file then?

Pintoch (talkcontribs)

Yes, I will also update the Zenodo file. I have started fixing the identifiers; it should still take a few days.

Jura1 (talkcontribs)

I had been wondering why these made no sense…

Pintoch (talkcontribs)

Yeah, I am really sorry about that - I have no excuse for not checking carefully! Out of curiosity, what do you use Ringgold IDs for?

Jura1 (talkcontribs)

I had just clicked a few links back when you first added them.

No worries. Only people who don't contribute make no errors.

Jura1 (talkcontribs)

The new QuickStatements and PetScan can be used to remove statements.
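
For instance, a removal batch can be generated like this - just a sketch, assuming QuickStatements' convention that a line starting with "-" deletes a statement (P3500 being the Ringgold ID property; the pairs below are placeholders):

# Sketch: turn (QID, wrong Ringgold ID) pairs into QuickStatements
# removal commands (tab-separated: item, property, value).
wrong = [('Q649120', '1251'), ('Q670897', '7206')]  # placeholder pairs

with open('remove_ringgold.qs', 'w') as f:
    for qid, rid in wrong:
        f.write('-%s\tP3500\t"%s"\n' % (qid, rid))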

ArthurPSmith (talkcontribs)

Pintoch has been fixing the IDs - check his recent contributions...

Jura1 (talkcontribs)

yes, but it might be easier to remove some and then add new ones.

Pintoch (talkcontribs)

Jura1: I have the feeling it's cleaner to reuse the statements; it makes less of a mess in the history, adding only one tiny change… But I will indeed remove some statements at some point (those for which the true Ringgold ID cannot be found).

RP88 (talkcontribs)

I just noticed this edit, in which the City of Antwerp's Ringgold ID was changed from 77797 to 81818, while the reference was left as DOI 10.5281/zenodo.268334. The Zenodo reference definitely lists Antwerp's Ringgold ID as 77797 (it lists Monroe City Public Library as 81818). If the Zenodo reference is incorrect about the value 77797, shouldn't you also remove the reference in addition to updating the value to 81818?

Pintoch (talkcontribs)

Hi RP88, I am going to update the Zenodo dataset as soon as its generation is complete. There was a bug in the script that generated the first version of it, so the IDs it currently contains are inaccurate. Updating Wikidata takes a long time, so I started changing the statements before the dataset is updated, which is why the references are currently inconsistent - sorry about that!

RP88 (talkcontribs)

Thanks for the explanation, I appreciate it.

ArthurPSmith (talkcontribs)

I see the updates are continuing - maybe you should get a bot account! :)

By the way, I've processed the latest GRID database release - added almost 800 new items, and an additional 200+ existing items had GRID ID links added.

Pintoch (talkcontribs)

Brilliant!

Yeah, I should get a bot account, although for this particular task the editing rate is really not an issue - it's just that I download the data slowly from ORCID's side.

ArthurPSmith (talkcontribs)

Just to follow up - I am guessing you're not quite finished with this, but it's looking good. I re-ran the extension in OpenRefine to pull Ringgold IDs from Wikidata, and now I'm getting 1522 IDs matching our records, and only 271 non-matches.

Pintoch (talkcontribs)

Yeah, it is taking ages. There are still about 2500 erroneous Ringgold IDs on Wikidata awaiting fixes. I'll make sure all of them are either fixed or deleted.

Pintoch (talkcontribs)

The fix is complete and I have updated the Zenodo record: https://zenodo.org/record/844869 .

RP88 (talkcontribs)

Will you now be doing a pass in which the references for the values you changed are also updated from DOI 10.5281/zenodo.268334 to DOI 10.5281/zenodo.844869?

Pintoch (talkcontribs)

I'm not sure - it would indeed be better, but is it really necessary? The new version of the record is linked from the old one.

RP88 (talkcontribs)

Doesn't an explicit link to the DOI for version 1, rather than to the DOI for version 2 or to the DOI for the "latest" version, imply that the data is from version 1, which is no longer the case? Why did you originally use a link to the version 1 DOI rather than the "latest" DOI, if you didn't plan on changing it when you updated to a newer dataset? I think you should either use the DOI for the version the ID actually appears in, or, if you plan on future updates and don't like the idea of having to update the reference DOI each time, switch to the "latest" DOI.

Jura1 (talkcontribs)

You could try template:Autofix on the property.

Pintoch (talkcontribs)

RP88: the "latest" DOI was only generated when I created a second version (DOI versioning is a new feature in Zenodo, I had not realized it would work like that. If I was aware of it, I would have done things differently indeed). Jura1: that's a nice idea, I'll give it a try.

Reply to "How confident are you in the Ringgold ID's?"
Summary by ArthurPSmith

I tried it - it works!

ArthurPSmith (talkcontribs)

I ran through my somewhat lengthy update process for the latest (July 12, 2017) GRID release (grid.json -> db tables -> compare Wikidata QIDs from GRID against a WDQS query and fix discrepancies -> export as CSV for OpenRefine & match by name, country, ISNI, URL -> import the rest as new items). Anyway, if you could run the update jobs you've been doing to add the other elements from the latest GRID data, that would be great, thanks!
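
(For reference, the first step is just flattening grid.json - roughly like this, assuming the release still uses a top-level "institutes" array; field names may differ between releases:)

import csv, json

with open('grid.json') as f:
    institutes = json.load(f)['institutes']

with open('grid.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(['grid_id', 'name', 'country', 'url'])
    for inst in institutes:
        if inst.get('status') != 'active':
            continue  # skip redirected/obsolete records
        addr = (inst.get('addresses') or [{}])[0]
        w.writerow([inst['id'],
                    inst.get('name', ''),
                    addr.get('country', ''),
                    (inst.get('links') or [''])[0]])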

ArthurPSmith (talkcontribs)

PS: there are about 1100 new items added, and I think an additional 300-400 or so GRID IDs added to existing Wikidata items.

Pintoch (talkcontribs)

That's great! But I think it's time to automate this a bit more… It might have to wait until I (or someone else!) implement an "export to Wikidata" feature for OpenRefine (https://github.com/OpenRefine/OpenRefine/issues/1213). By the way, if you have any ideas about how it would best be done, feel free to chime in… My end goal would be to have just one OpenRefine JSON change history that we could apply to any new version of the dataset and that would generate the required edits. But we're not quite there yet!
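
Part of the plumbing for that exists already: OpenRefine's HTTP API can replay a saved operation history onto a project. A sketch of a driver script, assuming a local OpenRefine on the default port and a project created from the new dataset (the project id below is a placeholder; newer OpenRefine versions also require a CSRF token on POSTs):

import requests

project_id = '1234567890123'  # placeholder: the project built from the new dump
with open('operations.json') as f:
    ops = f.read()  # the exported JSON change history

r = requests.post('http://127.0.0.1:3333/command/core/apply-operations',
                  params={'project': project_id},
                  data={'operations': ops})
r.raise_for_status()
print(r.json())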

ArthurPSmith (talkcontribs)

If OpenRefine could import directly from JSON files (selecting specific fields to import), that would be a helpful step... Also, is there a way to merge in data from a separate query somehow (a WDQS select of all GRID IDs - it could be in a separate JSON file, linked by GRID ID)? I'm not sure it's entirely the right tool to do everything, though...

Pintoch (talkcontribs)

By the way, I see you are now merging some of the newly created items. That is a good opportunity to understand where the reconciliation service fails.

First, I would be careful about including the URL as a column in the reconciliation process, because it increases the risk of 1) missing existing items which do not have a URL set, and 2) generating false positives when the URLs differ only by some small subdomain (see the sketch at the end of this message).

Second, there is still the problem that the search APIs provided by Wikidata are just not good enough for this sort of matching. You might have noticed that reconciliation is a bit slower now: that is because I have started querying both search APIs for matches (the standard MediaWiki search API and the Wikibase-specific autocompletion used for the top-right input field).

But if you run into any case where a duplicate item could have been found using Wikidata's search functionality, I would be very interested to look at it, because there can always be some bug I am not aware of.
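
Here is the sketch I mentioned above: normalizing URLs down to a bare hostname before reconciling, so that small subdomain differences stop producing mismatches (standard library only):

from urllib.parse import urlparse

def bare_host(url):
    # Reduce a URL to its hostname, dropping scheme, path and "www.".
    if '//' not in url:
        url = '//' + url  # make urlparse treat the input as a network location
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith('www.') else host

print(bare_host('http://www.example.edu/dept/'))  # example.edu
print(bare_host('https://example.edu'))           # example.edu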

ArthurPSmith (talkcontribs)

I think it's my own fault; I should have run a match based on name alone, and I'll try that next time. A lot of the items that would have matched that way did not have a country or URL or ISNI in Wikidata, so they were missed in the reconciliation. The other issue is name matches in other languages - GRID often provides native-language labels, but I'm only pulling the main label (nominally English) into OpenRefine...

Pintoch (talkcontribs)

Concerning JSON files: actually, it can! If you just open grid.json with OpenRefine, you get to select which level in the JSON tree it should treat as records, and then it automagically imports everything in a pretty neat way. However, there is a quite annoying bug that prevents the use of records mode.

To get rid of the GRID IDs already imported, directly in OpenRefine, I see two solutions.

The first (which I am trying now, but it is quite slow) is to reconcile using only the name and the GRID ID, then use the new data extension feature to fetch all the GRID IDs Wikidata stores for the matched rows, and remove all the rows where that succeeded.

The second would be to get all the GRID IDs stored on Wikidata via WDQS, import that as a separate project, and then do a join inside OpenRefine using the "cross" function. I haven't tried that, but it might be more efficient.
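
Fetching the existing mapping from WDQS is a short SPARQL query; something like this (P2427 is the GRID ID property; the User-Agent string is just a courtesy placeholder):

import requests

QUERY = 'SELECT ?item ?grid WHERE { ?item wdt:P2427 ?grid }'
r = requests.get('https://query.wikidata.org/sparql',
                 params={'query': QUERY, 'format': 'json'},
                 headers={'User-Agent': 'grid-dedup-sketch/0.1 (example)'})
r.raise_for_status()
known = {b['grid']['value']: b['item']['value'].rsplit('/', 1)[-1]
         for b in r.json()['results']['bindings']}
print(len(known), 'GRID IDs already on Wikidata')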

Pintoch (talkcontribs)

About ISNIs: given the current scoring in the reconciliation interface, it never hurts to include a unique ID. Unique IDs do not reduce the matching score if they are not found - but yes, country and URL do. However, you should still see (unmatched) reconciliation candidates with a lower score.

And yes, doing multiple reconciliation passes (with different parameters on different facets) gives pretty good results in my experience.

Pintoch (talkcontribs)

By the way, I forgot to add a link to explain what the data extension feature is: https://github.com/OpenRefine/OpenRefine/pull/1210

ArthurPSmith (talkcontribs)

Oh! That's very nice! I'll have to play around with it!

ArthurPSmith (talkcontribs)

Maybe this would be better as an email exchange - you can email me at arthurpsmith AT gmail.com. Anyway, I'm really not understanding the scoring mechanism in OpenRefine reconciliation. Usually it seems to work nicely: scores of 100 indicate a complete match, around 70-80 indicate things mostly matched, and less than 50 indicates a poor match. So why did my reconciliation get a score of 36 in this case (cutting and pasting a line from OpenRefine here):

Row 4275: 628485 | TAKAQS-JP | National Institutes for Quantum and Radiological Science and Technology | Takasaki | JAPAN | Takasaki | a-ja | JPN | taka.qst.go.jp

Candidate: National Institutes for Quantum and Radiological Science and Technology (score 36)

The English label for Q24067079 is identical to the name string I provided (and that's what I was reconciling on), and the only thing I added in the reconciliation was to use the 3-letter country code (JPN here) to match the SPARQL property path P17/P298 (country / ISO 3166-1 alpha-3 code). And Q24067079 has P17 set to Japan, which has P298 "JPN". So this should have scored 100. What happened?

Pintoch (talkcontribs)

That is very curious, because when I run the query manually I get a perfect score (use the following TinyURL: ycq74ylo - I hate the spam filter).

(We should have a web interface to try this out and see a breakdown of the score.)

Here is a tentative explanation.

  • First, in the reconciliation dialog, the property was not correctly selected. That is probably due to a bug in OpenRefine's auto-completion UI that I have noticed a few times: even if you click correctly on the proposed property ("SPARQL: P17/P298"), it does not store the identifier of that property ("P17/P298") but just the text you entered ("SPARQL: P17/P298"). Because of that, the reconciliation service does not recognize the property. It is easy to check whether this bug was triggered in your case: look at the JSON representation of your change history and inspect the reconciliation metadata - there should not be any "SPARQL: " in it.

This causes a drop of about 30% in the reconciliation score (so we are at about 70). This is definitely an important bug that needs to be solved.

  • Second, you were probably reconciling against a narrow type, and this particular item was not an instance of that type. Normally, items which are not of the desired type are not included at all in the list of suggestions, but when we cannot find any candidate, we do include the ones with an incorrect type, dividing their score by two. This way they look like bad matches, and the risk that the user matches them by mistake later on is smaller. This rule is of course entirely arbitrary, and I'd be very happy to change it to anything you find more sensible. Anyway, this one is easier to fix: just make sure your type is broad enough (if my guess is correct).
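
To make that rule concrete, it amounts to something like this (a simplification for illustration, not the service's real code):

def rank_candidates(candidates, target_type):
    # Candidates of the requested type win outright; wrong-typed (or
    # untyped) ones are only shown when nothing else matched, with their
    # score halved so they can never be auto-matched.
    right_type = [c for c in candidates if target_type in c['types']]
    if right_type:
        return right_type
    return [dict(c, score=c['score'] / 2) for c in candidates]

# A ~70 score (after the missing-property drop) halves to ~35,
# close to the 36 you observed.
cands = [{'name': 'National Institutes for Quantum and Radiological Science and Technology',
          'types': [], 'score': 72}]
print(rank_candidates(cands, 'Q43229'))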

I hope my explanations are not too far off from reality!

Pintoch (talkcontribs)

Hmmm, actually my diagnosis does not look very accurate… I would be interested to have a look at the JSON of your edit history.

Pintoch (talkcontribs)

One last thing: I notice you only added the P31 and P17 claims to the item yesterday. Items are cached for up to one hour in the reconciliation service, so if the reconciliation happened yesterday, that might be the cause.

ArthurPSmith (talkcontribs)

Here's the JSON of the reconcile step this was from:

[
  {
    "op": "core/recon",
    "description": "Reconcile cells in column NAME to type Q43229",
    "columnName": "NAME",
    "config": {
      "mode": "standard-service",
      "service": "https://tools.wmflabs.org/openrefine-wikidata/en/api",
      "identifierSpace": "http://www.wikidata.org/entity/",
      "schemaSpace": "http://www.wikidata.org/prop/direct/",
      "type": {
        "id": "Q43229",
        "name": "organization"
      },
      "autoMatch": true,
      "columnDetails": [
        {
          "column": "ISO country code",
          "propertyName": "SPARQL: P17/P298",
          "propertyID": "P17/P298"
        }
      ],
      "limit": 0
    },
    "engineConfig": {
      "mode": "row-based",
      "facets": []
    }
  }
]

The propertyID looks fine. However, in this case I think the problem was that the country (and institution type) weren't added until yesterday, AFTER I'd run the reconcile. So it would probably be just fine today... oh well. Anyway, as a general note, maybe items with no P31 shouldn't be scored down by 50% - only items with a conflicting (not subclass) P31?

Pintoch (talkcontribs)

Hmm, so you would include all the items without P31 without discounting their score? Doesn't that sound a bit dangerous? For instance, take a string like "New York". Surely there are tons of things with a label close to that (music albums, bands, films, brands, whatever), and there is a high risk that one of them does not have any P31. So if you reconcile "New York" against a city type, the risk is that this random item pops up with a score of 100 and prevents the auto-match with the actual city.

But I definitely agree that items with a conflicting type should also be included in the results (with some discount as well) - currently they aren't at all. And the 50% discount is probably too strong… I just wanted to make sure such items are never auto-matched.

ArthurPSmith (talkcontribs)

So at least we understand what happened here, thanks!

Conflicting P31s aren't shown at all? I guess that's OK; if you really want to see them, you can just remove the type part of the match. I agree no-type cases should be discounted somewhat - maybe just subtract 20 from the score if there's no type?

Want to do a proposal together for WikidataCon?

Summary by Pintoch

OpenRefine session for WikidataCon: ArthurPSmith will propose one.

ArthurPSmith (talkcontribs)

It looks like I'll be there - if I can get one of the remaining 50 tickets!

I notice the new "Ideas List" includes "OpenRefine (workshop or demo)" as a suggestion - I could share some of my experiences with it, though you're the expert. Or we could do something on organization records?

Pintoch (talkcontribs)

I would looooove it (any of your proposals), but I won't be able to attend - it's too far, and my PhD needs some attention… But yeah, go for it - we need to sell OpenRefine to the masses! Hopefully by then I'll have implemented the QuickStatements export.

ArthurPSmith (talkcontribs)

Oh, sorry you're missing it! Yes, PhD is important...

ArthurPSmith (talkcontribs)

Hi - I noticed you added some "properties for this type" statements to "chemical element". Which is fine with me, but I think you might want to add your perspective to User_talk:ArthurPSmith#chart_of_nuclides (discussion particularly with TomT0m) and User:ArthurPSmith/Draft:Elements, Nuclides, Chemicals Ontology - the main debate seems to be whether a chemical element should be considered essentially the class of its atoms (in which case it makes sense to talk about it as a class, as it has clear instances which are physical manifestations in the world) or as something more abstract that is not a class and does not have physical instances in itself, but rather has various manifestations (as atoms, as chemical substances, within more complex molecules, etc.). For example, one can't say that an atom has a "boiling point" or a "density", so if "chemical element" really refers only to atoms, attaching "boiling point" as a property is wrong. On the other hand, treating a chemical element as a more abstract concept, the "boiling point" may not make sense directly, but it can clearly be understood to refer to the usual simple substance composed purely of that element under standard conditions. So I think it's OK to have these properties in the latter case, but not really in the former.

Pintoch (talkcontribs)

I see, it's quite complicated… Yes, I completely agree that some properties I have added might fit better on some sub- or superclasses of that particular class. As to the discussion about P31 vs P279, I am not sure I have much to say. Intuitively I would have said "carbon is a chemical element" (so P31), but I can see why you would also want to mark it as P279 (so that all the isotopes are themselves subclasses of carbon).

ArthurPSmith (talkcontribs)

Thanks for your comment - I think you're suggesting settling on roughly what I interpreted as TomT0m's proposal. I may work on fleshing that out a bit more; maybe we can actually come to an agreement here...
