User talk:Magnus Manske

About this board

Previous discussion was archived at User talk:Magnus Manske/Archive 9 on 2015-08-10.

PKM (talkcontribs)

I've discovered several cases where multiple locations in different US states were automatically matched to a single location of the same name in a different state. I am cleaning these up as I find them, but you might want to look at the code.

As an example - locations called "Inspiration Point" in several states were matched to Inspiration Point in Wyoming (Q14714736). I think I fixed all of these.

PKM (talkcontribs)

I just found another set - Tuckahoe Creek - I have NOT cleaned these up so you can see that three of them are matched to Q7851024.

Magnus Manske (talkcontribs)

Sorry for being slow - these are all "automatic matches" in mix'n'match, right? Those are just matched by name, and they are not written to Wikidata unless someone manually confirms them. I can't really see how automatic matching could be improved, unless I write description-parsing code specific for TGN, which I would rather avoid given that there are now ~600 catalogs...
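A minimal sketch of why name-only matching behaves this way. It is not mix'n'match's actual code: the function names, the entry ids and the place descriptions are made up for illustration, and only Q14714736 comes from the example in this thread. It also shows one cheap flag that needs no catalog-specific description parsing: several entries from the same catalog landing on one item.

```python
from collections import defaultdict

def auto_match_by_name(entries, items):
    """Match each catalog entry to the first item whose label equals its name."""
    by_label = {}
    for qid, label in items:
        by_label.setdefault(label.lower(), qid)
    # The description is deliberately ignored, as in pure name matching.
    return {entry_id: by_label.get(name.lower()) for entry_id, name, _desc in entries}

def suspicious_matches(matches):
    """Return items that attracted more than one entry from the same catalog."""
    grouped = defaultdict(list)
    for entry_id, qid in matches.items():
        if qid is not None:
            grouped[qid].append(entry_id)
    return {qid: ids for qid, ids in grouped.items() if len(ids) > 1}

# Hypothetical catalog entries: (entry id, name, description)
entries = [
    ("e1", "Inspiration Point", "peak, Wyoming"),
    ("e2", "Inspiration Point", "peak, California"),
    ("e3", "Inspiration Point", "cliff, Utah"),
]
items = [("Q14714736", "Inspiration Point")]

matches = auto_match_by_name(entries, items)
flagged = suspicious_matches(matches)
```

Here all three entries end up on the same item, so `flagged` lists them together for manual review.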

PKM (talkcontribs)

Oh, okay, I didn't realize they were not loaded until confirmed. I can deal with that. Thanks for the explanation!

Reply to "Poor automatic matches in TGN"
Jason6698 (talkcontribs)

It always returns "502 Bad Gateway". This has been happening for about a week already.

Magnus Manske (talkcontribs)

I just tried, works for me. There was an issue with the database replicas at the WMF, but that was apparently resolved days ago. What are you trying to do?

Jason6698 (talkcontribs)

When I enter any site name (e.g. zhwiki) in "Has none of these site links" on the "Wikidata" tab, it returns "502 Bad Gateway".

Magnus Manske (talkcontribs)

Thanks, found and fixed a bug, should work now.

Jason6698 (talkcontribs)

It still doesn't seem to be working properly. I tried a few cases involving zhwiki and zh_yuewiki, but they loaded for a very long time without giving results.

Magnus Manske (talkcontribs)

Could you give me a full set of parameters to test? As in, exactly what do you ask for? Just "stuff doesn't work" is never helpful.

Jason6698 (talkcontribs)

When I tried the same cases again today, the results were normal, with no "502 Bad Gateway" anymore. Thank you for fixing it. Maybe the server was just not very stable yesterday when I was trying.

Vladimir Alexiev (talkcontribs)

Maybe Quora topics belong to WD, but Quora questions don't. Just try this search https://tools.wmflabs.org/mix-n-match/?#/search/balloon to see plenty of ad-hoc questions that can't possibly belong to WD. Here's one https://tools.wmflabs.org/mix-n-match/?#/entry/22531602 and even more, some are already deleted on Quora! https://www.quora.com/topic/When-a-Balloon-Is-Rubbed-with-a-Sweater-Which-Charge-Does-the-Balloon-Has

Somehow I also don't think MOMA paintings belong to MnM. What do others think?

Magnus Manske (talkcontribs)

How do I tell questions from topics? Do all questions have a "?"?

Is that actually a problem, since you can exclude catalogs from search?

Reply to "Remove Quora questions from MnM"

“named as“ vs “title” in database references

7
MisterSynergy (talkcontribs)

Hey Magnus

I have recently seen that User:Reinheitsgebot adds a lot of references of database type (according to Help:Sources#Databases) to existing claims. Very useful in general, however I am wondering why it uses named as (P1810) instead of title (P1476) to indicate the database entry title. The latter is suggested by the (somewhat official) help page, and users will likely look for it if they want to use the reference—but they will probably not look for the former.

“named as” is also defined as “qualifier only”, but this is not qualifier use.

I have taken notice of Property talk:P1810#Usable on database identifiers ? (@Jheald), but I cannot see that there was consensus to change this.

Magnus Manske (talkcontribs)

I remember there was a discussion about it, which resulted in me switching to named as (P1810), but I can't find it right now. Come to think of it, stated as (P1932) says "use as qualifier to indicate how the value was spelled or printed in the source", which seems even better suited. I don't really have a favourite here... What say you?

Hsarrazin (talkcontribs)

for bibliographic records, I would much prefer P1932, thank you... :))

Jheald (talkcontribs)

P1932 is a qualifier to indicate how the relationship or the value in the statement was originally stated.

P1810 should be used to indicate how the subject of the statement was originally stated.

This is the difference between the two. Some statements (eg film credits) can have both.

MisterSynergy (talkcontribs)

… and I can meanwhile remember why the title (P1476) property was probably chosen to be used in those references:

In most cases, Wikidata items and external database entries are mapped on a 1:1 basis. With that approach, one could simply use the label as a title of the reference link if used in Wikipedias, and omit any title qualifiers in the reference. The exact spelling of the database entry title does not really hold much information beyond the Wikidata item label.

However, there is no requirement for a 1:1 mapping, and database entries can for instance contain information about the Wikidata item, although being about a different (yet related) concept. IIRC there was an example of chemical substances given in the past, which can be classified according to different rules in a way that multiple Wikidata items are described by the same external database entry.

Anyway, with that approach the string we are discussing tells us what the external database entry is actually about, which is different information from a plain Wikidata item label. P1810 and P1932 do not seem to be fully suitable in that case. However, I can’t tell how rare such cases are.

MisterSynergy (talkcontribs)

Actually I don’t really have a preference, and I am explicitly not really happy with title (P1476) anyway due to the monolingualtext data type. The database language should be defined in the corresponding database item and pulled from there.

However, what bothers me is the fact that this is a non-standard format at the moment, and Wikipedias will likely not be able to deal with it properly and thus fail to display a useful database entry title for references using named as (P1810). There are some modules out there which check Wikidata references, filter out things such as “imported from: some Wikipedia”, and then figure out how to display the reference based on source types defined on Help:Sources.

stated as (P1932) (with string data type) does indeed look good to me (also “best”? — not sure right now), but we should make sure

  1. that there is consensus to list it instead of title (P1476) at Help:Sources#Databases,
  2. that we have a repair job for the old references, and
  3. that Wikipedias are informed of this significant change so they can update their modules.

Not sure what to do now, maybe start a discussion at Wikidata:Project chat or Help talk:Sources. Maybe Jheald, already pinged above, and all other interested users can also comment here.

Magnus Manske (talkcontribs)

Well, either way there will be a bot cleanup job (which I might end up writing/running), so I'll just continue as it is now, and wait for consensus. Please keep me informed.

Reply to "“named as“ vs “title” in database references"

wrong data imported to Wikidata for birth/death values

30
Nono314 (talkcontribs)

Hi Magnus,

I'm sorry to report that Reinheitsgebot seems to be importing some wrong data into Wikidata for birth/death values and their references.

For example, there seem to be some severe parsing issues when you import this from that...

I also thought you would implement some basic consistency check to avoid things like this that are bound to raise alerts.

I really hope you can fix your script soon enough to avoid too many wrong data that will need to be cleaned up.

And still thanks for fueling so many ideas into the community!

Jc3s5h (talkcontribs)

Make the bot get Julian and Gregorian dates correct or stop the bot.

Hsarrazin (talkcontribs)

Indeed, I corrected quite a batch of wrong P569 added from RKD deceased dates ? a mapping problem perhaps ?

Magnus Manske (talkcontribs)

The bot should ignore all dates marked as Julian on Wikidata.

I will look into the RKD issue.

Magnus Manske (talkcontribs)

Found a subtle bug in the date parsing, fixed now. Restarted date/reference import.

Nono314 (talkcontribs)

So now the right ones are added, but the wrong ones still need to be manually cleaned up? I guess that's most of the latest upticks here?

Magnus Manske (talkcontribs)

Both birth and death should be the same year (at least), and both should have RKDartists as the source. Here is the list, it's 120 items at the moment. I'll take care of them.
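The pattern described here (birth and death in the same year, both sourced to the same catalog) can be turned into a simple filter, alongside the kind of basic consistency check asked for at the start of this thread. This is only a sketch under assumptions: the record layout, the function names and the 120-year lifespan limit are hypothetical, and it is not what the bot actually does.

```python
def looks_like_swapped_import(record, source="RKDartists"):
    """True if birth and death share a year and both are sourced to `source`."""
    birth = record.get("birth_year")
    death = record.get("death_year")
    if birth is None or death is None:
        return False
    same_source = (
        record.get("birth_source") == source
        and record.get("death_source") == source
    )
    return same_source and birth == death

def basic_consistency_problems(record, max_lifespan=120):
    """Return a list of human-readable problems; empty means nothing obvious."""
    problems = []
    birth = record.get("birth_year")
    death = record.get("death_year")
    if birth is not None and death is not None:
        if death < birth:
            problems.append("death before birth")
        elif death - birth > max_lifespan:  # assumed threshold
            problems.append("implausible lifespan")
    return problems

# Hypothetical records, for illustration only
suspect = {"birth_year": 1650, "death_year": 1650,
           "birth_source": "RKDartists", "death_source": "RKDartists"}
fine = {"birth_year": 1600, "death_year": 1650,
        "birth_source": "RKDartists", "death_source": "RKDartists"}
```

Records flagged by either function would still need a human to look at the source before anything is removed.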

Nono314 (talkcontribs)

That's the general idea yes. Thanks for having fixed a bunch of them! There are however some more where the right one was already there and has no stated in (P248) in ref (see eg Cristoforo Zavattari (Q3002905) imported some time ago by Multichill)

Magnus Manske (talkcontribs)

I have manually worked on some on the P569=P570 list, but at some point, it ceases to be a "big bad bot booboo" and just fades into the background of daily maintenance.

Nono314 (talkcontribs)

Sure, manual work can be shared if needed.

But given the surge in that report correlated with your bot run there was a good chance there were more left. And indeed I think I've unearthed another case of wrong parsing, with Benezit this time. Going down the list I found this imported from there, and the root cause obviously sits there. (Looks like you probably found it too in the meantime, given this one is now fixed)

Jc3s5h (talkcontribs)

I think it's very unlikely that a date in 1578 is really a Gregorian calendar date.

Hsarrazin (talkcontribs)

I noticed that all items where the death date was added as the birth date had an "imprecise" birth date on RKD... was that it?

Magnus Manske (talkcontribs)

Not sure. I tweaked the regular expression, that fixed the problem in all examples (quite a few) I looked at.

Jc3s5h (talkcontribs)

"The bot should ignore all dates marked as Julian on Wikidata" misses the point. You must have proof of the conventions that a source uses for dates before assuming that any date before 1 March 1923 is Gregorian. (That's the date Greece switched from Julian to Gregorian.) Unless you have proof that a date before 1 March 1923 is Gregorian you should not enter it into Wikidata, nor should you add a reference to a date that already exists in Wikidata, because there is a strong possibility that the calendar assigned to the existing date in Wikidata is false.

Magnus Manske (talkcontribs)

There is no way to do that for me, as you probably know all too well.

Jc3s5h (talkcontribs)

If you cannot verify the calendar of dates, you should not add them. If you continue to do so I will have to seek an administrator to resolve the problem.

Magnus Manske (talkcontribs)

Do you have an example where a Julian date was added as Gregorian, by the bot? More than one example, if possible, that could help indicate "bad" sources.

Jc3s5h (talkcontribs)

In these edits dates were imported from Encyclopedia Britannica Online for Edward Coke, an English judge who was born before the Gregorian calendar was created and died after it was adopted in Catholic countries, but before it was adopted in England.

In this edit a date of birth was added for Edward Nicholas from SNAC. The item already had birth and death dates sourced to the Oxford Dictionary of National Biography. Both the bot and the editor who added the values from Dictionary of National Biography failed to recognize that the values were stated in the Julian calendar because that was the calendar in force in England at the time Edward Nicholas was born.

This series of edits adds sources in support of existing birth dates and death dates for Oliver Cromwell that are purported to be Gregorian calendar dates, but are really Julian calendar dates. (It turns out that when the birth and death dates were added back in 2013, the dates were reversed; the birth date was given as the death date and vice versa. Bots should check that the date not merely appears in the item, but is actually the kind of date it is supposed to be.)

Your concept of "bad" source is bad. All the good sources I've found, such as Encyclopedia Britannica and Dictionary of National Biography, state dates in the calendar used at the time and place where the birth or death occurred. I have verified this practice by comparing dates in these publications with contemporaneous dates from gravestone pictures, online images of birth or death registers that were written near the time of the birth or death, contemporaneous newspaper accounts, and the like. The only change these good sources make is to always treat January 1 as the beginning of the year, even though England treated March 25 as the beginning of the year until 1752, and other countries had other dates for the beginning of the year.

If you are not willing or able to take these factors into consideration, just recognize dates before 1 March 1923 and skip them.

Magnus Manske (talkcontribs)

The sources I imported from, including the EB online article in your example, do not specify the calendar. I can therefore not verify automatically which calendar is used. If EB online is of substandard quality for you, maybe the issue is your expectations...

I am uncertain what you want to tell me with the Edward Nicholas and Cromwell examples; in both instances, the bot added references, not the dates. And you left the references for Cromwell in place during your edits, so I assume they are still OK? But the SNAC in Edward Nicholas were not?

Skipping all dates before 1923 would mean skipping most of them, so I would rather not do that. As much as I like data paranoia, this seems a bit excessive. Is there an earlier year I could use as a cutoff? 1582 seems like a good one. That won't catch all of them, but maybe most of them?

I believe there is already a validation drive for Julian/Gregorian dates under way. Those will eventually fix any mistakes imported from other sources. Is there a qualifier I could add to help with that? Something that says "needs checking"?

Jc3s5h (talkcontribs)

EB online is not of substandard quality. But it is written for people to read, not bots. EB online, the Dictionary of National Biography, and other quality sources place upon the reader the responsibility of understanding what calendar was in effect at the time and place of an event, and interpreting the date accordingly. A bot that cannot fulfill that responsibility is functionally illiterate and should not be allowed to read quality sources.

In the Edward Nicholas and Cromwell examples I looked at the sources. Being a literate human being, I interpreted the dates as being Julian calendar dates, and changed the Wikidata calendars from Gregorian to Julian. Similarly, I corrected the entries for Cromwell by swapping the birth and death dates, and changing the calendars from Gregorian to Julian. (I was misreading the dates, they were not swapped after all.)

For Edward Nicholas, I hadn't seen enough SNAC examples to infer what their date policy was, so I just deleted it. I could see that there was a perfectly adequate reference to the Dictionary of National Biography, but that whoever had used that source did not understand their date convention, so the calendar would have to be corrected. Now that I've seen more SNAC examples, I infer their date convention is similar to Dictionary of National Biography.

As for an earlier cutoff date, you might be able to consider the source. If you know the source only describes people from a certain country, you could use the cutoff date for that country.

Using 1583 would be a bad choice. Large numbers of items come from either the English or Russian Wikipedias, which in turn concentrate on people from those countries. Russia adopted the Gregorian calendar on 14 February 1918 and England and its colonies adopted it on 14 September 1752.
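The cutoff idea discussed above can be sketched as a lookup of adoption dates per country, falling back to the most conservative cutoff when the country is unknown. The three dates are the ones quoted in this thread; the table layout and function name are assumptions for illustration, and `datetime.date` is used only to compare dates.

```python
from datetime import date

# Gregorian adoption dates as quoted in this thread
GREGORIAN_ADOPTION = {
    "England": date(1752, 9, 14),
    "Russia": date(1918, 2, 14),
    "Greece": date(1923, 3, 1),
}

# Most conservative fallback mentioned in the thread
LATEST_ADOPTION = date(1923, 3, 1)

def safe_to_import_as_gregorian(event_date, country=None):
    """True only if the date falls on or after the relevant adoption date.

    With no known country, anything before 1 March 1923 is skipped.
    """
    cutoff = GREGORIAN_ADOPTION.get(country, LATEST_ADOPTION)
    return event_date >= cutoff

# A date printed as 3 September 1658 for an English person is not safe to import as Gregorian
cromwell_death_as_printed = date(1658, 9, 3)
```

Note that this only decides whether to skip a date; it says nothing about which calendar a skipped date is actually in.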

Magnus Manske (talkcontribs)

Just for numbers, Reinheitsgebot has created 1176 dates (not references to existing dates!) at the time of me writing this. All of them will be marked "Gregorian". 72 of those are before 1582:

Q2444899,Q41262585,Q41272160,Q2444899,Q6236001,Q41272160,Q4998007,Q9066316,Q5576608,Q21542943,Q14198216,Q19569809,Q41272022,Q9070303,Q22235765,Q2216917,Q41342131,Q19569806,Q17394787,Q1617909,Q41257656,Q41271435,Q16836094,Q41265158,Q41272022,Q3183654,Q29422181,Q41263297,Q6836655,Q29436248,Q19569688,Q22969354,Q42178195,Q21055305,Q41262064,Q41265883,Q41308737,Q2216917,Q5880200,Q41271917,Q29436256,Q41339120,Q22668380,Q41342131,Q2998233,Q41262647,Q29436235,Q41265158,Q21544170,Q41342967,Q41304920,Q29436224,Q21455299,Q29436291,Q7288287,Q7288287,Q29421989,Q41301222,Q1610325,Q7789898,Q2370244,Q16661285,Q21714921,Q29436084,Q6836655,Q41269201,Q3140456,Q6544482,Q21637076,Q41261860,Q18608036,Q41341551

Hsarrazin (talkcontribs)

then could it be inferred that reliable English sources use 1753 as the Julian/Gregorian cutoff (with special attention to dates between 14 September 1752 and 1/1/1753), and similarly for Russian sources?

Jc3s5h (talkcontribs)

I think the point is not that the source is written in the English or Russian language, but rather the nationality of the person being described by the source. If the person is English, it is probable that the person was born and died in England, and thus, the dates will be given in the Julian calendar for dates before 1753. I would expect the source to use the calendar for the time and place of the event, so if an Englishman were born in France in 1600 I would expect the date to be given in the Gregorian calendar.

Hsarrazin (talkcontribs)

the main problem being... you can only "expect", because, before wikidata tried to differentiate calendars on each and every item... nobody really bothered to do so...

and so, absolutely no source is ever able to tell for sure you which calendar is used on any person between 1582 and 1923 !

For millions of people across the world, the dates used and given for birth/death in various catalogs, biographies, authoritative works, databases, have been computed, at a time, once, by a researcher (sometimes)... and after that, dates have been propagated and admitted "as is"...

Now, wikidata is just putting the foot in the middle of quicksand, and asking everybody to only give "certain dates" and always give calendar... and do NOT make mistakes !

and I fail to see how this can even be achievable :(

Jc3s5h (talkcontribs)

"Absolutely no source is ever able to tell for sure you which calendar is used on any person between 1582 and 1923" isn't quite true. Primary sources such as grave markings made near the time the person was buried, birth and death certificates, and contemporaneous newspaper accounts eliminate any uncertainty about the calendar, so long as the reader can establish which calendar was in effect. That's pretty easy for countries that switched all at once, not so easy for places like Germany, which, at the time of the conversion, was a bunch of principalities and duchies.

If the uncertainty about the calendar cannot be resolved, the precision could be reduced. Dates near the middle of the month could have the precision reduced to month. Dates near the beginning or end of the month could have the precision reduced to year. To be really thorough, one could give the date verbatim with the earliest date (P1319) qualifier, marked as Gregorian calendar. If it's actually a Julian calendar date, the Gregorian date would be later (for dates in or after the 1st millennium) so do a Julian to Gregorian conversion and put the result as the value of latest date (P1326), also marked as a Gregorian date.
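The earliest/latest bounds described above can be sketched as follows. The conversion goes through the Julian Day Number using the standard integer-arithmetic formulas; the function names are hypothetical, and the sketch assumes a day-precision date in a period where the Gregorian date is the later one.

```python
def julian_to_jdn(year, month, day):
    """Julian Day Number of a date given in the Julian calendar."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083

def jdn_to_gregorian(jdn):
    """Gregorian calendar date (year, month, day) for a Julian Day Number."""
    a = jdn + 32044
    b = (4 * a + 3) // 146097
    c = a - 146097 * b // 4
    d = (4 * c + 3) // 1461
    e = c - 1461 * d // 4
    m = (5 * e + 2) // 153
    day = e - (153 * m + 2) // 5 + 1
    month = m + 3 - 12 * (m // 10)
    year = 100 * b + d - 4800 + m // 10
    return year, month, day

def julian_to_gregorian(year, month, day):
    """Convert a Julian calendar date to the equivalent Gregorian date."""
    return jdn_to_gregorian(julian_to_jdn(year, month, day))

def uncertain_calendar_bounds(year, month, day):
    """Bounds for a date whose calendar is unknown.

    earliest: the date read verbatim as Gregorian (for P1319)
    latest:   the date read as Julian, converted to Gregorian (for P1326)
    """
    return (year, month, day), julian_to_gregorian(year, month, day)

earliest, latest = uncertain_calendar_bounds(1658, 9, 3)
```

For a date printed as 3 September 1658 this gives 3 September as the earliest and 13 September as the latest Gregorian date.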

Hsarrazin (talkcontribs)

I know about primary sources... I meant no biographic dictionary, database, etc.

do you mean wikidata will have to check every single date for three and a half centuries? or reduce precision to year, without further exploration, for each and every person...

people I'm interested in (public domain authors) are 95% in this time period :((

Jc3s5h (talkcontribs)

Yes, all those dates need to be checked. The people who designed the Wikidata data structure and supporting software deserve to be life members of the calendar hall of shame, together with the authors of ISO 8601 and XSD 1.0. When a data model is poorly thought out, massive repetition of work is inevitable.

Magnus Manske (talkcontribs)

I changed the bot to add an instance of (P31):statement with Gregorian date earlier than 1584 (Q26961029) qualifier for dates before 1584, as in Q2444899. That appears to be the tag I was looking for.

Jc3s5h (talkcontribs)

The purpose of statement with Gregorian date earlier than 1584 (Q26961029) was for a one-time run of a bot to mark suspicious dates after Wikidata added support for the Julian calendar; see Phabricator task 105100. In the comment at Sep 26 2016, 06:00 it's explained that the run was to be done only once, because adding dates in the wrong calendar after the bot was run should be regarded as errors on the part of editors rather than limitations of the Wikidata software.

This property should not be added any more. Rather, dates that were marked during the bot run should be examined, and the sources read, to determine the correct date. Since it isn't possible to do a conversion on a date with less than day precision, there is no reasonable alternative to regarding a pre-1582 date written in a source with just a year, or just a year and month, as Julian. Since most quality sources use the calendar in force at the time and place of the event, if the date in Wikidata matches the date in the source, but the Wikidata calendar is Gregorian and the calendar for the source is not explicitly stated, it's probable that it's a Julian calendar date and marking it as Gregorian in Wikidata was an error.

Magnus Manske (talkcontribs)

OK, I am not adding the qualifier any more. A SPARQL query will do the same.
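For illustration, a query of roughly that kind could look like the sketch below. It is an assumption, not the query referred to above: it builds a SPARQL string for the Wikidata Query Service that lists statements of a date property whose value is marked as proleptic Gregorian (Q1985727) and falls before a given year. The function names and defaults are hypothetical.

```python
from urllib.parse import urlencode

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

def early_gregorian_query(prop="P569", before_year=1584, limit=100):
    """Build a SPARQL query for Gregorian-marked dates before `before_year`."""
    return f"""
SELECT ?item ?date WHERE {{
  ?item p:{prop}/psv:{prop} ?value .
  ?value wikibase:timeValue ?date ;
         wikibase:timeCalendarModel wd:Q1985727 .
  FILTER(?date < "{before_year:04d}-01-01T00:00:00Z"^^xsd:dateTime)
}}
LIMIT {limit}
""".strip()

def query_url(query):
    """URL that would run the query and ask for JSON results (no request is sent here)."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})

query = early_gregorian_query()
url = query_url(query)
```

The same function with `prop="P570"` would cover dates of death.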

Reply to "wrong data imported to Wikidata for birth/death values"
Edgars2007 (talkcontribs)

Hi!

When viewing this link (or any other D. link), having clicked through from Petscan (although that doesn't matter), I always get redirected to the lvwiki mobile version of that article.

Maybe you could take a look?

Sjoerddebruin (talkcontribs)

It's a small screen, it was designed to show the mobile interface.

Reply to "Duplicity bug?"

Reinheitsgebot and poor VIAF additions

2
Billinghurst (talkcontribs)

Hi MM. I am doing my daily cleanup of VIAF and I see that the bot is adding data to the wrong people, though people with closely matching details. The VIAF data added already exists against the correct person. An example

from VIAF itself I am unsure how any person could make a match, however, the addition by the bot is clearly wrong. Thanks for looking into this.

Magnus Manske (talkcontribs)

Hi, I have added another layer of checks now, that should stop any such wrong additions.

Jura1 (talkcontribs)

Hi Magnus,

Somehow the check on other items using the same identifier didn't work at and the same identifiers got added to another item (the item is for one of two persons who seem to get mixed up everywhere, including most library catalogues and dewiki: Talk:Q1064802).

Magnus Manske (talkcontribs)

I put in another check now, just in case...
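A check of the kind discussed in this thread, refusing to add an identifier that another item already carries, might look like the sketch below. It is not the bot's code: the function name, the data layout and the item and identifier values are hypothetical placeholders.

```python
def safe_to_add_identifier(identifier, target_item, existing):
    """Decide whether an external identifier may be added to `target_item`.

    `existing` maps identifier value -> set of item ids that already carry it.
    Adding is allowed only if no other item uses the same value.
    """
    holders = existing.get(identifier, set())
    return not holders or holders == {target_item}

# Hypothetical state: identifier "12345" already sits on item "Q_A"
existing_ids = {"12345": {"Q_A"}}

add_to_other_item = safe_to_add_identifier("12345", "Q_B", existing_ids)
add_to_same_item = safe_to_add_identifier("12345", "Q_A", existing_ids)
add_new_value = safe_to_add_identifier("67890", "Q_B", existing_ids)
```

Rejected cases could be logged for manual review, since a shared identifier may also mean the two items are duplicates or that the catalog itself conflates two people.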

upload the Hungarian people's Authority control records at Petőfi Literary Museum into Wikidata.

3
Texaner (talkcontribs)

Hi Magnus

I have two questions for you:

– I am working on a project that is big for me: uploading the Hungarian people's authority control records from the Petőfi Literary Museum into Wikidata. The Museum has 252 000 records. I have selected all Hungarian people from Wikidata (about 8000), and I am now matching this data and writing a csv file for import. I have made a “Data import request” but got no response. Will the data import be fully automatic once I have uploaded my csv file somewhere, or not? Do I need to make some further special request?

– I see that your Reinheitsgebot bot is regularly changing the OSZK-id (P951) to the VIAF id. The Hungarian National Library authority id is on pages [http://nektar.oszk.hu/auth/…]. I think it is absolutely senseless to have the same VIAF identifier in two different properties for one person! The OSZK-id (P951) property should contain the correct identifier from the Hungarian National Library, not the VIAF id. I hope that after uploading the PIM-id (P3973) I will have the opportunity to do the same work for the Hungarian National Library too and upload the correct ids.

Magnus Manske (talkcontribs)

Hi Texaner,

  • I actually don't know a lot about the formal process to import into Wikidata. Where did you make a request? It is usually discussed, and if people agree, someone should volunteer to do the import. I do have a separate tool that can be used to identify entries already in Wikidata, but if you have done so already, that won't be necessary.
  • My bot does not (or at least, should not) change P951 to VIAF. It can add P951 and add a reference to VIAF, because that's where I got the information from. Is that what you mean?
Texaner (talkcontribs)

* I made my request on the Data Import Hub page. My csv file has only the parsed Wikidata identifier Q…. and the PIM-id (P3973). I hope this is enough.

* I think it would be better not to add more VIAF ids to the P951 property, while all VIAF ids already there should be replaced with the original ids from the Hungarian National Library.

Reply to "upload the Hungarian people's Authority control records at Petőfi Literary Museum into Wikidata."

still bad Quickstatement edits

6
Summary by Masegand

They'll keep on coming

Masegand (talkcontribs)

they are with kindred ID

for example (Q234147) (Q317164)

Masegand (talkcontribs)

as well with hansard-ID for example (Q6221832) (Q7373509)

Magnus Manske (talkcontribs)

I checked your example for Q317164. The Kindred data I have in mix'n'match does say "died 2017". It is possible that the Kindred entry was matched to the wrong person, is that the case? Because we should remove the Kindred ID from Q234147, and the match from Mix'n'match. Same for Hansard. Or is the match correct, but Kindred/Hansard just have the wrong date?

Masegand (talkcontribs)

The source http://kindred.stanford.edu/#/kin/full/none/none/I26421// does not say "died 2017".

Masegand (talkcontribs)

Also 2 more with SNAC ID, in that case the external DB seems to have wrong data

Magnus Manske (talkcontribs)

I have fixed the Kindred "died 2017" ones (other thread on this page).

As for "upstream wrong", Wikidata just shows what other sources say, we don't decide what's true ;-) All I say is "person born/died THEN, according to THAT".