Topic on User talk:Magnus Manske

Summary by Jura1

Manual import for 2844:

Matching

  • some items corrected/completed (missing Latin script statements, description of Cyrillic script items not including spelling)
  • some identifiers added to Wikidata, then "manual sync" run
  • some new items created in Wikidata (label and description in several languages to avoid duplicates), then "manual sync" run
  • problem with automatic matching (useful for people, but not for name items)
  • some manually matched from MnM

Status

  • 34712 in MnM, 34623 in Wikidata
  • a few still need work (e.g. Cyrillic, Japanese, script statements, cr)
  • currently <100 unmatched
  • new features at Wikidata: talk pages of all family name items offer a few queries (not yet as detailed as given name items): click on any of the red or blue links to try
  • files will be made available regularly for manual import

Thanks to the members of the team at Digital Dictionary of Surnames in Germany (Q61889795) for bringing this to Wikidata, and to Magnus for doing the imports in MnM

Julian Jarosch (digicademy) (talkcontribs)

Hello Magnus,

we (@Ckubosch) tried out MnM and did a test upload: https://tools.wmflabs.org/mix-n-match/#/catalog/2844 Then we realised that we can’t update the catalog with more entries. So far, we have been able to upload 30959 entries. But our catalog grows every two weeks, so a one-time fix would only be a stopgap.

So our two main questions are:

  • Can you transfer catalog ownership to the shared account used by our institution, User:AdW_Mainz? (Later I can log in to that account and confirm that it’s really ours.) – Would this actually be useful, or do all accounts have the same permissions for every catalog?
  • Since there are regular updates to our database, and if there really is no easy possibility on our end to update the catalog (besides the regex-based scraper) – Would you be interested in regularly updating this catalog from a CSV file hosted on our website? We could make a full list and/or a list of the most recently added entries available, formatted as specified for your MnM import tool. Our updates are quite regular, e.g. loading the CSV on the fifth and the twentieth day of the month would be right most of the time.
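For illustration, writing such a list as a tab-separated file is straightforward; a minimal sketch, assuming a column layout of entry ID, name, description (the entries below are invented examples, and the actual layout should be checked against the MnM import tool's specification before uploading):

```python
import csv

# Hypothetical DFD entries: (entry ID, surname, short description).
entries = [
    ("12345", "Ackermann", "Familienname; Berufsname"),
    ("12346", "Ackermanns", "Familienname; patronymische Form zu Ackermann"),
]

# Write a tab-separated file in the assumed import layout:
# entry ID <TAB> name <TAB> description.
with open("alle.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")
    for row in entries:
        writer.writerow(row)
```

Hosting the resulting file at a stable URL would let the import side fetch it on a schedule.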

If it would be less work for you for us to start a new catalog with the right owner and the full current list of entries, and for you to simply delete 2844, let us know.

Sorry to bother you with a maintenance request! And I’m looking forward to hearing what you think of updating from CSVs.

Best regards!

Magnus Manske (talkcontribs)

I can update that automatically from the CSV, just give me the URL :-)

Magnus Manske (talkcontribs)

And the catalog now "belongs" to AdW_Mainz, but that doesn't mean much...

Jura1 (talkcontribs)

I ran one of the jobs on 1497: unmatched entries went down from 100,000 to 4,000. Not really a surprise, as the same census had been used here before. Presumably, one could easily have generated the links directly on Wikidata.

It seems that the fuzzy matching that helps for people isn't optimal for name items. The same probably applies for lexemes.

Ckubosch (talkcontribs)

Hello Magnus,

the complete list of the DFD's name articles is available at http://www.namenforschung.net/alle.csv. The name articles newly published every 14 days are available at http://www.namenforschung.net/neu.csv. Unfortunately, there is no routine yet that updates the lists automatically every two weeks. Once there is, we will get back to you.
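A sketch of how a scheduled fetch of these two lists might look; the buffer logic is an assumption, reflecting that the 14-day publication cycle may slip by a few days:

```python
import datetime
import urllib.request

FULL_LIST = "http://www.namenforschung.net/alle.csv"  # complete list
NEW_LIST = "http://www.namenforschung.net/neu.csv"    # entries from the last 14 days

def should_fetch(today, buffer_days=3):
    # Releases are aimed at the 1st and 15th of the month but can slip,
    # so fetch a few days after each nominal date.
    return today.day in (1 + buffer_days, 15 + buffer_days)

def fetch(url, dest):
    # Download one of the published CSV lists to a local file.
    urllib.request.urlretrieve(url, dest)

if should_fetch(datetime.date.today()):
    fetch(NEW_LIST, "neu.csv")
```

A cronjob running this daily would effectively import twice a month with a built-in buffer.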


Thank you very much for your help!


Best regards


@Julian Jarosch (digicademy) and Celine

Julian Jarosch (digicademy) (talkcontribs)

Hello Magnus,

a small addition to this: we do already update both lists regularly, but there is still a manual step in the workflow. For that reason, the update does not yet always happen exactly on the 1st and 15th of the month.

That means that, if you like, you could already test or set up the import into MnM – just, in case you use a cronjob, a few days of buffer would still be good for now.

Best regards!

Magnus Manske (talkcontribs)

Or maybe not. Let me revert that...

Magnus Manske (talkcontribs)

New attempt with the "precise" matcher...

Julian Jarosch (digicademy) (talkcontribs)

Thank you very much! For an exact comparison of the catalog lemmata with the native labels, we used an external workflow, and we will keep using it to add unambiguous matches via QuickStatements.
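For illustration, QuickStatements v1 commands for adding such unambiguous matches as external-ID statements are tab-separated triples; a sketch, where "P0000" is a placeholder for the catalog's actual external-ID property and the entry ID is invented:

```python
def quickstatements_line(qid, prop, external_id):
    # One QuickStatements v1 command: item <TAB> property <TAB> "string value".
    return f'{qid}\t{prop}\t"{external_id}"'

# Hypothetical example: link Q83338447 to a DFD entry ID.
# "P0000" is a placeholder -- substitute the real property.
print(quickstatements_line("Q83338447", "P0000", "12346"))
```

A batch of such lines can then be pasted into the QuickStatements tool in one go.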

Jura1 (talkcontribs)

It's mostly done. I will try to do some checks once the reports are updated (property constraints notably).

Jura1 (talkcontribs)

There are some 2067 left in "unmatched". I ran various MnM jobs, but these didn't get matched, despite items being available and there being no visible difference from the matched ones. Is there a way to improve matching for these? (There seem to be too many for manual matching, and even if we did that, we would have to double-check any new ones.)

Sample unmatched: "Ackermanns", item here: Ackermanns (Q83338447)
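One way to check such cases is to look the label up directly on Wikidata; a minimal sketch that only builds the query string (run it against the Wikidata Query Service yourself), using P31 (instance of) and Q101352 (family name):

```python
def family_name_query(label, lang="de"):
    # Build a SPARQL query for family-name items carrying the given label,
    # useful for checking why a catalog entry stays unmatched.
    return (
        "SELECT ?item WHERE {\n"
        "  ?item wdt:P31 wd:Q101352 ;\n"
        f'        rdfs:label "{label}"@{lang} .\n'
        "}"
    )

print(family_name_query("Ackermanns"))
```

If the query returns an item but MnM leaves the entry unmatched, the problem is in the matcher rather than in missing data.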

Julian Jarosch (digicademy) (talkcontribs)

I’ve started a manual sync from Wikidata to MnM. The number of unmatched entries has already decreased slightly.

Jura1 (talkcontribs)

Thanks. Looks like I missed a few batches... done that now.

BTW, when creating items manually through the tool, an English description gets added with the language code set to German.

Julian Jarosch (digicademy) (talkcontribs)

Ah, I suppose the description language corresponds to the catalog language set in MnM – which should probably stay “de”. Maybe we should change the “description” field in MnM to »Familienname«.

Jura1 (talkcontribs)

I think it's mostly done now.

Jura1 (talkcontribs)

I tried clicking "refresh" to update the catalogue from the website, but that doesn't seem to work.

From , it seems there should be 34274 entries. We have 32723 in MnM (some 1,500 fewer), and some 500 are only on Wikidata.

Magnus Manske (talkcontribs)

"Refresh" only updates the matching stats in case they get out of date. It does not import new entries from the source.

Magnus Manske (talkcontribs)

OK I am now using both "alle.csv" and "neu.csv". That yielded ~2000 new entries.