Topic on User talk:Magnus Manske


Suggestions/wishes/observations for Mix'n'match

Tpalonen (talkcontribs)

Hello, I am an information specialist at the National Library of Finland. I am using Mix’n’match to link the Finnish general ontology YSO to Wikidata. I have mainly used Preliminary matches and Mobile Match, and my remarks are based on these. Altogether, I think Mix’n’match is a very useful tool with many fine features. While most of my observations below are ideas for further development, I want to emphasize here that I do think it’s very important that such a tool exists and I very much appreciate all the work that has been put into it. Thank you very much for developing the tool!


WISHES / SUGGESTIONS IN BRIEF

-         Way to filter out non-relevant Wikidata item types (e.g. humans, organizations, Wikinews articles etc.)

-         Way to utilize all languages of the source dataset in matching

-         Set Q feature extended to Mobile Match and Visual Tool

-         A category or tag for difficult cases that could be browsed separately to ease the workflow

-         Case-insensitive matching

-         Match based only on the Wikidata item’s label and Also Known As labels (i.e. ignoring the description and other item information)

-         Mobile match extended to Wikipedia article name/headline and recognition of plural/singular form of name/headline

-         Mobile match restricted to Wikipedia article name/headline and the first paragraph of the article text

-         Mobile match: multi-word terms matched only when all the same words occur one after the other in the same order

-         Mobile match: disregarding words that are links


GENERAL REMARKS

-         YSO is a vocabulary maintained according to the ISO 25964 thesaurus standard. The key difference from Wikidata labels is that terms are written in the plural when they refer to countable things/objects (abandoned buildings, accidents, adults etc.). This applies to more than 11,000 YSO concepts, i.e. roughly one third of the vocabulary. Wikidata equivalents, on the other hand, are almost always in the singular.

o  Consequently, when linking is based on perfect label matches, this results in many unwanted link suggestions. The plural form of the term in Wikidata may refer to all kinds of entities (movie name, book name, band name etc), but very likely not to the same concept as the YSO term. However, YSO in fact has the singular forms of plural labels as hidden labels. These were not used when matching YSO to Wikidata. A new matching based on YSO’s singular forms would most likely bring better results. This is something we should perhaps look into at the National Library of Finland.

-         YSO is a general ontology that consists mostly of general concepts, not so much individuals/instances. Some exceptions exist (names of buildings, for example), but YSO does not include names of persons, places, organizations, or works of art/literature/music/cinema. A large share of Mix’n’match suggestions, on the other hand, have been instances, i.e. persons, places, organizations etc. Filtering these out would most likely bring more accurate matches. One way to do this could be a simple yes/no query when importing a dataset to Mix’n’match: 1) does your dataset include person names, 2) does your dataset include place names, and so on. Answering “no” to question 1 could then filter out all Wikidata items with the statement “instance of: human”, and so on.

o  In Mobile match, for example, there were cases where all ten suggested matches were instances, when the YSO concept was a general concept (e.g. YSO’s liikuntatapahtumat [eng. sports events]).

-         Mix’n’match (or at least Preliminary match) may suggest Wikinews articles, Wikidata categories, or Wikidata disambiguation pages as matches. These too could possibly be filtered out in the kind of query suggested above. Wikinews articles in particular are most likely not relevant as matches, unless the dataset itself contains similar news articles from the same time period.

-         YSO is a trilingual vocabulary with terms in Finnish, Swedish, and English. However, Mix’n’match only uses one language of the dataset, in YSO’s case Finnish. On the other hand, it seems to use data from any language on Wikidata’s side. When linking a multilingual vocabulary, results would most likely be better if several or all of the vocabulary’s languages were used. Wikidata items may not include terms in a particular language, and a semantically equivalent item might be missed if matching is based on one language only. Moreover, a label that matches in several languages would be a strong indicator of semantic equivalence.

o  Example: YSO’s tiedejournalismi [eng. science journalism] was not matched because the corresponding Wikidata item (Q1505283) did not have any term in Finnish even though it did have a matching term in both English and Swedish.

-         Preliminary match has the feature “Set Q” when a match has been removed. This is a great feature and would be a good addition to Mobile Match and Visual Tool as well, in my opinion.

-         From a workflow point of view, it would be a great feature if concepts could be tagged or moved out of the Unmatched / Matched division into a third category, “In process”, or something like that. In many cases, the Wikidata items are inaccurate and need some editing before they can be matched. Also, some cases are complicated and cannot be decided on the spot.
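The type filter and the multilingual matching suggested in the remarks above could be combined roughly as follows. This is only a minimal Python sketch, not a description of how Mix’n’match works internally: the dictionary layout, the function names, and the Swedish label are illustrative assumptions, and the class QIDs are quoted from memory and should be verified. Hidden singular labels are simply listed next to the plural label of the same language.

```python
# Classes to exclude for a dataset without persons, categories, news items etc.
# QIDs are quoted from memory; verify them on Wikidata before real use.
EXCLUDED_CLASSES = {
    "Q5",         # human
    "Q4167836",   # Wikimedia category
    "Q4167410",   # Wikimedia disambiguation page
    "Q17633526",  # Wikinews article
}


def folded(labels):
    """Case-fold labels so that 'Työhevonen' and 'työhevonen' compare equal."""
    return {label.casefold() for label in labels}


def language_hits(concept, item):
    """Count the languages in which a source label equals an item label.

    Items whose 'instance of' values include an excluded class score 0.
    """
    if EXCLUDED_CLASSES & set(item["instance_of"]):
        return 0
    return sum(
        1
        for lang, labels in concept["labels"].items()
        if folded(labels) & folded(item["labels"].get(lang, []))
    )


# Hypothetical data modelled on the examples in this post.
science = {"labels": {"fi": ["tiedejournalismi"],
                      "sv": ["vetenskapsjournalistik"],  # assumed Swedish label
                      "en": ["science journalism"]}}
q1505283 = {"instance_of": [],
            "labels": {"sv": ["vetenskapsjournalistik"],
                       "en": ["science journalism"]}}  # no Finnish label
news_item = {"instance_of": ["Q17633526"],
             "labels": {"fi": ["tiedejournalismi"]}}
horses = {"labels": {"fi": ["työhevoset", "työhevonen"]}}  # plural + hidden singular
horse_item = {"instance_of": [], "labels": {"fi": ["Työhevonen"]}}

print(language_hits(science, q1505283))   # matches in Swedish and English
print(language_hits(science, news_item))  # excluded by type
print(language_hits(horses, horse_item))  # matched via the hidden singular label
```

A score of two or more languages could be treated as a stronger indication of equivalence than a single-language hit; where to set that threshold is a design choice left open here.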


PRELIMINARY MATCH

-         Statistics based on a test set of 200 YSO concepts.

o  A first pass checked whether the suggested matches were worth considering at all or evidently referred to non-relevant content: 44% of preliminary matches were worth considering, 56% were not. Examples of suggestions not worth considering:

§  YSO: menot [eng. expenditures] > Wikidata: Lenaick Menot [a person]

§  YSO: taide-elämä [eng. arts life] > Wikidata: Taide pitkä, elämä lyhyt [a film, notice the separation of taide and elämä]

§ YSO: palestiinalaiset [eng. Palestinians] > Wikinews: Palestiinalaiset hakivat YK-jäsenyyttä [eng. Palestinians bid for UN membership]

§  YSO: väärennökset [Eng. forgeries] > Wikidata: Category:Literary forgeries

o  A more thorough review of the 200 matches gave the following results: 26% of matches confirmed, 7.5% skipped (needing further investigation or editing), 64.5% rejected. Of the 129 rejected matches (64.5% of 200), a new match was found and added manually for 53 concepts, leaving 76 concepts (38% of 200) unmatched altogether. Consequently, of the 105 matches made (52.5% of 200), 52 were suggested by Mix’n’match and 53 were added manually.

o  Statistics are based on the first/default Preliminary match suggestion for each entry, not the dropdown alternatives that are offered in some cases.

-         Observations:

o  Preliminary match doesn’t always prefer an exact label match even if such a match exists. Also, it may suggest matches when the source word appears only as part of the Wikidata item’s label. It’s possible that the use of capital letters on Wikidata causes some problems here as many Finnish labels start with a capital letter on Wikidata. Examples:

§ YSO: myrkyllisyys [Eng. toxicity] > Wikidata: akuutti myrkyllisyys [Eng. acute toxicity]. There is, however, also a Wikidata item Myrkyllisyys (Q274160).

§  YSO: välirauha [Eng. Interim peace] > Wikidata: 6. Prikaati (välirauha) [Eng. 6th Brigade (Interim peace)]. There is, however, also a Wikidata item Välirauha (Q5448808).

o  Preliminary matches based on a matching word found in Wikidata item’s description often don’t seem useful. Wikidata descriptions often don’t include the exact word for the item they are trying to describe. Example:

§  YSO: konsultointi [Eng. consulting] was matched to Pöyry, a Finnish consulting company, based on the word konsultointi appearing in the description.

o  Preliminary matches are sometimes based on Wikidata property P856 (official website). In YSO’s case, this will hardly bring any useful matches as the website URL can include all kinds of non-related words. Example:

§  YSO: kulttuuripalvelut [Eng. cultural services] was matched to K.H. Renlund Museum, a Finnish museum, based on the website’s rather long URL which included the word kulttuuripalvelut.

o  Excluding item description and property P856 from the matching material might bring more accurate results.
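The observations above (prefer exact label matches, ignore letter case, leave out the description and P856) can be sketched in a few lines of Python. This is a hedged illustration only: the function name `rank_candidates` and the item dictionaries are hypothetical, and the description text in the sample data is a placeholder rather than the real Wikidata description.

```python
import re


def rank_candidates(term, items):
    """Rank items by label/alias only: exact matches first, partial matches second.

    Descriptions and website URLs (P856) are deliberately never consulted,
    so items that mention the term only there are dropped.
    """
    needle = term.casefold()
    ranked = []
    for item in items:
        names = [n.casefold() for n in [item["label"], *item.get("aliases", [])]]
        if needle in names:
            ranked.append((0, item["label"]))  # whole label equals the term
        elif any(needle in re.findall(r"\w+", name) for name in names):
            ranked.append((1, item["label"]))  # term is one word of a longer label
    return [label for _, label in sorted(ranked)]


# Sample items modelled on the examples in this post.
items = [
    {"label": "akuutti myrkyllisyys"},
    {"label": "Myrkyllisyys"},
    {"label": "6. Prikaati (välirauha)"},
    {"label": "Välirauha"},
    {"label": "Pöyry", "description": "placeholder text containing konsultointi"},
]

print(rank_candidates("myrkyllisyys", items))
print(rank_candidates("välirauha", items))
print(rank_candidates("konsultointi", items))
```

With this ordering, Myrkyllisyys and Välirauha come before the longer labels that merely contain the term, and Pöyry is not suggested for konsultointi at all.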


MOBILE MATCH

-         Statistics based on a test set of 200 YSO concepts

o  17%: match confirmed, 83%: no match confirmed.

o  In other words, results with Preliminary matches were better (26%) than with Mobile Match despite the fact that Mobile match usually gives several options for each case.

o  In all the confirmed matches, the word on which the match was based occurred as the first word of the Wikipedia article. In half of the cases the word occurred more than once in the first paragraph of the Wikipedia article.

-         Observations

o  Mobile match seems to look for term matches in Wikipedia article texts. However, it doesn’t seem to match items based on Wikipedia article’s name/headline (or at least, it doesn’t recognize the singular/plural correspondence in a Wikipedia headline even though it does recognize plural/singular forms in article texts). Paying more attention to the Wikipedia article name/headline would quite possibly bring better results. Examples:

§  YSO’s työhevoset [workhorses, plural] was not matched with Wikipedia’s Työhevonen [singular]. YSO’s vaelluskalat [migratory fishes, plural] was not matched with Wikipedia’s Vaelluskala [singular].

o  In the case of compound words or multi-word terms, Mobile match seems to treat the words separately. It may either look for occurrences of one word only, or for all words individually (i.e. not attached to each other). For multi-word terms, it would most likely bring better results if matches were given only when all the words occur one after the other in the right order.

§  Example: YSO’s geneettiset tekijät [eng. genetic factors] was matched with Wikidata’s Epigeneettinen periytyminen [eng. Epigenetics] based on the text: ”Epigeneettisiin tekijöihin vaikuttavat monet ulkoiset tekijät. Esimerkiksi ravinto vaikuttaa geneettisessä materiaalissa tapahtuvien metylaatioiden määrään”. The word tekijät actually appears before the word geneettisessä [inflected form of geneettiset, which Mix’n’match correctly recognized as the same word].

§  Moreover, the phrase ”geneettiset tekijät” appears precisely so 17 times in Finnish Wikipedia, according to Google. Of the ten matches suggested by Mobile Match for YSO’s geneettiset tekijät, only one was one of these 17.

o  When matches are based on the occurrence of a word anywhere in the Wikipedia article, the connection to the source concept is often quite trivial, particularly when the word occurs after the first paragraph or after the Table of Contents. Results might be better if matches were restricted to the first word, or to occurrences in the first paragraph or before the Table of Contents.

§  YSO’s taidepedagogiikka [eng. art pedagogy] was matched with Wikipedia’s Dschinghis Khan [the band] because it was mentioned that one of the band members had studied art pedagogy.

o  A Wikipedia article usually includes words that are links to other articles. Matching based on any word in the Wikipedia article may, naturally, be based on such a link word. Obviously, this is somewhat misleading as these words point to another Wikipedia article. If there is a way to exclude link words from matches, it might improve the results.
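The three Mobile match suggestions above (whole phrases in the right order, first paragraph only, link words ignored) could look roughly like this in Python. It is a sketch under simplifying assumptions: the lead is taken to be the text before the first blank line, links are assumed to be in plain wikitext `[[...]]` form, and Finnish inflection is not handled at all. All function names and the sample sentences other than the quoted Epigeneettinen periytyminen text are made up for illustration.

```python
import re

# Wikitext links such as [[Target]] or [[Target|shown text]].
LINK = re.compile(r"\[\[[^\]]*\]\]")


def lead_paragraph(wikitext):
    """Return the text before the first blank line, treated here as the lead."""
    return wikitext.strip().split("\n\n", 1)[0]


def tokens(text):
    """Split text into case-folded word tokens."""
    return re.findall(r"\w+", text.casefold())


def phrase_in_lead(phrase, wikitext):
    """True if the words of the phrase occur consecutively and in order in the lead.

    Links are replaced by a sentinel token, so linked words never match and
    words on either side of a link are not treated as adjacent.
    """
    wanted = tokens(phrase)
    if not wanted:
        return False
    words = tokens(LINK.sub(" _link_ ", lead_paragraph(wikitext)))
    n = len(wanted)
    return any(words[i:i + n] == wanted for i in range(len(words) - n + 1))


quoted = ("Epigeneettisiin tekijöihin vaikuttavat monet ulkoiset tekijät. "
          "Esimerkiksi ravinto vaikuttaa geneettisessä materiaalissa "
          "tapahtuvien metylaatioiden määrään")
in_lead = "Geneettiset tekijät vaikuttavat moniin ominaisuuksiin."
only_link = "Katso myös [[geneettiset tekijät]]."
only_body = "Johdanto.\n\nGeneettiset tekijät mainitaan vasta myöhemmin."

print(phrase_in_lead("geneettiset tekijät", quoted))     # words not adjacent
print(phrase_in_lead("geneettiset tekijät", in_lead))    # exact phrase in lead
print(phrase_in_lead("geneettiset tekijät", only_link))  # phrase is a link
print(phrase_in_lead("geneettiset tekijät", only_body))  # phrase after the lead
```

A real implementation would still need the singular/plural and inflection handling that Mobile match already appears to have; the sketch only shows how the phrase, position, and link restrictions narrow the candidates.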
