Wikidata talk:Lexicographical data/Archive/2017/02

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.

A new Labs Tool to visually explore etymological relationships extracted from the English Wiktionary and an RDF database of etymological relationships

Hi all! I have developed a tool to visualize etymologies and thought it might be relevant to this discussion. Please check it out at tools.wmflabs.org/etytree. My work is funded by an IEG grant. Please leave your feedback here if you like.

A screenshot of the graph for the word "coffee"

For a first release, it is impressive how well the automatic extraction of etymological data works (with some bugs, of course), given that the etymological relationships are extracted from sentences. This is because etymology sections are written using well-defined standards. I would like to get some feedback about some difficulties I encountered while extracting data and about some ideas I have for new templates. I wrote some notes here. Please add your comments there if you have any. Some additional notes follow:

  1. In the future I would like to use the nicer demo I made a while ago. I could not use it here because there were loops between words that cannot fit in trees (in a tree, branches don't merge). Loops are conflicting etymologies. Many are due to simple inconsistencies that users can easily fix; others are genuinely conflicting etymologies and should be represented using multiple trees. I will work on this in a future release.
  2. Etymology sections rarely link to a word together with its sense/POS; they generally link only to the lemma. This is a problem for homographs, because they generally have different etymological trees, which get mixed up in the current implementation. See for example the discussion on the Etymology Scriptorium. It would be nice to have more precise links in etymology sections that point to the correct word.
  3. I am not plotting all derived words for now, to keep the visualization uncluttered.
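
To illustrate the first note: this is only a minimal sketch, not the tool's actual code. It assumes the extracted relationships are available as (ancestor, descendant) pairs, which is a hypothetical format, and the function name `find_conflicts` and the placeholder words are made up for the example. It reports the two situations that stop a graph from being drawn as a tree: words where branches merge (more than one direct ancestor) and directed loops.

```python
from collections import defaultdict


def find_conflicts(edges):
    """Report why a set of etymological links cannot be drawn as a tree.

    `edges` is an iterable of (ancestor, descendant) pairs. This input
    format is an assumption made for the sketch.
    Returns (merges, has_cycle):
      merges    - dict mapping a word to its sorted direct ancestors,
                  for every word that has more than one direct ancestor
      has_cycle - True if the links contain a directed loop
    """
    parents = defaultdict(set)
    children = defaultdict(set)
    nodes = set()
    for ancestor, descendant in edges:
        parents[descendant].add(ancestor)
        children[ancestor].add(descendant)
        nodes.update((ancestor, descendant))

    # In a tree every word has at most one direct ancestor.
    merges = {w: sorted(p) for w, p in parents.items() if len(p) > 1}

    # Kahn's algorithm: repeatedly remove words with no remaining ancestors.
    # If some words can never be removed, the links contain a loop.
    indegree = {n: len(parents.get(n, ())) for n in nodes}
    queue = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while queue:
        node = queue.pop()
        removed += 1
        for child in children.get(node, ()):
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    has_cycle = removed < len(nodes)

    return merges, has_cycle


# Placeholder words, not real etymologies.
diamond = [("w1", "w2"), ("w1", "w3"), ("w2", "w4"), ("w3", "w4")]
loop = [("x", "y"), ("y", "x")]
print(find_conflicts(diamond))
print(find_conflicts(loop))
```

A report like this could be used to list the entries whose etymology sections need a manual check before a tree layout is attempted.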

Looking forward to your comments! Epantaleo (talk) 18:37, 15 February 2017 (UTC)

Thank you for sharing @Epantaleo:! I'll share that with my colleagues who work on understanding and modeling lexicographic data. Lea Lacroix (WMDE) (talk) 08:38, 16 February 2017 (UTC)
@Epantaleo: why don't you use the Russian Wiktionary? We have strictly structured articles, which gives the following advantages: 1) you will not mix different homonyms (homographs) with each other; 2) etymological relationships are expressed with etymology templates that have a tree-like structure, which allows such trees to be built automatically. And we have a comparable number of items (if you don't count units in different conjugations and declensions). --Infovarius (talk) 16:37, 17 February 2017 (UTC)
@Infovarius: Thanks! That's a good idea. The project is over now, although I would like to continue it somehow to make it more stable. If I do, I can run some tests and see how hard it would be to also incorporate the Russian Wiktionary. Epantaleo (talk) 01:05, 18 February 2017 (UTC)