Wikidata:Text corpora to lexicographical data

Recording of the 30 Lexic-o-days session on Leveraging text corpora for curating lexicographical data.

About

This page collects follow-ups to the 30 Lexic-o-days session on leveraging text corpora for curating lexicographical data.

Summary

Wikimedia projects provide text corpora that contain millions of words, phrases and other lexicographical elements that are not yet curated in terms of lexemes, forms and senses on Wikidata. How can we pave the way to get there? How can we build workflows that scale? How can we integrate other minable corpora — e.g. the open-access literature — into such workflows?
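
As a concrete starting point, the minimal sketch below shows one way such a workflow could begin: tokenize a text, then ask the public Wikidata Query Service which words do not yet exist as English lexeme lemmas. The SPARQL pattern (dct:language / wikibase:lemma) follows the standard lexeme data model on WDQS; the tokenizer is deliberately naive, and querying one word at a time is only workable for small samples (a scaled workflow would batch words with VALUES and work from dumps).

```python
"""Minimal sketch: find words in a text that are not yet covered
as English lexeme lemmas on Wikidata, via the public WDQS endpoint."""
import re
from collections import Counter

import requests

WDQS = "https://query.wikidata.org/sparql"
ENGLISH = "Q1860"  # Wikidata item for the English language


def tokenize(text):
    """Lowercase alphabetic tokens; a real workflow needs a proper tokenizer."""
    return re.findall(r"[a-z]+", text.lower())


def has_lexeme(word, language=ENGLISH):
    """True if some lexeme with this lemma exists for the given language."""
    query = f"""
    ASK {{
      ?lexeme dct:language wd:{language} ;
              wikibase:lemma "{word}"@en .
    }}
    """
    r = requests.get(
        WDQS,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "lexeme-coverage-sketch/0.1"},
    )
    r.raise_for_status()
    return r.json()["boolean"]


if __name__ == "__main__":
    corpus = "The quick brown fox jumps over the lazy dog."
    freq = Counter(tokenize(corpus))
    missing = {w: n for w, n in freq.items() if not has_lexeme(w)}
    # Most frequent missing words first: the best candidates for new lexemes.
    for word, count in sorted(missing.items(), key=lambda kv: -kv[1]):
        print(count, word)
```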

Participants

Notes of the meeting on April 8th, 2021

Question from Hogü-456: What about the license? Is it allowed to extract words from a text, for example from Wikipedia, and use these words as lexemes?

    • Answer: A single word is not a legal problem. Sentences (starting from roughly 7 words) can be copyrighted.

Tools:

Lexeme coverage: https://www.wikidata.org/wiki/Wikidata:Lexicographical_coverage

  • Explanation by Denny: this tool takes the whole text of Wikipedia (or another source) and compares it to the existing lexemes.
  • There is also a list of missing words sorted by frequency; working on these highly frequent words can dramatically increase the coverage (see the sketch below).
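
To make the frequency argument concrete, here is a toy illustration of the coverage computation described above: token-level coverage of a corpus against a set of known forms, plus the gain from covering just the most frequent missing words. The known-form set here is a hypothetical stand-in; the actual tool works from the full lexeme data.

```python
"""Toy illustration: corpus coverage against a (stand-in) set of known
lexeme forms, and the coverage gain from the top missing words."""
import re
from collections import Counter


def coverage_report(text, known_forms, top_k=3):
    tokens = re.findall(r"[a-z]+", text.lower())
    freq = Counter(tokens)
    total = sum(freq.values())
    covered = sum(n for w, n in freq.items() if w in known_forms)
    missing = Counter({w: n for w, n in freq.items() if w not in known_forms})
    print(f"coverage: {covered}/{total} tokens = {covered / total:.1%}")
    # Covering only the most frequent missing words gives the biggest gain,
    # because each new lexeme accounts for many corpus tokens at once.
    gain = sum(n for _, n in missing.most_common(top_k))
    print(f"after adding top {top_k} missing words: {(covered + gain) / total:.1%}")
    for word, count in missing.most_common(top_k):
        print(f"  {count:>3}  {word}")


if __name__ == "__main__":
    known = {"the", "a", "of", "and", "to"}  # hypothetical known forms
    sample = "the cat and the dog chased the ball to the park and the pond"
    coverage_report(sample, known)
```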

Sense relations: https://www.wikidata.org/wiki/Wikidata:Lexicographical_data/Statistics/sense_relation_counts

Annotations: https://annotation.wmcloud.org/

Grammatical Framework: https://www.grammaticalframework.org/

Universal Dependencies: https://universaldependencies.org/