
Wikidata:WikiProject LD4 Wikidata Affinity Group/Affinity Group Calls/Second Project Series



Project background


In November and December of 2024, we are continuing to pilot a new format that combines our previous calls and working hours. In this second series, Mahir Morshed (Q89633049) will lead us through four sessions focused on lexicographical data.

Series sessions


Session 1 (November 5) - Intro to lexicographical data, what can be done with it and what can be done to improve it

Notes

Basics:

  • Lexemes are "items" for elements of language: its fundamental, distinct units. They can be words, compounds, suffixes, phrases, or (proverbial) sentences
  • The Lexicographical Data Glossary (linked above) has more definitions of useful terms
  • As with other Wikimedia statements, it is best to support assertions with references
  • Mahir's presentation slides are arranged as a grid: columns can be navigated in different directions
  • Question: What if a word has several different meanings?
    • Scroll down to the "senses" section of the lexeme.
      • Each lexeme can have multiple meanings
      • Each sense is a separate meaning of the lexeme
      • Definitions in different languages do not constitute different meanings. They are instead separate glosses on the same meaning
    • Gloss: A separate string in a different language on the same meaning
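
The relationship between lexemes, senses, and glosses described above can be sketched as a small data structure. This is only an illustration: the class names, the field names, and the "mouse" example are made up for this sketch and are not Wikidata's actual data model or content.

```python
from dataclasses import dataclass, field


@dataclass
class Sense:
    # One meaning of the lexeme. The glosses map a language code to a
    # definition string; every gloss here describes the SAME meaning.
    glosses: dict[str, str] = field(default_factory=dict)


@dataclass
class Lexeme:
    lemma: str             # headword / citation form
    language: str          # language of the lexeme (plain label in this sketch)
    lexical_category: str  # part of speech
    senses: list[Sense] = field(default_factory=list)


def count_meanings(lexeme: Lexeme) -> int:
    """Distinct meanings are counted by senses, not by glosses."""
    return len(lexeme.senses)


# Hypothetical example data, written for this sketch only.
mouse = Lexeme(
    lemma="mouse",
    language="English",
    lexical_category="noun",
    senses=[
        Sense(glosses={"en": "small rodent", "fr": "petit rongeur"}),
        Sense(glosses={"en": "hand-held computer pointing device"}),
    ],
)

print(count_meanings(mouse))
```

The first sense carries two glosses (English and French), but it still counts as one meaning; the second sense is a separate meaning.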

What can Lexicographical Data Do?

  • It serves as a Monolingual Dictionary
    • You can search for lexemes in Wikidata using the "L:" prefix
    • Tools for searching: Hangor (select a language, then type a query into the search box, and it presents the results); Ordia works in a similar way
    • Question: Can you search for Dagbani? (Yes, demonstrated)
    • Question: This is real cool, but I am not sure how this data is used in production? (This is a good question, but it's one for another session)
    • To check that you are looking at the right lexeme, look at the top of the lexeme page. If these are all right, you are probably in the right place:
      • Lemmata: headword/citation form
      • Language - Wikidata item for language of the Lexeme
      • Lexical category = part of speech of the lexeme
    • How to improve the monolingual dictionary
      • Add lexemes for words AND prove they exist by adding external identifiers, adding gloss quotes (definitions from public domain dictionaries), adding usage examples (quotes that use a lexeme), adding "described by source", adding "described at URL"
  • It serves as a Multilingual Dictionary
    • Equivalents of words or concepts in different languages
    • How to find equivalents:
      • Use a script!
        • LexemeTranslations.js by User:Nikki (from a sense): https://www.wikidata.org/wiki/User:Nikki/LexemeTranslations.js
        • SenseForThisItem.js by User:Lectrician1 (from the item): https://www.wikidata.org/wiki/User:Lectrician1/SenseForThisItem.js
    • Where do these equivalences come from?
      • Senses to corresponding Wikidata items: "item for this sense", "predicate for", "demonym of"
        • Not every concept has a Wikidata item. When no appropriate item exists, senses are linked directly to other senses with "translation" or "synonym"
      • Other properties linking senses: antonym, hyperonym, pertainym
    • How to improve the multilingual dictionary: link lexemes to one another! If an item corresponding to a concept exists and the lexeme is not a verb, use "item for this sense". If it is a verb, use "predicate for", etc.
  • It serves as an Etymological Dictionary
    • Words with roughly the same origin
    • There are no real dedicated tools to surface these relationships, but the Wikidata Query Service works (queries in slides)
    • Questions:
      • Will these queries work with indigenous languages?
      • Are there transliteration properties for people who can't read a script?
    • Comments:
      • Are these the only properties linking lexemes to their origins? No way! Also: "semantic derivation", ...
    • How to improve the etymological dictionary:
      • Add links from lexemes to their originating lexemes: add "derived from lexeme", "mode of derivation", and "semantic derivation"
  • It serves as a Morphological Dictionary
    • Pronunciation, tenses
    • How to surface this information: Hangor can show much of it, Ordia shows a little less, or just scroll down to "forms" on a lexeme page
    • How does modeling happen?
      • Forms represent inflections of a word, or variations of a base form
      • What forms are needed will depend on the lexeme's language: don't add a "feminine plural" or a "comparative" form if your language has no such thing!
      • "pronunciation audio" links to an audio file on Wikimedia Commons
      • "IPA transcription" provides the International Phonetic Alphabet representation of a pronunciation
      • "pronunciation" is only used with non-standard schemes for representing pronunciation

    • How to improve the morphological dictionary
      • Add information to forms: use Lingua Libre, use lexeme-forms tool to add forms to lexemes in your language, add other properties, especially if they cannot be predicted from other information on the lexeme
  • It serves as a Phraseological Dictionary
    • Ordia can show some of this; Hangor can show a little less. It is easier to scroll down to the "combines lexemes" statements on a lexeme page
    • To combine lexemes to make new lexemes:
      • "combines lexemes", "series ordinal", "object form", "object sense" (the last two record which form/sense is being used)
      • Information about how to put lexemes together: there is a way to model syntax; positions and relationships will be explained in a later session
    • How to improve phraseological data:
      • Add lexemes for compounds and link them to their parts
      • Compounds may be findable using Mishramilian
      • Make sure that the parts of those compounds are marked using the properties described here!
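
For the etymological dictionary discussed above, the queries from the slides are not reproduced in these notes. The following is a minimal sketch of what such a query could look like, assuming the property ID P5191 for "derived from lexeme" and relying on the prefixes that the Wikidata Query Service predeclares. The function names and the lexeme ID L1 are placeholders chosen for this sketch, not taken from the session.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def derived_lexemes_query(origin_lexeme_id: str, limit: int = 50) -> str:
    """Build a SPARQL query listing lexemes derived from the given lexeme.

    Assumes P5191 is "derived from lexeme" and that the wd:, wdt:,
    wikibase: and dct: prefixes are predeclared by the query service.
    """
    if not (origin_lexeme_id.startswith("L") and origin_lexeme_id[1:].isdigit()):
        raise ValueError(f"not a lexeme ID: {origin_lexeme_id!r}")
    return (
        "SELECT ?lexeme ?lemma ?language WHERE {\n"
        f"  ?lexeme wdt:P5191 wd:{origin_lexeme_id} ;\n"
        "          wikibase:lemma ?lemma ;\n"
        "          dct:language ?language .\n"
        "}\n"
        f"LIMIT {limit}"
    )


def query_url(sparql: str) -> str:
    """Return a GET URL that asks the endpoint for JSON results."""
    return f"{WDQS_ENDPOINT}?{urlencode({'query': sparql, 'format': 'json'})}"


# L1 is a placeholder ID; substitute the lexeme you are interested in.
print(derived_lexemes_query("L1"))
```

The query text can also be pasted directly into the Wikidata Query Service interface; `query_url` is only needed when fetching results from a script.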

Session 2 (November 19) - Working session improving existing lexemes and using tools to find lexemes to create


Presentation slides

Session Dashboard

Notes

How to improve Lexemes:

  • Prove your lexemes exist!
  • Add senses to those lexemes! There needs to be at least one sense for a lexeme to be "done"
  • You can query to find lexemes in a particular language that do not yet have a sense
  • Tool for adding senses
    • Add adjectives
    • Add nouns
    • Add verbs
    • Add adverbs
    • Only English lexemes with OED ID
    • Only English lexemes with DARE ID
  • Question: Do you try to keep the meaning short rather than exhaustive?
  • You can turn on a script to add a Merriam-Webster frame to your display, and then copy definitions from there: go to https://www.wikidata.org/wiki/Wikidata:Tools/Enhance_user_interface and search for "Merriam Webster Iframe"
  • Tool for making lexemes for words that don't have them already: Mishramilian: Mix-n-Match for Lexemes!
  • Define parts of speech
  • Provide the meaning before creating
  • Can add several forms and meanings at the point of creation
  • Can add various inflections for a particular lexeme: Wikidata Lexeme forms (singular/plural for instance)
  • Question: There are many definitions in OED. Which one do you choose/how do you choose?
    • Adam: Add them all! Mahir: You can do that! You can also choose the most common one/the one you like the best.
  • Question: why aren't [different definitions of nave: part of church vs. wheel] just added as separate senses on the same lexeme?
    • Some words look the same but have different etymologies. These two have separate origins, so they need to be separate words.
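
The note above about querying for lexemes without a sense could look roughly like the following sketch. It assumes the Wikidata Query Service's predeclared prefixes (wd:, wikibase:, dct:, ontolex:) and uses Q1860, the item for English, as the example language. The function name is invented for this sketch.

```python
def lexemes_without_senses_query(language_item_id: str, limit: int = 100) -> str:
    """Build a SPARQL query for lexemes in one language that have no sense.

    language_item_id is the Wikidata item for the language, e.g. "Q1860"
    for English. Prefixes are assumed to be predeclared by the service.
    """
    if not (language_item_id.startswith("Q") and language_item_id[1:].isdigit()):
        raise ValueError(f"not an item ID: {language_item_id!r}")
    return (
        "SELECT ?lexeme ?lemma WHERE {\n"
        f"  ?lexeme dct:language wd:{language_item_id} ;\n"
        "          wikibase:lemma ?lemma .\n"
        # Keep only lexemes for which no sense exists at all.
        "  FILTER NOT EXISTS { ?lexeme ontolex:sense ?sense . }\n"
        "}\n"
        f"LIMIT {limit}"
    )


print(lexemes_without_senses_query("Q1860", limit=20))
```

Swapping in another language's item ID gives a worklist of lexemes that still need at least one sense before they can be considered "done".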

Session 3 (December 3) - Working session modeling parts of complex lexemes and using them to help other languages


Session Dashboard

Notes
  • Continued using Slides from Session 2
  • Improving Multilingual Dictionary:
    • Equivalents to words, equivalents to concepts
    • Link lexeme senses to one another!
    • If an item corresponding to a concept exists, and the lexeme is not for a verb, use 'item for this sense'. Make sure you pick exactly the right item corresponding to the sense. Example: Uztai, which is the same thing as a bow (elastic launching device). Linking senses allows for automated translation
    • If an item corresponding to a concept exists, and the lexeme is for a verb, use 'predicate for' instead
    • If no item exists, then use 'translation' or 'synonym' as appropriate
  • Abstract Wikipedia
    • Choose a concept: There's a lexeme for the item in a language. A drop-down list renders lexemes in other languages for the same concepts which you can then search in Wikidata and connect, if found. Based on 'item for this sense' links. Allows for translation to happen between languages with accuracy
    • QUESTION: There are two senses linked for Bokmål. How does it choose? Does it just take the first alphabetically (idrett)?
      • It chooses based on the earlier L-number, so quasi-randomly
    • QUESTION: Will that break grammatical rules within languages?
      • The code is written with rules for each language in mind
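
The rule of thumb above for choosing a linking property can be written out as a small decision function. This is a sketch of the rule as stated in these notes, not code from any Wikidata tool; the function name and the item ID Q1 in the usage line are placeholders.

```python
from typing import Optional


def linking_property(item_id: Optional[str], lexical_category: str) -> str:
    """Pick the property used to link a sense, following the notes above.

    item_id is the Wikidata item for the concept, or None if no
    appropriate item exists. lexical_category is a plain label such as
    "noun" or "verb".
    """
    if item_id is None:
        # No item for the concept: link sense to sense instead.
        return "translation or synonym"
    if lexical_category == "verb":
        return "predicate for"
    return "item for this sense"


print(linking_property("Q1", "noun"))
print(linking_property("Q1", "verb"))
print(linking_property(None, "noun"))
```

The check for a missing item comes first because, without an item, the lexical category no longer matters: the only option left is a sense-to-sense link.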

Instructions for today's work:

1. Look at the item you've been assigned.
1.1. It should have a simple, two-word English phrase as its label; this phrase is what you will be creating a lexeme for.
2. Add the lexeme for that phrase to Wikidata.
2.1. Go to Mishramilian (mishramilian.toolforge.org).
2.2. If you see "Log in" at the top of the page, then click that to log in.
2.3. Click the "Filter" button on the right of the search box.
2.4. In the "Select Languages" dropdown, select English.
2.5. Now search for the English phrase you were assigned.
2.6. Click on any of the results for that exact phrase that show up.
2.7. In the "Create a new lexeme" box, if the second dropdown is empty, search in that box for "noun" (or type "Q1084", which is the item ID for "noun").
2.8. In the "Sense gloss (required!)" box, tell what the phrase you were assigned means.
2.9. Click the "Create and add forms for English noun" button. (If a new tab does not open, you may need to allow the pop-up to display.)
2.10. If it says you need to log in to that new tab to make changes, you should do that. (This is a different tool, after all.)
2.11. Add the singular and plural of this noun and click "Edit".
2.12. You should be redirected to the lexeme you created, complete with a sense, two forms, and an identifier.
2.13. (If there were multiple results for the phrase you were assigned, you can copy the lexeme ID for the lexeme you created, go back to those other results, paste the ID onto the "Other lexeme ID" box, and click "Add to other lexeme".)
3. Link the item to the lexeme's sense.
3.1. Under the sense that was added (that is, under the definition of the phrase you provided), click "Add statement".
3.2. Search for the property "item for this sense".
3.3. Use the item for the phrase you were assigned as the value for that property.
3.4. Click "publish".
4. Add statements to the lexeme describing what words are being combined.
4.1. Under the identifier on that lexeme, click "Add statement".
4.2. Search for the property "combines lexemes".
4.3. Search for the first word in the phrase you were assigned. Make sure to select the "English, noun", and not another language or lexical category!
4.4. Click "Add qualifier".
4.5. Search for the property "series ordinal".
4.6. Use the value "1", since this is the first word.
4.7. Repeat steps 4.4-4.6 with the following:
4.7.1. Property P9764, with value "2".
4.7.2. Property P9763, with value "nominal attribute".
4.8. Click "publish".
4.9. Repeat the above steps with the second word, using the following qualifiers:
4.9.1. series ordinal: 2
4.9.2. P9764: 0
4.9.3. P9763: syntactic root
5. Indicate that the phrase you were assigned is a type of the second word in that phrase. (For instance, a "tree house" is a type of "house".)
5.1. Under the "combines" statements you added, click "Add statement".
5.2. Search for the property "instance of".
5.3. Use the value "endocentric compound".
5.4. Click "publish".
6. (Now that you have gotten ahead, here are a few things you can check:)
6.1. (If the lexemes for either of the words in the phrase you were assigned do not have a sense, add a sense to them.)
6.2. (If, for a sense you add, there is an item for that sense, just as there was an item for the phrase you were assigned, then perform step 3 above on that new sense with that item.)
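
To summarize steps 4 and 5, the statements that end up on the new lexeme can be sketched as plain data. The dictionaries below are an illustration only: they are not the Wikibase JSON format, the function name is invented, and the lexeme IDs in the usage example are placeholders. The qualifier values are the ones given in the steps above.

```python
def two_word_compound_statements(first_lexeme: str, second_lexeme: str) -> list[dict]:
    """Describe the statements for a two-word noun phrase such as "tree house".

    first_lexeme and second_lexeme are the lexeme IDs of the two words,
    in the order they appear in the phrase.
    """
    return [
        {
            # Step 4.1-4.8: the first word, attached to the second word.
            "property": "combines lexemes",
            "value": first_lexeme,
            "qualifiers": {
                "series ordinal": "1",
                "P9764": "2",
                "P9763": "nominal attribute",
            },
        },
        {
            # Step 4.9: the second word, which is the root of the phrase.
            "property": "combines lexemes",
            "value": second_lexeme,
            "qualifiers": {
                "series ordinal": "2",
                "P9764": "0",
                "P9763": "syntactic root",
            },
        },
        {
            # Step 5: the phrase is a type of its second word.
            "property": "instance of",
            "value": "endocentric compound",
            "qualifiers": {},
        },
    ]


# Placeholder IDs; use the real lexeme IDs of the two words in your phrase.
for statement in two_word_compound_statements("L-first-word", "L-second-word"):
    print(statement)
```

Laying the statements out this way makes it easy to check your finished lexeme against the steps: two "combines lexemes" statements with three qualifiers each, plus one "instance of" statement.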

Session 4 (December 17) - Working session modeling and forming sentences using Wikidata lexemes and items


Session Dashboard

Zoom link to join all sessions:
https://washington.zoom.us/j/99751624871?pwd=9aP5kaCixHlMRwCbppaZHitt8Un3wZ.1

Meeting ID: 997 5162 4871
Passcode: 113059

  • Includes link to pre-participation survey
  • Pre-participation survey link (you only need to fill this out once for the entire series)

FAQ


How can I make sure I know about upcoming sessions?

How do we use the series Etherpad?
Our Etherpad is hosted by Wikimedia, and we are using it to take notes and keep track of questions and answers discussed during the sessions. After sessions are finished, we will update our series project page accordingly.

Where can I chat more about lexicographical data?
Join the Telegram group!

Collaborators


Project Lead:

Mahir Morshed (Q89633049)

Project Page Maintainers:

Crystal Yragui

How to Add Yourself to the Participants Section


First: make sure that you are LOGGED IN!

Then simply type four tildes (~), like this: ~~~~, below the numbered list. This automatically generates your signature and time stamp. (If you would like your username to appear after a bullet point, type * using the asterisk key before adding the tildes.)

Alternatively, once you are in Edit mode, you can click on the four tildes below as seen in this image:

Screenshot: how to sign your name as a participant of this page

Participants
