Wikidata:Property proposal/dependency grammar relations

Dependency grammar relations

relationship to head in syntactic dependency

Originally proposed at Wikidata:Property proposal/Lexemes

Description: type of dependency relationship between the part being qualified and its corresponding head (denoted with "position of head")
Data type: Item
Allowed values: items for dependency relationships (set to be finalized)
Example 1: Using the values of combines lexemes (P5238) on cool as a cucumber (L559269) as an example:
Example 2: Using the values of combines lexemes (P5238) on মাছের তেলে মাছ ভাজা (L314705) as an example:
  • মাছ (L314663) → possession (Q3543662) (nmod:poss)
  • তেল (L314665) → adpositional phrase (Q1816188) (obl)
  • মাছ (L314663) → direct object (Q2990574) (obj)
  • ভাজা (L314671) → root node (Q1757074) (root)
Example 3: Using possible values of combines lexemes (P5238) on som man reder, ligger man (L345524) as an example, assigning them series ordinal (P1545) values 1 through 5 in order:
  • som (L448822) → dependent clause (Q1122094) (mark)
  • man (L449138) → subject (Q164573) (subj)
  • rede (L471173) → adverbial clause (Q132120) (advcl)
  • ligge (L303858) → root node (Q1757074) (root)
  • man (L449138) → subject (Q164573) (subj)
Example 4: Using possible values of combines lexemes (P5238) on sich einen feuchten Kehricht um etwas kümmern (L46007) as an example, assigning them series ordinal (P1545) values 1 through 7 in order (I am de-0, so the relationships might not be quite correct):
  • sie (L304142) → reflexive pronoun (Q953129) (expl:pv)
  • ein (L409194) → determiner (Q576271) (det)
  • feucht (L344856) → adjectival phrase (Q357760) (amod)
  • "Kehricht" → intensifier (Q17154617) (dislocated)
  • um (L6722) → case (Q128234) (case)
  • etwas (L497065) → adpositional phrase (Q1816188) (obl)
  • "kümmern" → root node (Q1757074) (root)
Planned use: addition to combines lexemes (P5238) values on multi-term lexemes
See also: object has role (P3831)

    position of head in syntactic dependency

    Originally proposed at Wikidata:Property proposal/Lexemes

    Description: position (as a value for series ordinal (P1545) on a "combines" statement) of the head to which the syntactic dependency relationship qualifying this combines lexemes (P5238) value points
    Data type: String
    Allowed values: values of series ordinal (P1545) on other combines lexemes (P5238) values on the lexeme
    Example 1: Using the current values of combines lexemes (P5238) on cool as a cucumber (L559269) as an example:
    Example 2: Using the values of combines lexemes (P5238) on মাছের তেলে মাছ ভাজা (L314705) as an example:
      • মাছ (L314663) → 2
      • তেল (L314665) → 4
      • মাছ (L314663) → 4
      • ভাজা (L314671) → (not needed, but see note in example 1)
    Example 3: Using possible values of combines lexemes (P5238) on som man reder, ligger man (L345524) as an example, assigning them series ordinal (P1545) values 1 through 5 in order:
      • som (L448822) → 3
      • man (L449138) → 3
      • rede (L471173) → 4
      • ligge (L303858) → (not needed, but see note in example 1)
      • man (L449138) → 4
    Example 4: Using possible values of combines lexemes (P5238) on sich einen feuchten Kehricht um etwas kümmern (L46007) as an example, assigning them series ordinal (P1545) values 1 through 7 in order (I am de-0, so the relationships might not be quite correct):
      • sie (L304142) → 7
      • ein (L409194) → 4
      • feucht (L344856) → 4
      • "Kehricht" → 7
      • um (L6722) → 6
      • etwas (L497065) → 7
      • "kümmern" → (not needed, but see note in example 1)
    Planned use: addition to combines lexemes (P5238) values on multi-term lexemes
    See also: object has role (P3831)
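
To illustrate how the two qualifiers would work together, here is a minimal Python sketch that pairs each component of Example 2 with its relation and head position and lists the resulting edges. The `Component` type and `edges` helper are hypothetical names made up for this illustration, the relation is written as the label given in parentheses in the examples rather than as an item ID, and the ordinals 1 through 4 are assumed to follow word order.

```python
from typing import NamedTuple, Optional


class Component(NamedTuple):
    """One combines lexemes (P5238) value with its qualifiers (hypothetical type)."""
    ordinal: str         # series ordinal (P1545), a string like the proposed head position
    lexeme: str          # lexeme ID of the component
    relation: str        # label of the dependency relation (stand-in for the item value)
    head: Optional[str]  # ordinal of the head; None for the root


# Example 2: মাছের তেলে মাছ ভাজা (L314705).
# Assumption: ordinals 1-4 follow the order in which the components are listed.
EXAMPLE_2 = [
    Component("1", "L314663", "nmod:poss", "2"),
    Component("2", "L314665", "obl", "4"),
    Component("3", "L314663", "obj", "4"),
    Component("4", "L314671", "root", None),
]


def edges(components):
    """Return (dependent lexeme, relation, head lexeme) triples, skipping the root."""
    by_ordinal = {c.ordinal: c for c in components}
    return [
        (c.lexeme, c.relation, by_ordinal[c.head].lexeme)
        for c in components
        if c.head is not None
    ]


for dependent, relation, head in edges(EXAMPLE_2):
    print(f"{dependent} --{relation}--> {head}")
```

Note that the same lexeme (L314663) appears twice with different heads, which is why the head is identified by ordinal rather than by lexeme ID.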

    Motivation

    This is an attempt to provide qualifiers to combines lexemes (P5238) that aid the construction of syntactic trees for multi-term lexemes in a dependency grammar framework (with an adaptation of Universal Dependencies in mind here):

    • The first property is intended to indicate the particular relationship between the lexeme being qualified (the dependent) and the lexeme on which it depends (the head). A first attempt at a mapping, made by @Tpt three years ago, exists, but a set of items better aligned with the UD relation set (or deviating from it where needed) still has to be determined. (The items used in the examples for the first qualifier are the closest equivalents I could think of when I wrote this proposal; they may well not be the ones actually used.)
    • The second property, ideally, would be a pointer to the actual "combines" statement containing the head of the relationship in which the lexeme is taking part, but absent such a datatype the next best thing would be to use the series ordinal (P1545) values on other combines lexemes (P5238) statements as a guide. Since most combines lexemes (P5238) statements are already qualified with series ordinal (P1545) (6989 out of 9447 lexemes), this would fit in nicely with those qualifiers, and the existence of a proper tree relationship among the components could be checked on the client side if desired.
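
The client-side check mentioned in the second point could look roughly like the following Python sketch. It is only one possible reading of "a proper tree relationship": exactly one component without a head, every head position naming another component's series ordinal, and no cycles. The function name `is_tree` and the dictionary layout are assumptions made for this illustration; the data comes from Examples 3 and 4.

```python
def is_tree(heads):
    """Check that {ordinal: head ordinal or None} describes a single rooted tree."""
    # Exactly one component may lack a head (the root).
    roots = [ordinal for ordinal, head in heads.items() if head is None]
    if len(roots) != 1:
        return False
    # Every head must be the ordinal of another component on the same lexeme.
    for ordinal, head in heads.items():
        if head is not None and (head not in heads or head == ordinal):
            return False
    # Following head pointers from any component must end at the root, never loop.
    for start in heads:
        seen = set()
        node = start
        while heads[node] is not None:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node]
    return True


# Example 3: som man reder, ligger man (L345524)
EXAMPLE_3 = {"1": "3", "2": "3", "3": "4", "4": None, "5": "4"}
# Example 4: sich einen feuchten Kehricht um etwas kümmern (L46007)
EXAMPLE_4 = {"1": "7", "2": "4", "3": "4", "4": "7", "5": "6", "6": "7", "7": None}

print(is_tree(EXAMPLE_3), is_tree(EXAMPLE_4))
```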

    Suggestions for improvement of both of these properties (or suggestions of an alternate manner of representing dependency grammar relationships) are welcome. Mahir256 (talk) 01:23, 1 July 2021 (UTC)

    Discussion

    • Question: Just out of curiosity, what applications does this data have? And why would these applications prefer this data over automated taggers trained on already published UD data? - Robert Važan (talk) 10:22, 1 July 2021 (UTC)
      @Robert Važan: This is intended for Abstract Wikipedia renderers that generate text by constructing syntax trees based on a dependency grammar. The overall dependency relation set is not expected to be aligned completely with UD (indeed, why would it be, when UD predates the concept of Wikidata lexemes and when these trees for multi-term lexemes should be manipulable in the course of text generation), and the outputs of taggers that do exist for some languages (perhaps all those noted on UD's home page) may not coincide with the dependency relation set we choose. It is thus not intended for entirely UD-compliant annotation of existing data, not just because for some languages (such as Breton and Kurmanji) the UD corpora that do exist are much smaller (which would affect the quality of the tagger used), but also because for the five Abstract Wikipedia focus languages, in addition to many others for which we have lexemes, UD corpora are to my knowledge nonexistent. (One of course is welcome to get inspired by the corpora that do exist in the course of annotating lexemes with this information.) Mahir256 (talk) 12:23, 1 July 2021 (UTC)
    • Support: Tree model of phrases. Useful in NLP. Most of the data can be generated semi-automatically. - Robert Važan (talk) 14:58, 1 July 2021 (UTC)
    • Question: Hi, can you throw together an example query in your proposal's Motivation that shows the output of the "list of ordered elements" against L559269, or the representation you are thinking of? My SPARQL is rusty here; would it just use ORDER BY P1545 or something? A few hypothetical hand-waving SPARQL queries and their likely result representation would help me.
    • I'm unclear on your examples, which seem to imply some "hierarchy" similar to a UD results overlay (your 1's, 2's, 3's, and 4's). I cannot see the data modeling, which could have different flexible data representations. A "hierarchy" (which might use a dictionary type like Python's), a "tuple", a "vector", and a "set" are different things and representations. I see you map "cucumber" to 1, and "as" to 4. Are they distances, levels, what?
      • The current uses of series ordinal (P1545) as a qualifier on combines lexemes (P5238) typically represent the order in which the components of a lexeme appear in it (e.g. "cool" appears first, "as" second, and so on). The component "as" having a P1545 qualifier of "2" and a "position of head" value of "4" means that an edge exists from "as" pointing to "cucumber" (which has a P1545 value of "4"). As Wikidata statements cannot have other statements or more complex structures as values, using these two properties to represent an edge in a dependency graph is the best approach that I can come up with at the moment.
    • Further, what does that mean to a developer, or to the presentation of a SPARQL result that might use different data modeling (data overlays) with a Lexeme data representation? In OpenRefine we might later have different Data Representations for various Data Models (actually we have this now, but plan to expand it more), and this might include Lexemes; even though the EPIC that I drafted with lots of ideas uses Records as examples, it could use Lexemes or anything needing Flexible Data Representation: https://github.com/OpenRefine/OpenRefine/issues/2825 -Thadguidry (talk) 02:57, 9 July 2021 (UTC)
    • Support: Thanks, that's what I needed to see and know, so yes this is useful! I would hope after approval that some nice docs and good examples are kept around. Let's not lose the knowledge. -Thadguidry (talk) 03:36, 9 July 2021 (UTC)
    • Support, no concerns. Regards, ZI Jony (Talk) 13:37, 9 July 2021 (UTC)
    • @Mahir256, Robert Važan, Thadguidry, ZI Jony: ✓ Done as syntactic dependency head relationship (P9763) and syntactic dependency head position (P9764). UWashPrincipalCataloger (talk) 16:39, 30 July 2021 (UTC)
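
On the question of data representation raised in the discussion: since each component carries its own series ordinal plus the ordinal of its head, a client could rebuild a nested "hierarchy" from that flat list. The Python sketch below is only one possible illustration, using the values from Example 3; the `nest` function, the dictionary keys, and the use of lemma forms as labels are assumptions made here and are not part of the proposal.

```python
from collections import defaultdict


def nest(components):
    """Turn {ordinal: (form, head ordinal or None)} into a nested dict.

    Assumes the input already forms a valid tree with exactly one root.
    """
    children = defaultdict(list)
    root = None
    for ordinal, (_, head) in components.items():
        if head is None:
            root = ordinal
        else:
            children[head].append(ordinal)

    def build(ordinal):
        form, _ = components[ordinal]
        return {
            "ordinal": ordinal,
            "form": form,
            "dependents": [build(child) for child in children[ordinal]],
        }

    return build(root)


# Example 3: som man reder, ligger man (L345524), keyed by series ordinal.
EXAMPLE_3_FORMS = {
    "1": ("som", "3"),
    "2": ("man", "3"),
    "3": ("rede", "4"),
    "4": ("ligge", None),
    "5": ("man", "4"),
}

tree = nest(EXAMPLE_3_FORMS)
print(tree["form"], [d["form"] for d in tree["dependents"]])
```

The numbers are therefore neither distances nor levels: each one names the series ordinal of the component's head, and the depth of a component only emerges once the pointers are followed.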