User talk:AGutman-WMF


Welcome to Wikidata, AGutman-WMF!

Wikidata is a free knowledge base that you can edit! It can be read and edited by humans and machines alike and you can go to any item page now and add to this ever-growing database!

Need some help getting started? Here are some pages you can familiarize yourself with:

  • Introduction – An introduction to the project.
  • Wikidata tours – Interactive tutorials to show you how Wikidata works.
  • Community portal – The portal for community members.
  • User options – including the 'Babel' extension, to set your language preferences.
  • Contents – The main help page for editing and using the site.
  • Project chat – Discussions about the project.
  • Tools – A collection of user-developed tools to allow for easier completion of some tasks.

Please remember to sign your messages on talk pages by typing four tildes (~~~~); this will automatically insert your username and the date.

If you have any questions, don't hesitate to ask on Project chat. If you want to try out editing, you can use the sandbox. Once again, welcome, and I hope you quickly feel comfortable here and become an active editor for Wikidata.

Best regards! --VIGNERON (talk) 19:31, 17 June 2022 (UTC)

Small mistake on Grammatical Person proposal

Hi,

There is a small mistake on Wikidata:Property proposal/Grammatical Person: you wrote « personal pronoun (Q468801) or personal pronoun (Q468801) », which is the same item twice. I'm curious: what lexical category other than personal pronoun (Q468801) would be acceptable for grammatical Person?

Cheers, VIGNERON (talk) 19:31, 17 June 2022 (UTC)

Thanks! I've corrected this now. The intention was to also include the category pronoun (Q36224), as I thought it might be desirable for consistency (although non-personal pronouns will only have the "third person" feature). It is not strictly needed. AGutman-WMF (talk) 10:23, 20 June 2022 (UTC)

Forms

Hi,

This message on Wd:LD Special:Diff/1665973099 is very interesting but not relevant to the discussion there, so I decided to pursue it here in a separate thread.

I'm not suggesting anything, just stating how the data model is (or at least, how I have understood it for the last five years). And this model is, on purpose, very flexible. Forms can have grammatical features, but this is not mandatory (for instance, the single form of an adverb has no relevant grammatical features), and multiple forms can have the same set of grammatical features. In that case, it's a very good idea to add statement(s) to distinguish them, but again this is not mandatory (which can be very problematic for tools re-using them, but that is more of a tool problem).

Wikidata:Lexical Masks are very useful, but they expect languages to be consistent when they often are not (even highly standardised languages like French have some contradictory rules).

For dialectal forms (and really all lectal forms), I'm not sure a separate lexeme is the best way to go: lexically it's still the same lexeme, and on a Lexeme entity, statements and qualifiers are enough to deal with them without the need to duplicate data.

What do you think? VIGNERON (talk) 13:24, 26 June 2022 (UTC)

@VIGNERON Thanks for starting the thread here!
The documentation does state that the forms should contain "typically one [form] for each relevant combination of grammatical features, such as 2nd person / singular / past tense", and this very idea is encoded in the lexical masks. Of course, lexemes which have only a single form, such as adverbs, do not need to specify any (form-level) grammatical features.
I agree that this formulation does leave room for the flexibility of having multiple forms with the same set of features, but this flexibility comes at a cost. The basic idea behind Wikidata, as far as I understand, is that the data should be (easily) usable for computational applications. Specifically for lexemes, we want them to be usable in the Natural Language Generation domain within the Abstract Wikipedia project. If multiple forms have the same set of grammatical features, it is not clear what the criteria are for selecting one among the others. Presumably, there should be some statements (or a different language code?) to inform any downstream application, but given the vast "universe" of possible statements, the usage of such information is not straightforward. Moreover, it breaks the simple data model as described by the citation I gave above.
For maximum usability of the data, different aspects of variation should ideally be kept apart. Wikidata already lends itself quite neatly to a three-tier representation, which I think we should strive to maintain:
1) Lexemes of different languages are kept apart. In the same way, major dialects should also be kept apart, using different lexemes with different language codes. The distinction between a dialect and a language being mostly a social construct, I think the various language-community contributors should decide on this.
2) Inflected forms of the same lexeme (distinguished by grammatical features) should be listed as different forms, according to the rule given above (one form per set of features). One may be more lenient in languages which lack morphological inflection altogether, and re-purpose the notion of forms there for some other axis of variation, but mixing differing criteria in the list of forms makes the data less usable.
3) Orthographic variation (in a lexeme or a form) should be represented using the "spelling variant" facility, with some language code indicating the source of the variation if possible. Depending on the decision of the language community, here one may also list certain "light" dialectal variation (i.e. a spelling that in fact represents a different pronunciation, and is thus not a true "spelling variant") with an appropriate language code, though in general I think this is not ideal. Similarly, frequently used abbreviations, such as Mme. for Madame, may be put there. I would keep domain-specific abbreviations (e.g. USA state codes) listed as statements, as they are only used in very specific contexts.
I acknowledge that in some cases this data model may need to be broken, but this should be the exception rather than a common practice, and it should, in my opinion, be justified for every single case, in order to keep the versatility of data representation under control. AGutman-WMF (talk) 10:51, 27 June 2022 (UTC)
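The three tiers described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the classes Lexeme and Form, the function select_forms, the plain-text feature labels and the language codes are all hypothetical simplifications, not Wikidata's actual JSON or API.

```python
from dataclasses import dataclass, field


@dataclass
class Form:
    # Tier 3: spelling variants of one form, keyed by language code.
    representations: dict
    # Tier 2: the set of grammatical features that identifies this form.
    features: set


@dataclass
class Lexeme:
    # Tier 1: each language (or major dialect) gets its own lexeme.
    language: str
    lemma: str
    forms: list = field(default_factory=list)


def select_forms(lexeme, features):
    """Return every form whose feature set equals the requested one."""
    wanted = frozenset(features)
    return [f for f in lexeme.forms if frozenset(f.features) == wanted]


# Hypothetical, simplified stand-ins for the lexemes discussed in this thread.
centre = Lexeme("en", "center", [
    Form({"en": "center", "en-gb": "centre"}, {"infinitive"}),
])
cactus = Lexeme("en", "cactus", [
    Form({"en": "cactus"}, {"singular"}),
    Form({"en": "cacti"}, {"plural"}),
    Form({"en": "cactuses"}, {"plural"}),
])

# One form per feature set: the lookup is unambiguous, and the spelling
# is then chosen by language code.
print(select_forms(centre, {"infinitive"})[0].representations["en-gb"])

# Two forms share the same feature set: the caller cannot tell which to use.
print([f.representations["en"] for f in select_forms(cactus, {"plural"})])
```

With one form per feature set, a generation system only needs the feature set and a language code; with duplicated feature sets it needs some further, non-standardized criterion.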
Hi,
Very interesting, and I mostly agree with you.
I'm just not sure why multiple forms with the same set of grammatical features would not be « (easily) usable for computational applications »: as long as the data is there, it's there. It doesn't matter (for a machine) whether the data is in one place or another (for SPARQL, wikibase:grammaticalFeature or a property like variety of lexeme, form or sense (P7481) or language style (P6191)). Plus, the "universe" of possible statements is not that vast: there are only 22 properties under the class Wikidata property for lexicographic senses (Q54275340), and only a few of them are used in forms, so it still feels pretty "straight-forward". Meanwhile, having two lexemes with almost all the data identical is very redundant and can cause a lot of trouble.
To be sure I understand you correctly, let's take some examples. A simple one to start: cactus (L4460). Would you split it into the pairs cactus-cactuses and cactus-cacti? Now some less simple cases: center/centre/centre (L30443), would you create 2 lexemes or more? (And what do you think about putting several representations on the same form? I feel it would hinder applications for text annotation.) Or ki (L69): it has two plurals, "chas" (a suppletive one and the most common one) and "kon" (a regular but very rare one); despite that, it's still grammatically the same lexical unit. Here, as Breton is not standardized, it would result in duplicating almost all words (if not multiplying them by 4 or 5: it's not uncommon for a verb to have 5 different infinitives :/ plus there are 4 more-or-less "major" dialects for Breton, so are you suggesting 20 lexemes, with almost the same data outside the forms, where we have only one right now?). In both cases, the inflection is hard to qualify: it's not grammatical nor really orthographic (or not "just" that at least), it's more/also cultural I guess (which is why there are no statements yet, which is indeed a very bad thing for machines, but maybe you would have an idea).
I'm sure our current application of the model is not perfect, but I'm not sure the proposal really solves the problem, or that it doesn't cause more trouble. I'd love to know what you think.
PS: I'll leave abbreviations out for now; this is a different and more specific can of worms.
Cheers, VIGNERON (talk) 15:48, 28 June 2022 (UTC)
While one could qualify forms with statements, there doesn't seem to be a standard way to do this. While the distinction of forms by grammatical features is relatively straightforward, distinguishing between orthographic variants using statements is more challenging, and allows for various annotation options (and often it is not done at all, judging by the examples I've seen so far). This in turn makes the choice made by a downstream Natural Language Generation system arbitrary, which is not necessarily what we want.
Looking into your examples, the case of center/centre/centre (L30443) is the straightforward one. The different orthographic variations are exactly that, only spelling variations, while the meaning or pronunciation does not change (except for the expected accent differences), so the current representation is the correct one: a single lexeme with spelling variants. [Aside: the senses are completely off, since the lexeme refers to the verb "to center".]
The case of cactus (L4460) is more complicated. From a purely linguistic point of view, one may indeed argue that there are in fact three lexemes here: the invariable cactus lexeme, the cactus/cacti lexeme, and the cactus/cactuses lexeme. However, this doesn't feel entirely satisfactory, because the difference between those lexemes is only in the plural form. An alternative would be to admit that English has a class of words with two plurals (i.e. words of Latin origin in which the Latin plural is still used) and create a specialized grammatical feature for these cases, i.e. distinguish between a regular plural and a 'latinate plural' (the invariable 'cactus' would still need to be treated specially, but one could possibly see it as a "light" variation of 'cactuses', or alternatively not mark the form 'cactus' for number at all; note that the current annotation of 'cactus' as both singular and plural is problematic, because nothing tells us there is a disjunction there).
As for ki (L69), I don't know too much about Breton, but given that I see there many forms which are annotated by their phonological mutation class, I think it is legitimate to list chas as a plural form with the suppletion (Q324982) feature.
The crucial thing, in my opinion, is to come up, for each language, with a clear model of which grammatical features are relevant to capture the various forms of lexemes of that language and what, conversely, counts as orthographic variation, and then to apply the decided-upon model as consistently as possible. Such a model may take into account a certain class of not-strictly-grammatical variations (like the English double plurals) if it happens frequently enough (by adding grammatical features to distinguish such cases), or one can conversely decide that such variation requires the creation of extra lexemes. Creating duplicate forms (with the same grammatical features) within the same lexeme seems to me to be the worst solution, as it opens the way to an ad-hoc distinction using an open set of statements (even if it's only 22). As I mentioned above, such an approach may be used in some exceptional cases, but preferably not on a regular basis. AGutman-WMF (talk) 13:07, 30 June 2022 (UTC)
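As a rough illustration of the "one form per set of grammatical features" rule, the sketch below flags feature sets that are shared by several forms of one lexeme. The function name duplicate_feature_sets and the (representation, features) pairs are assumptions made for this example, and "latinate plural" is the hypothetical feature suggested above, not an existing Wikidata item.

```python
from collections import defaultdict


def duplicate_feature_sets(forms):
    """Group forms by feature set and keep only the sets used more than once.

    `forms` is an iterable of (representation, features) pairs.
    """
    groups = defaultdict(list)
    for representation, features in forms:
        groups[frozenset(features)].append(representation)
    return {fs: reps for fs, reps in groups.items() if len(reps) > 1}


# Simplified stand-in for cactus (L4460) with two undistinguished plurals.
cactus_forms = [
    ("cactus", {"singular"}),
    ("cacti", {"plural"}),
    ("cactuses", {"plural"}),
]

# The same forms, with the hypothetical extra feature separating the plurals.
cactus_with_extra_feature = [
    ("cactus", {"singular"}),
    ("cacti", {"plural", "latinate plural"}),
    ("cactuses", {"plural"}),
]

print(duplicate_feature_sets(cactus_forms))
print(duplicate_feature_sets(cactus_with_extra_feature))
```

A check of this kind could be run per language to see how often the rule is broken, which may help a community decide whether a class of variation deserves its own feature or separate lexemes.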
Thanks, again I mostly agree with you, and yes, we need a clearer model (and not just for grammatical features).
That said, I feel you overestimate how standard languages are. To stay with English and weird plurals, the plural of "scenario" can be "scenarios", but also "scenaria" or "scenarii"; the same goes for "brother" and "brothers"/"brethren" (and so on, I could easily give thousands of examples). And English is among (amongst?) the most standard languages in the world; most other languages are less standard (Breton being a nightmare: 4 main dialects, 3 orthographies in the last century alone, unusual phenomena appearing a bit everywhere, etc.).
PS: I corrected at least the conflicting "singular and plural" on L:L4460#F1 (I don't remember where, again we are bad at documentation, but it was decided that there is no disjunction for grammatical features; by the way, how do you understand L:L114#F3?).
Cheers, VIGNERON (talk) 13:18, 3 July 2022 (UTC)
@VIGNERON Languages are of course rife with irregularities. There is a beautiful quote of Wittgenstein concerning this. But nonetheless, when creating machine-readable dictionaries (as Wikidata is), one should strive to systematize the irregularities into a coherent data model. One could argue that Wikidata's data model lacks one level of grouping, namely variants of related forms (which are not spelling variants), and that this causes a lot of the discussion. Since it is difficult to change this at this point, one has to carefully examine the model's options when listing such variation. In many cases, the best option, IMHO, is to create more lexemes, which can then be qualified with statements at the lexeme level.
  • The example of brethren for instance, is easy: it is clearly another lexeme from brothers, not only because of the different pronunciation, but also because brethren has a nuance of meaning lacking in brothers (and indeed, it is represented separately as brethren (L317372)).
  • Non-standard forms such as scenaria or scenarii could also be represented as separate lexemes, with some statement (on the lexeme level) mentioning that they are rare and non-standard. This is in my opinion better than qualifying each form, for the reasons I gave above. There is of course also another possible choice, namely not to represent these forms at all (just as you wouldn't represent every possible typo), and indeed, many dictionaries don't list them.
  • In general, dialectal variation is easy to represent, since ideally each dialect should have its own language-code, and its own separate lexemes (if they differ from the main language).
At the end of the day, each community of contributors for a specific language must make its own call on how to represent the variation present in that language. As I wrote, the most important thing is to have a consistent approach (and hopefully, one as consistent as possible across languages as well, but this is probably difficult to achieve). The "one form per set of grammatical features" rule is an easy one to follow, so I hope contributors will (try to) adhere to it more strictly.
P.S. Thanks for fixing L:L4460#F1! In fact, I will use this lexeme as an example in my Quality Days talk (with due credit to you, of course!). As for L:L114#F3, as far as I understand, it exhibits the same problem: the form can be interpreted either as dual or as plural, so it should really be split into two forms for straightforward machine processing. AGutman-WMF (talk) 15:23, 4 July 2022 (UTC)
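The split suggested above can be sketched as follows, assuming that number values are mutually exclusive and that a form is a (representation, features) pair. The function split_conflicting, the NUMBER_VALUES set and the placeholder form are hypothetical; the actual content of L:L114#F3 is not reproduced here.

```python
# Assumption: a single form should carry at most one of these values.
NUMBER_VALUES = {"singular", "dual", "plural"}


def split_conflicting(representation, features, exclusive=NUMBER_VALUES):
    """Split one form into several if it carries mutually exclusive features.

    Each resulting form keeps the shared features plus exactly one of the
    conflicting values, so no implicit disjunction is left in the data.
    """
    features = set(features)
    conflicting = sorted(features & exclusive)
    if len(conflicting) <= 1:
        return [(representation, frozenset(features))]
    shared = features - exclusive
    return [(representation, frozenset(shared | {value})) for value in conflicting]


# Placeholder form annotated as both dual and plural.
for rep, feats in split_conflicting("FORM", {"nominative", "dual", "plural"}):
    print(rep, sorted(feats))
```

After the split, each form matches exactly one feature set, at the price of listing the same representation twice.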