Wikidata:Wikidata Lexeme Forms

Wikidata Lexeme Forms is a tool to create a Lexeme with a set of Forms, e. g. the declensions of a noun or the conjugations of a verb, or to edit the Forms of an existing Lexeme.

Usage

You select a template on the index page (e. g. “English noun”), fill in the Forms based on the example sentences (the first Form will become the lemma), and then submit the page to create the Lexeme (to which you will be redirected).

If a Lexeme with the same lemma already exists, you will be warned of the duplicate and can decide whether you want to go ahead or not.

Finding words to add

There is no list of Lexemes that need to be added. The tool does not suggest any. You can randomly guess missing Lexemes, or you can search the internet or books for word lists, and see how many of those need to be added.

If you guess a lemma that matches an existing Lexeme, the tool will warn you not to create a duplicate. The “advanced” mode will let you specify which pre-existing Lexeme to add the Forms to.

Multiple variants

In some languages, there can be multiple Forms with the same grammatical features. For example, some German words have two genitive or dative singular Forms („des Hunds/Hundes“, „dem Kind/Kinde“). To create several Forms for them, you can specify the different variants, separated by slashes.

This should not be confused with multiple spelling variants of the same Form, e. g. “color/colour” in English. Those should be added as additional representations of the same Form, with different language codes indicating where the spelling is used, as seen on color/colour (L1347).

User script

User:Lucas Werkmeister/lexeme-forms.js is a user script that automatically adds links to this tool on Lexeme pages that don’t have any Forms yet, or that have Forms without any grammatical features. To use it, add the following line to your common.js:

mw.loader.load( '//www.wikidata.org/w/index.php?title=User:Lucas_Werkmeister/lexeme-forms.js&action=raw&ctype=text/javascript' );

The user script will automatically determine which template(s) best match the current Lexeme, based on its language, lexical category, and statements. This is not always unambiguous: when it suggests more than one template, you’ll have to decide which one matches the current Lexeme best. (Even when it only suggests one template, you should still make sure it’s correct, of course.) When it doesn’t suggest any templates at all, the tool might not support this kind of Lexeme yet; see § Language support for how you can add support for it.

Bulk mode

In bulk mode, you can create many Lexemes from the same template at once. You specify the Form representations in a single text field, where each line creates one Lexeme and the Forms of that Lexeme are separated by vertical bar (|) or tab characters. (Many spreadsheet and similar programs separate columns by tabs when copying into plain text, so you can prepare your Lexemes there and then paste them directly into bulk mode.) A line may also begin with a Lexeme ID, likewise separated from the Forms by a vertical bar or tab, to add (some) Forms to an existing Lexeme.

As in other modes, you can separate multiple variants of a Form with slashes. As in advanced mode, you can leave Form representations blank (i. e. have two or more consecutive separators, e. g. A||C) to skip creating those Forms. Lexemes that look like potential duplicates are always skipped; if you still want to create them, use one of the other modes, where you can confirm that you are aware of the potential duplicate and still want to go ahead.

After submitting the form, you will be shown the URLs of all the newly created Lexemes, along with warnings for any Lexemes that were skipped as potential duplicates.

Bulk mode is currently restricted to users who are autoconfirmed.
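
If you prepare bulk mode input with a script rather than a spreadsheet, the line format described above can be sketched roughly as follows. The function name format_bulk_line and its parameters are hypothetical, and the number and order of fields on each line depend on the template you use, which this sketch does not check.

```python
def format_bulk_line(forms, lexeme_id=None, separator="|"):
    """Build one line of bulk mode input (illustrative helper, not part of the tool).

    forms: one entry per template Form, in template order; each entry is a
    list of variant representations. An empty list leaves that Form blank,
    so the tool skips creating it.
    lexeme_id: optional ID of an existing Lexeme to add the Forms to.
    """
    fields = []
    if lexeme_id is not None:
        fields.append(lexeme_id)
    for variants in forms:
        for variant in variants:
            # These characters act as separators in bulk mode input.
            if any(ch in variant for ch in "|/\t\n"):
                raise ValueError(f"unsupported character in {variant!r}")
        # Multiple variants of the same Form are joined with slashes.
        fields.append("/".join(variants))
    return separator.join(fields)


rows = [
    [["Hund"], ["Hunds", "Hundes"]],  # second Form has two variants
    [["A"], [], ["C"]],               # middle Form left blank
]
print("\n".join(format_bulk_line(row) for row in rows))
```

Passing separator="\t" produces tab-separated lines instead, which the tool accepts equally.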

Edit mode

In edit mode, you edit the Forms of a particular Lexeme, specified in the URL: for example, to edit example (L2237) using the english-noun template, go to the URL https://lexeme-forms.toolforge.org/template/english-noun/edit/L2237. (Eventually, there should be a better way to reach this.) The tool will try to match the Lexeme’s existing Forms to the Forms in the template and sort them into the input fields accordingly. By editing the contents of the input fields, you can add, edit, or remove Forms: as usual, a slash can be used to separate multiple Forms with the same grammatical features, see § Multiple variants. If any Forms cannot be matched to a single template Form, they are listed at the top; you can drag’n’drop them into an input field to manually match them to a template Form, and grammatical features and statements will be added as needed.

Language support

To start adding support for a new language, enter the language name in English here, follow the instructions, and then {{Ping}} Lucas Werkmeister on the talk page:


To add a new template for an already supported language, go to the subpage for that language and start with the inputbox there. The following languages are currently supported:

Additionally, the following languages have translations but no templates yet, or their templates still need some more work before they can be added, or Lucas Werkmeister just hasn’t found the time to add them yet:


Please only add templates for languages you speak yourself, and speak well – there isn’t yet any tool that can be used to automatically migrate a large set of Forms to a different data model (e. g. replace a certain grammatical feature item ID with another one across all Lexemes in a certain language), so we should try to get this right from the start.

There are also instructions for transcribing these templates, though you shouldn’t need to worry about that part (that’s Lucas Werkmeister’s responsibility).

Monitoring

You can see edits made using this tool on the recent changes list. Updates to the tool are usually logged on Wikitech.

For edits made before the tool switched OAuth consumers for the toolforge.org migration, use this recent changes link instead.

Programmatic usage

The tool can also be used programmatically, e. g. by other tools or external code. Please don’t flood the tool with requests.

No promises as to the stability of any API are made, but breaking changes will most likely increase the API version number at the beginning of the path, and in that case the old path will likely be changed to return HTTP 410 Gone.

Duplicates API

To search for duplicates of a potential new Lexeme by its lemma (or, equivalently, to search for existing Lexemes by lemma), send a GET request to https://lexeme-forms.toolforge.org/api/v1/duplicates/www/language-code/lemma, where language-code is a language code like en or de-at and lemma is the lemma you’re looking for (which may contain slashes, if necessary). To search test.wikidata.org, replace the www with test.

The response is either a JSON array with objects for the search results, where each object has id, label, description and uri members, or HTTP 204 No Content if there are no results.

You must specify a header Accept: application/json when sending requests to this API, otherwise the results may be returned in an HTML format that’s specific to this tool and not useful outside of it. (Note that the curl command-line tool sends Accept: */* by default, which means you get HTML back if you don’t explicitly specify a different Accept header.)
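
As a rough illustration of the request described above, here is a minimal Python sketch using only the standard library. The function names, the User-Agent value, and the choice to percent-encode everything in the lemma except slashes are assumptions of this sketch, not documented behavior of the tool.

```python
import json
import urllib.parse
import urllib.request

DUPLICATES_URL = "https://lexeme-forms.toolforge.org/api/v1/duplicates"


def duplicates_url(language_code, lemma, wiki="www"):
    """Build the duplicates API URL; use wiki="test" for test.wikidata.org."""
    # Slashes in the lemma are kept as-is, since the API allows them;
    # percent-encoding all other special characters is an assumption.
    lemma_path = urllib.parse.quote(lemma, safe="/")
    return f"{DUPLICATES_URL}/{wiki}/{language_code}/{lemma_path}"


def find_duplicates(language_code, lemma, wiki="www"):
    """Return a list of result objects (id, label, description, uri)."""
    request = urllib.request.Request(
        duplicates_url(language_code, lemma, wiki),
        headers={
            # Without this header the tool may respond with HTML.
            "Accept": "application/json",
            # Hypothetical value; replace with something identifying your code.
            "User-Agent": "example-duplicates-client/0.1",
        },
    )
    with urllib.request.urlopen(request) as response:
        if response.status == 204:
            # HTTP 204 No Content: no existing Lexemes with this lemma.
            return []
        return json.load(response)
```

Calling find_duplicates("en", "musher") would then return either an empty list or the parsed result objects; keep the request rate low, as asked above.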

Matching API

To match a Lexeme against all the templates the tool knows, send a GET request to https://lexeme-forms.toolforge.org/api/v1/match_template_to_lexeme/www/lexeme-id, where lexeme-id is the ID of the Lexeme (e. g. L123). To match a Lexeme against just one template, append it to the URL, i. e. https://lexeme-forms.toolforge.org/api/v1/match_template_to_lexeme/www/lexeme-id/template-name. To use test.wikidata.org Lexemes and templates, replace the www with test.

The response is a JSON object: the first URL form (all templates) returns a map from template names to match objects, whereas the second (a single template) returns a single match object directly. Match objects have the following structure:

{
  "language": true,
  "lexical_category": false,
  "matched_statements": {},
  "missing_statements": {},
  "conflicting_statements": {}
}

"language" and "lexical_category" indicate whether the Lexeme matches the template’s language and lexical category, respectively. "matched_statements", "missing_statements" and "conflicting_statements" are statement lists, with the same format as the "claims" in the Wikibase JSON data model, containing statements in the Lexeme that match statements expected by the template, statements in the template that don’t match any statements in the Lexeme, and statements in the Lexeme that conflict with statements expected by the template. (Whether an extra Lexeme statement is considered a conflict or not currently depends on the property used: extra instance of (P31) are fine, extra grammatical gender (P5185) are not. This might change in the future if templates with other statements are introduced.)

This API is used by the user script documented in a previous section.
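
A client might interpret match objects along the following lines. This is only a sketch: the helper names are hypothetical, and the rule used here (language and lexical category match, and no conflicting statements) is an assumption for illustration, not necessarily how the user script selects its suggestions.

```python
MATCH_URL = "https://lexeme-forms.toolforge.org/api/v1/match_template_to_lexeme"


def match_url(lexeme_id, template_name=None, wiki="www"):
    """Build the matching API URL for all templates or for a single one."""
    url = f"{MATCH_URL}/{wiki}/{lexeme_id}"
    if template_name is not None:
        url += f"/{template_name}"
    return url


def is_plausible_match(match):
    """Assumed rule: language and lexical category match, nothing conflicts."""
    return bool(
        match["language"]
        and match["lexical_category"]
        and not match["conflicting_statements"]
    )


def plausible_templates(matches):
    """Filter a map of template names to match objects (all-templates response)."""
    return [name for name, match in matches.items() if is_plausible_match(match)]


# The example match object from above: the lexical category does not match.
example = {
    "language": True,
    "lexical_category": False,
    "matched_statements": {},
    "missing_statements": {},
    "conflicting_statements": {},
}
print(is_plausible_match(example))
```

A stricter client could additionally require "missing_statements" to be empty, at the cost of rejecting Lexemes that simply lack a statement the template would add.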

Templates API

To get the templates which the tool uses, send a GET request to https://lexeme-forms.toolforge.org/api/v1/template/template-name, where template-name is the name of a single template to return, or omit template-name altogether to get all templates at once. The response is a JSON object, either a single template or a map from template names to templates.

Note that most templates were contributed by Wikidata users on wiki pages, and they were not asked to license them under any special license, so the templates are published under the same license as the non-structured data of Wikidata, CC BY-SA 3.0. The "@attribution" member of each template object contains the names of the "users" who contributed to the template, as well as the "title" of the Wikidata page where the full history may be seen.
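
If you reuse templates elsewhere, you may want to turn the "@attribution" member into a credit line. The sketch below assumes that "users" is a list of user name strings and "title" is a single string; the helper name and the sample data are made up for illustration.

```python
def attribution_line(template):
    """Format a credit line from a template object, or None if it has none."""
    attribution = template.get("@attribution")
    if not attribution:
        return None
    # Assumption: "users" is a list of user names, "title" a page title.
    users = ", ".join(attribution["users"])
    title = attribution["title"]
    return f"Template by {users}; full history at {title} (CC BY-SA 3.0)"


# Made-up sample data, shaped like the description above.
sample_template = {
    "@attribution": {
        "users": ["ExampleUserA", "ExampleUserB"],
        "title": "Example page title",
    },
}
print(attribution_line(sample_template))
```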

Automatically generating Forms

You can pre-populate the form shown on the page by specifying form parameters in the URL. Form representations can be given as a form_representation parameter (usually given more than once, one per Form); you can also specify where the pre-populated data comes from in a generated_via parameter, which is included in the edit summary (i. e. you can use [[these links]] but not [these ones]).

For example, the following URL was used to create the Lexeme musher (L42850) with just a single button press:

https://lexeme-forms.toolforge.org/template/english-noun/?form_representation=musher&form_representation=mushers&generated_via=manual%20input

The user can still verify all the Forms and check that they are correct before submitting the page, and the tool will also ask them to confirm the new Lexeme is not a duplicate (as will be the case if you load the above URL now, since musher (L42850) was already created).

You can use this feature to write tools or user scripts for automatically generating Forms. You only have to take care of the Form generation itself; Wikidata Lexeme Forms handles OAuth, grammatical feature item IDs and duplicate detection, and the user still has an opportunity to correct any problems with the auto-generated content before it is added to Wikidata.
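
A tool that generates Forms could build such a pre-populated URL as in the following sketch. The function name prefilled_url is hypothetical; the parameter names form_representation and generated_via are the ones described above. Spaces are encoded as %20 here to mirror the example URL.

```python
from urllib.parse import quote, urlencode


def prefilled_url(template_name, representations, generated_via=None):
    """Build a URL that opens a template with the Forms already filled in."""
    # One form_representation parameter per Form, in the order given.
    params = [("form_representation", r) for r in representations]
    if generated_via is not None:
        params.append(("generated_via", generated_via))
    # quote_via=quote encodes spaces as %20 rather than "+".
    query = urlencode(params, quote_via=quote)
    return f"https://lexeme-forms.toolforge.org/template/{template_name}/?{query}"


print(prefilled_url("english-noun", ["musher", "mushers"], "manual input"))
```

With these arguments, the function reproduces the example URL for musher (L42850) shown above.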