User:Smalyshev (WMF)/Lexeme search

This page is a work in progress, not an article or policy, and may be incomplete and/or unreliable.
Please offer suggestions on the talk page.

Indexing and searching Lexemes using ElasticSearch.

Indexing

We index each lexeme as a single document. The standard MediaWiki fields (title, etc.) are indexed by default.

For a lexeme, we need to index the following:

Lemma: a language string. We need at least case folding and possibly normalization here; this is probably a good match for the label.* fields we use for Wikidata. A lemma can have more than one language string if the language has variants; each string goes into the appropriate language field. We also copy them all to labels_all.
Language and lexical category: as Q-id keywords, not analyzed.
(?) Language as text string (e.g. "en"): TBD?
Glosses of each sense: analogous to descriptions and indexed as fulltext fields, but there can be more than one gloss per lemma.
TBD: any of the statements? We probably need to make this configurable.

Additionally, for each form, we need to index (as fields in the same document):

Form ID, as a keyword
Form representation, as an analyzed language string. We can probably put these into the label.* fields as aliases.
(?) Grammatical features as Q-id keywords?
TBD: any of the statements? We probably need to make this configurable.
Structured view of the Form, e.g.:

{
  "lexeme_forms": [
    {
      "id": "L1-F1",
      "representation": { "de": "Leiter" }
    },
    {
      "id": "L1-F2",
      "representation": { "de": "Leiters" }
    }
  ]
}

This field would not be indexed but would be used to distinguish matches and display results when searching.
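The field list above could be assembled into a single document roughly as follows. This is a minimal sketch in Python, used purely for illustration, and not the actual implementation. The input shape and the field names lemma_language, lexical_category and form_ids are assumptions; only label.*, labels_all and lexeme_forms come from the notes above. The Q-ids in the example are sample values.

```python
def build_lexeme_document(lexeme):
    """Flatten a lexeme into one search document (illustrative field names)."""
    labels = {}      # language code -> [lemma, form representation, ...]
    labels_all = []  # every string, regardless of language

    def add_label(lang, text):
        bucket = labels.setdefault(lang, [])
        if text not in bucket:
            bucket.append(text)
        if text not in labels_all:
            labels_all.append(text)

    # Lemma strings are added first; form representations follow as aliases.
    for lang, text in lexeme["lemmas"].items():
        add_label(lang, text)

    structured_forms = []
    for form in lexeme.get("forms", []):
        for lang, text in form["representations"].items():
            add_label(lang, text)
        structured_forms.append(
            {"id": form["id"], "representation": dict(form["representations"])}
        )

    return {
        "lemma_language": lexeme["language"],            # Q-id keyword (assumed name)
        "lexical_category": lexeme["lexical_category"],  # Q-id keyword (assumed name)
        "label": labels,
        "labels_all": labels_all,
        "form_ids": [f["id"] for f in structured_forms],  # assumed name
        "lexeme_forms": structured_forms,  # kept for display, not for matching
    }


leiter = {
    "id": "L1",
    "lemmas": {"de": "Leiter"},
    "language": "Q188",
    "lexical_category": "Q1084",
    "forms": [
        {"id": "L1-F1", "representations": {"de": "Leiter"}},
        {"id": "L1-F2", "representations": {"de": "Leiters"}},
    ],
}
doc = build_lexeme_document(leiter)
```

Duplicate strings are stored once per language, so a form whose representation equals the lemma does not add a second label entry; the structured lexeme_forms view still lists every form.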

Searching

The search works in a similar way to Wikidata search. Matching is against labels, with possible filters against keyword fields.

Lemma: we need to know in which languages we are searching. If the language has multiple scripts, we will need a label.* field for each of them, with the proper analyzer configured, and we need to know which ones we are searching. TBD: do Language classes help us with this?
While we can index any set of languages, we cannot match against too many specific label.* fields in one query. We can match against labels_all instead, but with less precise results. TBD: what role do language fallback chains play in all this?
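The trade-off between specific label.* fields and labels_all could be sketched as a query builder. The query body uses standard Elasticsearch query DSL clauses (bool, multi_match, match, term). The function name, the max_specific_fields threshold and the keyword field names lemma_language and lexical_category are assumptions made for illustration, not part of the design above.

```python
def build_lemma_query(text, languages, language_qid=None, category_qid=None,
                      max_specific_fields=5):
    """Build an Elasticsearch query body for a lemma search (sketch)."""
    if languages and len(languages) <= max_specific_fields:
        # Few languages: target the per-language analyzed fields directly.
        match = {
            "multi_match": {
                "query": text,
                "fields": [f"label.{lang}" for lang in languages],
            }
        }
    else:
        # Too many (or unknown) languages: fall back to the catch-all field,
        # accepting less precise results.
        match = {"match": {"labels_all": text}}

    # Optional filters against the Q-id keyword fields (assumed names).
    filters = []
    if language_qid:
        filters.append({"term": {"lemma_language": language_qid}})
    if category_qid:
        filters.append({"term": {"lexical_category": category_qid}})

    return {"query": {"bool": {"must": [match], "filter": filters}}}


query = build_lemma_query("Leiter", ["de"], language_qid="Q188")
broad = build_lemma_query("te", [])
```

Here query targets only label.de and filters on the language Q-id, while broad has no known language and therefore matches against labels_all.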

We still have two kinds of search: fulltext search and completion search. For completion search, we will have some additional restrictions on what we are actually looking for (see below).

The output of the search query should distinguish between a match on the Lemma and a match on a Form (using the structured view above). For a Lemma match, we display a generated description of the Lemma, e.g. "Leiter (L1): German noun". For a Form match, we display both the form and lemma descriptions, e.g. "Leiters (L1-F2): singular genitive of Leiter (L1): German noun". TBD: this is fine for full-text search, but may get too long or complicated for completion search.
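The two description formats can be produced by simple string templates. The following Python sketch assumes the labels for the language, lexical category and grammatical feature Q-ids have already been resolved; the function names are hypothetical.

```python
def describe_lemma(lemma, lexeme_id, language_label, category_label):
    """Format a lemma match, e.g. 'Leiter (L1): German noun'."""
    return f"{lemma} ({lexeme_id}): {language_label} {category_label}"


def describe_form(representation, form_id, feature_labels, lemma_description):
    """Format a form match by prefixing the form to the lemma description."""
    features = " ".join(feature_labels)
    return f"{representation} ({form_id}): {features} of {lemma_description}"


lemma_text = describe_lemma("Leiter", "L1", "German", "noun")
form_text = describe_form("Leiters", "L1-F2", ["singular", "genitive"], lemma_text)
```

With these inputs, lemma_text and form_text reproduce the two example strings given above.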

Completion search

Completion search uses only lemma and form fields for matching, not senses. There should be a way to specify whether we are looking specifically for Forms or for Lexemes. The search is implemented via the wbsearchentities API.

For the first iteration, Senses are not covered. Please consider any mention of Senses in the text as tentative discussion.
In completion search for Lexemes, lemma matches should be preferred; forms act as aliases.
In completion search for Forms, lemma matches should be ignored. Each Form match is reported as a separate result row.
Matches in all languages should be found. TBD: the ranking is not yet decided.
"Quick search" in the Lexeme namespace will display Lexemes, using Forms and their representations as aliases.
Lexeme and Form display would use the Lemma and Representation strings as described above. If a Lemma or Representation has multiple strings, the one with the shortest language code will be used. For Q-ids that are part of the description, the UI language will be used for their labels.
The search will allow hints such as language and lexical category Q-ids, which, if set, will filter the results to entries that have them.
ID match (L1 and L1-F2) should still work for Lexeme and Form searches respectively.
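Two of the rules above, ID matching and choosing the display string by shortest language code, are easy to illustrate. This Python sketch is not the actual implementation. The case-insensitive ID handling and the alphabetical tie-break between language codes of equal length are assumptions that the notes above do not specify.

```python
import re

LEXEME_ID = re.compile(r"^L\d+$")
FORM_ID = re.compile(r"^L\d+-F\d+$")


def classify_id(term):
    """Return 'form', 'lexeme' or None depending on whether term looks like an ID."""
    term = term.strip().upper()  # assumption: IDs are matched case-insensitively
    if FORM_ID.match(term):
        return "form"
    if LEXEME_ID.match(term):
        return "lexeme"
    return None


def pick_display_string(strings_by_language):
    """Pick the string whose language code is shortest.

    Ties are broken alphabetically by code (an assumption, for determinism).
    """
    lang = min(strings_by_language, key=lambda code: (len(code), code))
    return strings_by_language[lang]


kind_lexeme = classify_id("L1")
kind_form = classify_id("l1-f2")
kind_text = classify_id("Leiter")
display = pick_display_string({"en-gb": "colour", "en": "color"})
```

A search term that classifies as an ID can be routed to a direct lookup, while anything else goes through the label matching described earlier.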

Senses

Senses are not implemented yet; this section records notes for a future addition.

In completion search for Senses, matches should be done against the lemma. TBD: it is not clear how exactly this would work. For example, if a lemma has 20 senses, which ones are displayed? What if the API asked for just 5 results? Are glosses then matched only for full text, or at all?

TBD: do we display senses in the result? There could be more than one, which one do we choose?

Open questions

What should we do if there are many matches for a term? E.g. "te" has 63 entries in Wiktionary. The user may plausibly want to select any of them (and possibly their Forms). How do we provide that ability?

Fulltext search

We will use lemmas, forms and sense texts to locate the matches.

We use highlighting and the structured view of the forms to figure out where the match happened.
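One way to attribute a match to the lemma or to specific forms is sketched below in Python. It assumes matched_text is the text marked by the highlighter with any highlight tags already removed. The exact case-folded comparison is a simplification of what the configured analyzers would do. The function name and input shapes are hypothetical.

```python
def locate_match(matched_text, lexeme_id, lemmas, lexeme_forms):
    """Return (kind, id) pairs naming where matched_text occurs."""
    needle = matched_text.casefold()
    hits = []
    # Check the lemma strings first (language code -> lemma).
    if any(needle == lemma.casefold() for lemma in lemmas.values()):
        hits.append(("lemma", lexeme_id))
    # Then check every form in the structured view.
    for form in lexeme_forms:
        representations = form["representation"].values()
        if any(needle == rep.casefold() for rep in representations):
            hits.append(("form", form["id"]))
    return hits


forms = [
    {"id": "L1-F1", "representation": {"de": "Leiter"}},
    {"id": "L1-F2", "representation": {"de": "Leiters"}},
]
form_only = locate_match("leiters", "L1", {"de": "Leiter"}, forms)
both = locate_match("Leiter", "L1", {"de": "Leiter"}, forms)
```

The function reports every place that matched and leaves the choice between lemma and form matches to the caller.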

TBD: do we prefer form matches to lemma matches?

TBD: how do we display senses? There can be multiple senses and more than one could match, or none at all.

TBD: how do we rank the search results? We do not have sitelinks; should we factor in incoming links and statement counts somehow? Are there any other criteria? Whatever we use, it must be stored in some field on the document.