User talk:Smalyshev (WMF)/Lexeme search

Mapping for forms and representation

Do we really want language-specific analysis behaviors for forms? To save some fields, should we map the forms array to something like:

lexeme_forms: [
  {
    "id": "L1-F1",
    "representations": [
        {
           "lang": "de",
           "value": "Leiter" /* key name is illustrative */
        }
    ],
    "statements": [ /* filtered list of statements */ ]
  },
  {
    "id": "L1-F2",
    "representations": [
        {
           "lang": "de",
           "value": "Leiter" /* key name is illustrative */
        }
    ],
    "statements": [ /* filtered list of statements */ ]
  }
]

This will help with highlighting: the experimental highlighter can fetch sibling elements, so you can extract the language ID for free when highlighting. Additionally, I'm not sure we should apply any language-specific analysis to the form representations (except light normalization). The question then is how to fetch the form ID easily (this can be fixed by repeating the form ID under representations). DCausse (WMF) (talk) 14:16, 12 February 2018 (UTC)
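A minimal sketch of the mapping proposed above, with the form ID repeated under each representation. The input shape (a dict of language code to text per form) and the field names form_id and value are assumptions made for illustration, not the actual Wikibase serialization or index mapping, and the sample data is made up.

```python
from typing import Any


def build_lexeme_forms(forms: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Map simplified forms to the nested lexeme_forms layout discussed above."""
    indexed = []
    for form in forms:
        indexed.append({
            "id": form["id"],
            "representations": [
                # Repeat the form ID next to each representation so that a
                # highlighter hit on "value" can read it as a sibling field.
                {"form_id": form["id"], "lang": lang, "value": text}
                for lang, text in sorted(form["representations"].items())
            ],
            "statements": form.get("statements", []),
        })
    return indexed


# Hypothetical sample data.
lexeme_forms = build_lexeme_forms([
    {"id": "L1-F1", "representations": {"de": "Leiter"}},
    {"id": "L1-F2", "representations": {"de": "Leitern"}},
])
```

With this layout, a match on a representation carries its language and its form ID in the same object, at the cost of storing the ID twice.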

I agree about the language-specific analysis on forms. If you are searching for one specific form, you don't want to find the other related forms. OTOH, quotes help. TJones (WMF) (talk) 17:28, 12 February 2018 (UTC)
This is tricky. On one hand, we definitely don't want forms, stemming, etc. On the other hand, we probably want case folding, Unicode normalization, and maybe things like variant folding for multi-variant and multi-script languages. So I guess this is the question for @TJones (WMF): do we need language-specific analysis for that, and do we have something that already does that without doing the stemming part? If not, can we get it, and how hard would that be? The alternative may be to ignore variants and do the standard Unicode folding (case, normalization, etc.). It is not clear what to do with diacritics folding: do we want to find anos when we're looking for años and vice versa? I am a definite "I have no idea, need help from a linguist" on this. Smalyshev (WMF) (talk) 22:02, 14 February 2018 (UTC)
Language-specific folding is interesting indeed, but is it worth the price of another multilingual field? I would like us to investigate all workarounds before adding a new one. One possibility is to use the trick we use for the text.plain field in Wikipedia search: we emit two forms (for "foldable" tokens) at index time, so años would be indexed as both años and anos. At search time, on the other hand, we do not apply any hazardous case/accent folding, so searching for años would not find anos. This limits the ambiguity to searches that do not include any foldable characters: searching for anos would still find años. Is this a reasonable workaround? DCausse (WMF) (talk) 09:57, 15 February 2018 (UTC)
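The index-time trick described in the last comment can be sketched as follows. This is a toy in-memory index, not the actual text.plain analysis chain: fold() is a crude stand-in (lowercasing plus stripping combining marks) for whatever folding filter would really be used, and all function names are hypothetical.

```python
import unicodedata


def fold(token: str) -> str:
    """Lowercase and strip combining marks (crude stand-in for a folding filter)."""
    decomposed = unicodedata.normalize("NFD", token.lower())
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", stripped)


def index_tokens(token: str) -> set[str]:
    """Index time: emit the original token plus its folded variant."""
    return {unicodedata.normalize("NFC", token), fold(token)}


def search_token(query: str) -> str:
    """Search time: Unicode normalization only, no case or accent folding."""
    return unicodedata.normalize("NFC", query)


# Toy inverted index: token -> set of document IDs.
index: dict[str, set[str]] = {}
for doc_id, word in [("doc-años", "años"), ("doc-anos", "anos")]:
    for tok in index_tokens(word):
        index.setdefault(tok, set()).add(doc_id)

hits_accented = index.get(search_token("años"), set())
hits_plain = index.get(search_token("anos"), set())
```

Here hits_accented contains only the años document, while hits_plain contains both, which is the asymmetry the comment describes: the ambiguity only remains for queries that contain no foldable characters.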

TBD: do we prefer form matches to lemma matches?

Hard to tell for now; I'm not seeing any obvious reason to favor one or the other. It will really depend on the number of ambiguities we have between forms and lemmas of different entities. DCausse (WMF) (talk) 14:16, 12 February 2018 (UTC)

In completion searches for Lexemes, lemma matches should be preferred; forms act as aliases. In completion searches for Forms, lemma matches should be ignored. When searching for all kinds of entities via the "quick search" box, we may want to ignore Forms and only offer Lexemes (this is intuitive, since lexemes are "pages"). The same is probably true for "full text" search on Special:Search. We may want to offer a specialized UI for finding forms of specific word types in a given language, etc. -- 46.183.103.8 17:09, 13 February 2018 (UTC)
By "ignore forms" for quick search/fulltext search, do you mean showing only the lexeme in results, or not showing a result if a form matches but the lemma does not? I.e. if you search for "children", what would be shown: "child" (as the main lexeme), "children" (as a form), or nothing at all? Smalyshev (WMF) (talk) 22:17, 14 February 2018 (UTC)
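The completion rules proposed in this thread can be sketched as below. It implements only one possible reading (a form match surfaces its parent lexeme, ranked after lemma matches), which is exactly the point still under discussion. The Lexeme class, the function names, and the sample IDs are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Lexeme:
    id: str
    lemma: str
    forms: dict[str, str] = field(default_factory=dict)  # form ID -> representation


def complete_lexemes(prefix: str, lexemes: list[Lexeme]) -> list[str]:
    """Lexeme completion: lemma matches first, then lexemes matched via a form."""
    p = prefix.lower()
    lemma_hits = [lx.id for lx in lexemes if lx.lemma.lower().startswith(p)]
    alias_hits = [
        lx.id for lx in lexemes
        if lx.id not in lemma_hits
        and any(rep.lower().startswith(p) for rep in lx.forms.values())
    ]
    return lemma_hits + alias_hits


def complete_forms(prefix: str, lexemes: list[Lexeme]) -> list[str]:
    """Form completion: only form representations are matched, lemmas are ignored."""
    p = prefix.lower()
    return [
        form_id
        for lx in lexemes
        for form_id, rep in lx.forms.items()
        if rep.lower().startswith(p)
    ]


lexemes = [
    Lexeme("L1", "go", {"L1-F1": "go", "L1-F2": "went"}),
    Lexeme("L2", "wend", {"L2-F1": "wend"}),
    Lexeme("L3", "child", {"L3-F1": "child", "L3-F2": "children"}),
]

by_prefix_we = complete_lexemes("we", lexemes)            # "wend" by lemma, "go" via "went"
by_children = complete_lexemes("children", lexemes)       # "child" found via its form
forms_for_children = complete_forms("children", lexemes)  # only the form itself
```

Under this reading, searching for "children" in Lexeme completion returns the lexeme "child", and Form completion returns the form "children"; the "nothing at all" option would correspond to dropping alias_hits.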

TBD: do we also index senses?

I think this one should be very similar to what we do for descriptions (multilingual and tokenized)? DCausse (WMF) (talk) 14:16, 12 February 2018 (UTC)

Yes, senses need to be indexed, but prefix/completion search for them works differently (for use in a selector when editing a reference to a sense): the completion index would be based only on the lexeme's lemma (that is, all senses of a lexeme are always found/suggested together), and the sense's "gloss" acts as a "description" (highlight). -- 46.183.103.8 17:05, 13 February 2018 (UTC)
So I get from this that senses (or, more precisely, the glosses of senses) are pretty much like descriptions in Wikidata items, but there could be more than one of them. I wonder, though, whether we need to display them in search results, and how, especially if there's more than one. Smalyshev (WMF) (talk) 22:07, 14 February 2018 (UTC)
I'm not sure I understand how sense completion would work. If there's a lemma with 20 senses, would they always be displayed together? How do you choose one of them? What if the results are limited to just 5? Smalyshev (WMF) (talk) 00:01, 15 February 2018 (UTC)
For me, senses would only be used for fulltext searches, as a tokenized multilingual field (very similar to what we do for descriptions). For completion, a sense will eventually be displayed, but how do we choose which one? As Stas suggested, displaying all of them does not sound possible. For descriptions we do this on a best-effort basis, using the language as a selection criterion (IIRC). Here we have multiple senses with multilingual labels. DCausse (WMF) (talk) 10:40, 15 February 2018 (UTC)
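A best-effort selection of a single gloss by language, similar in spirit to what is described for descriptions, could look like the sketch below. The data shape, the pick_gloss name, the fallback chain, and the sample senses are assumptions for illustration. It does not answer the open question of which sense to prefer when several have a gloss in the same language; it simply takes the first one.

```python
from typing import Optional


def pick_gloss(
    senses: list[dict], fallback_chain: list[str]
) -> Optional[tuple[str, str, str]]:
    """Return (sense ID, language, gloss) for the first language in the chain
    that has a gloss on any sense, or None if no language matches."""
    for lang in fallback_chain:
        for sense in senses:
            gloss = sense["glosses"].get(lang)
            if gloss:
                return sense["id"], lang, gloss
    return None


# Hypothetical senses for one lexeme.
senses = [
    {"id": "L1-S1", "glosses": {"en": "ladder"}},
    {"id": "L1-S2", "glosses": {"de": "Anführer", "en": "leader"}},
]

german_user = pick_gloss(senses, ["de", "en"])
french_user = pick_gloss(senses, ["fr", "en"])
no_match = pick_gloss(senses, ["fr"])
```

A German-language user gets the gloss of the second sense because it is the only one with a German gloss, while a French-language user falls back to the first English gloss found.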

How do we rank the search results? We do not have sitelinks; should we factor in incoming links and statement counts somehow? Any other criteria? Whatever we use must be stored in some field on the document.

It's probably too early to say; user language and language fallbacks will probably play some role. DCausse (WMF) (talk) 14:16, 12 February 2018 (UTC)

How to switch between query & ranking behaviors (itemsearch/lexemesearch/pagesearch)?

The current strategy for switching between ranking behaviors is to decide based on:

  • entry point (custom wbsearchentities API vs prefixsearch/opensearch) for prefix search
  • selected namespaces and the use of complex syntax for fulltext searches

We are going to add a third method to query and rank results. How will it be integrated into all of this? DCausse (WMF) (talk) 14:33, 12 February 2018 (UTC)
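To make the question concrete, the two criteria listed above can be written as a single dispatch function, with one hypothetical way of slotting in a lexeme profile. Everything here is an assumption for illustration: the profile names, the entity_type argument, and the namespace IDs are not taken from this discussion, and this is not how the existing code is structured.

```python
# Assumed namespace IDs, for illustration only.
ITEM_NAMESPACES = frozenset({0, 120})
LEXEME_NAMESPACES = frozenset({146})


def choose_profile(
    entry_point: str,
    namespaces: frozenset = frozenset(),
    complex_syntax: bool = False,
    entity_type: str = "item",
) -> str:
    """Pick a (hypothetical) query/ranking profile name for a search request."""
    # Prefix search: decided by the entry point.
    if entry_point == "wbsearchentities":
        return "lexeme_prefix" if entity_type in {"lexeme", "form"} else "item_prefix"
    if entry_point in {"prefixsearch", "opensearch"}:
        return "page_prefix"
    if entry_point != "fulltext":
        raise ValueError(f"unknown entry point: {entry_point}")
    # Fulltext search: decided by selected namespaces and use of complex syntax.
    selected = frozenset(namespaces)
    if complex_syntax or not selected:
        return "page_fulltext"
    if selected <= LEXEME_NAMESPACES:
        return "lexeme_fulltext"
    if selected <= ITEM_NAMESPACES:
        return "item_fulltext"
    return "page_fulltext"  # mixed namespaces fall back to generic page search


profile_a = choose_profile("wbsearchentities", entity_type="lexeme")
profile_b = choose_profile("fulltext", namespaces=frozenset({146}))
profile_c = choose_profile("fulltext", namespaces=frozenset({0, 146}))
```

The mixed-namespace case (profile_c) shows where the open question bites: once three behaviors exist, a request spanning items and lexemes has no obvious single profile.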

Should we split the discussions into multiple pages or sections based on the use cases (fulltext/completion/senses)?

It's sometimes unclear in the discussions here whether we are talking about fulltext or completion searches. Since these are two completely different search techniques, I think it will help to split the discussions by use case to avoid confusion. DCausse (WMF) (talk) 10:57, 15 February 2018 (UTC)