User:Smalyshev (WMF)/Wikidata search

This page documents the design for improved Wikidata search using the ElasticSearch backend. It is a work in progress and can change at any time.

Prefix search

Matching the pages by typing part of the string identifying them (not necessarily the title). See below for details.

API

API:Opensearch, API:Prefixsearch

The API:Opensearch implementation is not very friendly to Wikidata data: it basically ignores everything but titles, so Wikidata labels/descriptions get filtered out. In theory we could have a separate implementation, but so far we have had no use case for it, so right now we are not implementing support for API:Opensearch and API:Prefixsearch for item namespaces.

wbsearchentities

This API will keep its current functionality, but will have an ElasticSearch implementation.

  • only searches item namespaces
  • The current DB implementation stays as an alternative (switched by an option); both implement a common API (EntitySearchHelper)
  • The DB search extracts the last ASCII word from the search term for ID matching, supporting input such as (Q24). We probably won't support anything but basic whitespace trimming in the Elastic implementation.
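The difference between the two ID-matching behaviours can be sketched as follows. This is only an illustration, not the Wikibase code: the function names, the regular expressions and the assumed ID shape (one letter followed by digits) are assumptions made for the example.

```python
import re
from typing import Optional

# Assumption: an entity ID is one ASCII letter followed by digits (Q24, P31).
ENTITY_ID = re.compile(r"[A-Za-z][1-9][0-9]*")


def id_from_db_style(term: str) -> Optional[str]:
    """Take the last run of ASCII letters/digits, so '(Q24)' yields 'Q24'."""
    words = re.findall(r"[A-Za-z0-9]+", term)
    if words and ENTITY_ID.fullmatch(words[-1]):
        return words[-1].upper()
    return None


def id_from_elastic_style(term: str) -> Optional[str]:
    """Only trim surrounding whitespace before testing for an ID."""
    word = term.strip()
    if ENTITY_ID.fullmatch(word):
        return word.upper()
    return None
```

With these definitions, "(Q24)" and "Berlin (Q64)" resolve to an ID only in the DB-style variant, while " q24 " resolves to Q24 in both.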

GUI

Prefix search will be used in the GUI in two places: the quick search box and item suggestion widgets. Item suggestion widgets will use wbsearchentities.

The quick search box will do two queries: one to API:Opensearch and one to wbsearchentities. If both return results, the article space results are displayed first. The article space search only happens if the query is not detected to have an item-space prefix (e.g. Property:).

This will be implemented by extending the standard search box implementation so that it allows result supplementing/substitution. The extension function would return (a promise for) a search result, either as HTML or in a well-known structure, which will be shown in a separate section, visually divided from the other search results. Wikibase could just plug in wbsearchentities there. The advantage is that this mechanism will also work without Cirrus; a hard dependency on Cirrus would be bad for 3rd parties. The proposed extension point would be better than replacing the search box with an entity suggester (which is a hack, and does not find titles).
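The quick search box behaviour described above can be sketched roughly as follows. The real implementation would live in the frontend search box code; this Python version only illustrates the control flow. The function name, the prefix list and the two search callables are hypothetical stand-ins for API:Opensearch and wbsearchentities.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical list of item-space prefixes; the real set would come from
# the wiki's namespace configuration.
ITEM_PREFIXES = ("Property:", "Lexeme:")

SearchFn = Callable[[str], List[str]]


def quick_search(
    query: str,
    search_articles: SearchFn,   # stand-in for API:Opensearch
    search_entities: SearchFn,   # stand-in for wbsearchentities
    item_prefixes: Sequence[str] = ITEM_PREFIXES,
) -> List[Tuple[str, List[str]]]:
    """Return result sections in display order, article results first."""
    lowered = query.lower()
    has_item_prefix = any(lowered.startswith(p.lower()) for p in item_prefixes)

    sections: List[Tuple[str, List[str]]] = []
    if not has_item_prefix:
        # The article space search is skipped for item-space queries.
        articles = search_articles(query)
        if articles:
            sections.append(("articles", articles))
    entities = search_entities(query)
    if entities:
        sections.append(("entities", entities))
    return sections
```

Each section is kept separate rather than interleaved, matching the idea of showing the supplementary results in their own visually divided block.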

For item namespace search, the search would cover a set of default namespaces, plus the current namespace if it is an item namespace and is not already in the default set.
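A minimal sketch of that namespace rule, assuming namespaces are identified by integer IDs. The function name and the numbers used in the usage note are illustrative only.

```python
from typing import Collection, List, Sequence


def namespaces_to_search(
    default_namespaces: Sequence[int],
    current_namespace: int,
    item_namespaces: Collection[int],
) -> List[int]:
    """Default set, plus the current namespace if it is an item namespace."""
    result = list(default_namespaces)
    if current_namespace in item_namespaces and current_namespace not in result:
        result.append(current_namespace)
    return result
```

For example, with a default set of [0] and item namespaces {0, 120, 146}, a user currently on a page in namespace 120 would search [0, 120], while a user in a non-item namespace such as 4 would search only [0].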

Internals

  • Entity search requires its own search queries. It would match entities both by label (see below) and by ID (Q123), i.e. by title.
  • We probably need an internal API that produces a list of results that can be returned both for wbsearchentities and for completionSearch (may not be necessary initially)
  • We should be able to handle any entity types that provide labels/aliases/descriptions. It is currently assumed that all entity types will do so by implementing the appropriate interfaces/hooks.
    • TBD: We may need a better term than "labels" for the field that holds search terms. We will use the labels/aliases terminology for now, until we find a better name.
  • We also support descriptions here but only for storing/fetching, not indexing/matching.

Fulltext search

The main challenge in Wikidata is that we are dealing with substantially different content models: articles, Items and Lexemes organize their data in different ways and should be searched using different specialized queries. (Here and below, "Items" includes Properties: although formally a different type, they are similar enough to Items for search to ignore the difference.) This is currently unique to Wikidata, but SDC might eventually have to deal with the same challenge. This raises the following open questions:

  1. TBD: When we search on Special:Search, does the user expect us to search only Item space, Item + Lexeme spaces, or other spaces too?
  2. TBD: What do we do if the user tells us to search all namespaces? How useful is it to really search all namespaces, i.e. pages of all existing content models at once? What would be the use case for such a search, and does the user really mean all namespaces, or is "all" just a shortcut for some collection of namespaces that is hard to express otherwise?

Currently, we are using a hack that makes any search mentioning an Item namespace use only item search, but I feel this is not the right approach.

Problems

Searching more than one content type is tricky, because:

  1. We need different queries, which are generated by different code in different extensions (this is the easiest problem)
  2. We also need different highlighting configurations since different models use different fields. This is harder since highlighting queries are not as easy to combine as the filtering parts of the query. We can refactor the result type code so that extensions can inject the fields they need to process matches, but a highlighting query that works for all content models at once is very hard to get right.
  3. We will somehow need to combine rankings for various content models, and we have no idea how to do it - different models use different ranking systems, which are not directly comparable.

Options

We have the following options to achieve cross-content-model search:

One query to rule them all

Try to somehow run all specialized content model queries combined in one single query (we do not know how yet).

Pro: We will have a solution for ranking (since there's one query, there would be common ranking) and simpler code once we figure out how to make this query.

Con: We have no idea how to make this query, especially the highlighting/ranking part. Making it pluggable for extensions may be even harder. We can do a boolean combination for filters, but the highlighting and ranking parts are completely unclear.

This "solution" is provided to show we have considered this option, but we don't currently have any viable way to actually do that AFAIK.

The garden of forking queries

Run several specialized queries and then try to display the results together.

Pro: no complications with making the queries - we just run several builders, slap queries together and run them.

Con: Ranking is problematic, and pagination is even more problematic: with no common rank between queries, we cannot represent the result as one stream of results, and thus lose the ability to pass pagination work to Elastic. We could move pagination work into PHP code, but that means that in the worst case we'd have to retrieve 10K*(number of queries) documents, which is 30K for now. This is too many.

We could opt for displaying only part of the results (e.g. only the top N for each query type) and then paginate each type separately, but the UI feels awkward and would probably require a significant redesign of how Special:Search presents results. It is also not clear what to do about bots and API clients that would try to paginate the old way.

Another option for ranking is to use an empirical distribution function (EDF) to reduce each query's rank to a number in the [0,1] interval, and then deem these numbers comparable. For this, it would be necessary to collect the score distribution for each query type and set up an EDF for each, and probably update it each time any of the queries changes, which requires a lot of setup/pre-processing effort. We could probably use old query logs for creating the EDFs (though for Lexemes we have no such logs for now), but this still requires manual work before it is functional. Also, it is not clear how to make ElasticSearch paginate using a combined EDF-based ranking.
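To make the EDF idea concrete, here is a minimal sketch that is not tied to any existing CirrusSearch code. For each query type we keep a sorted sample of historical scores; the EDF value of a raw score is the fraction of sampled scores less than or equal to it. All class and function names, and the sample data in the usage note, are made up for illustration.

```python
from bisect import bisect_right
from typing import Dict, List, Sequence, Tuple


class EmpiricalDistribution:
    """EDF built from a sample of raw scores for one query type."""

    def __init__(self, sample_scores: Sequence[float]) -> None:
        if not sample_scores:
            raise ValueError("need at least one sample score")
        self._sorted = sorted(sample_scores)

    def __call__(self, score: float) -> float:
        # Fraction of sampled scores <= score, always within [0, 1].
        return bisect_right(self._sorted, score) / len(self._sorted)


def merge_by_edf(
    results_by_type: Dict[str, List[Tuple[str, float]]],
    edf_by_type: Dict[str, EmpiricalDistribution],
) -> List[Tuple[float, str, str]]:
    """Merge (doc_id, raw_score) lists into one list ordered by EDF value."""
    merged = []
    for query_type, results in results_by_type.items():
        edf = edf_by_type[query_type]
        for doc_id, raw_score in results:
            merged.append((edf(raw_score), doc_id, query_type))
    # Highest normalized score first; sort is stable for ties.
    merged.sort(key=lambda row: row[0], reverse=True)
    return merged
```

With an item sample of [1, 2, 3, 4] and an article sample of [10, 20, 30, 40, 50], an item with raw score 2 maps to 0.5 and an article with raw score 40 maps to 0.8, so the article would be ranked above that item. Note that this merge happens outside the search backend, so it does not by itself address the pagination problem.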

Best possible query

If we are searching in namespace(s) that represent a single content model (treating Items and Properties as the same content model), we run the specialized query for that model. Otherwise we run a generic query (basically the same as the article space one, but for all namespaces) and try to do our best with the results.
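The selection rule can be sketched as a small dispatch function. This is an illustration under stated assumptions, not the actual implementation: the query names ("entity_fulltext", "lexeme_fulltext", "generic_fulltext") are hypothetical labels, and the namespace-to-model mapping is passed in by the caller.

```python
from typing import Dict, Iterable

# Properties are treated as the same content model as Items.
SAME_MODEL = {"wikibase-property": "wikibase-item"}

# Hypothetical names for the specialized queries of each model.
SPECIALIZED_QUERY = {
    "wikibase-item": "entity_fulltext",
    "wikibase-lexeme": "lexeme_fulltext",
}

GENERIC_QUERY = "generic_fulltext"


def choose_query(namespaces: Iterable[int], model_of_namespace: Dict[int, str]) -> str:
    """Pick the specialized query if all namespaces share one content model."""
    models = set()
    for ns in namespaces:
        # Assumption: namespaces without an explicit model hold wikitext pages.
        model = model_of_namespace.get(ns, "wikitext")
        models.add(SAME_MODEL.get(model, model))
    if len(models) == 1:
        (model,) = models
        return SPECIALIZED_QUERY.get(model, GENERIC_QUERY)
    # Mixed models (or no namespaces given): fall back to the generic query.
    return GENERIC_QUERY
```

For instance, if namespaces 0 and 120 map to the item and property models and 146 maps to lexemes, searching [0, 120] selects the entity query, [146] selects the lexeme query, and [0, 146] falls back to the generic query.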

This option is the current favorite.

Pro: We don't have to invent mega-complex queries or even a mechanism to unite queries together: there's always only one (relatively) simple query. Content model handlers don't need to know about each other and can work in their own little namespaces. We do not have a ranking problem, as we always run only one query with its own ranking, whatever it is.

Con: We may lose highlighting in some results. Ranking and relevance in non-article namespaces may be lousy in combined searches. It may be confusing for users to have search magically switch between modes and produce wildly different rankings just because a namespace was added.

We still need to refactor result parsing/display so that results from different models can be displayed together. We also need to ensure that the text fields for non-article pages are at least good enough to provide a tolerable search.

Action plan so far