Wikidata:Requests for comment/Inverse property access for wikis : a lua API request for the development team

An editor has requested the community to provide input on "Inverse property access for wikis : a lua API request for the development team" via the Requests for comment (RFC) process. This is the discussion page regarding the issue.

If you have an opinion regarding this issue, feel free to comment below. Thank you!

We currently cannot access inverse property values from Wikipedia. This can become a data management issue, as we must always ask ourselves whether to introduce an inverse property for the cases where we need one. So I think it is useful to gather the use cases the community would want and to draft an API request to the development team.

API proposal request, draft

getStatementsReferencingAsMainValue(item, property) => [Statement list] 
-- gets all "property" statements that have "item" as main value. 

Of course with a limit on the number of results, as getting all "instance of: human" statements is unrealistic and mostly useless; this is not a Lua use case.
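As a rough sketch of how the proposed call might be used from a Lua infobox module (the function itself does not exist; mw.wikibase.getEntityIdForCurrentPage, mw.log and mw.dumpObject are existing Scribunto functions, and P629 "edition or translation of" is just an example property):

-- Sketch only: getStatementsReferencingAsMainValue is the proposed function, not an existing API.
-- On the article about a literary work, fetch the statements of other items whose
-- "edition or translation of" (P629) value points at this work.
local workId = mw.wikibase.getEntityIdForCurrentPage()
local statements = mw.wikibase.getStatementsReferencingAsMainValue(workId, 'P629')
for _, statement in ipairs(statements or {}) do
    mw.log(mw.dumpObject(statement))
end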

Optional requirements:

  • be able to filter statements according to the presence of some qualifier
  • be able to filter statements according to the presence of some qualifier/value pairs
  • be able to filter statements according to the presence of some qualifier/set of value pairs

This should cover most of the use cases for inverse properties. Optionally, and less useful, for cases like "union of / disjoint union of":

getStatementsReferencingAsQualifierValue(item, property, qualifier) => [Statement list]
-- gets all "property" statements, whatever the main value, that have a "qualifier" whose value is "item"
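Combined with the optional filters above, a hypothetical call could look like this (the function name, the option names and the property/qualifier IDs are all illustrative; none of this exists in mw.wikibase yet):

-- Sketch only: a proposed function with hypothetical option names.
-- Find statements of a property (hypothetical 'P1234') in which the current item appears
-- as the value of a qualifier (hypothetical 'P5678'), capped at 10 results.
local itemId = mw.wikibase.getEntityIdForCurrentPage()
local statements = mw.wikibase.getStatementsReferencingAsQualifierValue(itemId, 'P1234', 'P5678', {
    limit = 10  -- an optional hard cap, in the spirit of the limit mentioned above
})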

Phabricator tickets and prior discussion

Use cases

Please add cases where you would need this API on a client wiki.

Discussion

  • I think it's better to wait for the dev team to ask us for input when they have time. I don't want to discuss a new API if it won't be implemented in 5 years or more. Now even the "mul" language code development looks halted, and tiny improvements like REST API can take one year or more. Midleading (talk) 13:31, 13 January 2024 (UTC)
    It's never too soon to discuss a topic that has been around forever. Community input will not be lost anyway, plus it's an opportunity to reassess that the community is interested and to update the priority accordingly. If we don't talk about it they will assume it's low on our wishlist; that's how this works. author  TomT0m / talk page 16:36, 13 January 2024 (UTC)
    The REST API isn’t a „tiny“ improvement. Karl Oblique (talk) 16:49, 13 January 2024 (UTC)
    Ok, but where is it in the community wishlist surveys? The added value of the REST API to the community is, I think, tiny. A simple wrapper around the MediaWiki API could realize its full functionality at a low cost. Anyway, we are not discussing the REST API here. I just want to say that the dev team is busy with their own priorities, like the REST API, so our discussions may have zero impact. Midleading (talk) 13:52, 14 January 2024 (UTC)
  • See also m:Community_Wishlist_Survey_2022/Wikidata/Accessing_items_with_particular_statements_via_Lua. Mahir256 (talk) 14:54, 13 January 2024 (UTC)
    Added to the Phabricator tickets, and populated use cases from it, thanks. author  TomT0m / talk page 16:56, 13 January 2024 (UTC)
  • Phabricator tickets added, above. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 15:23, 13 January 2024 (UTC)
  • A possibly stupid question that may open a can of worms somewhere: is there a reason not to use an existing query method, such as SPARQL, for access from the wikis? It feels like an attempt to shield end-users from programming, but like all such attempts doesn’t really do that, while also slowly requiring expansion for endless corner-cases. Or does such a method already exist in parallel? Karl Oblique (talk) 16:56, 13 January 2024 (UTC)
    Some query usage is possible on clients, but currently only through special access paths such as the Graph extension (which is, or was, deactivated for security reasons). We cannot access the query service to generate infoboxes; data access has always been done through the Lua wikibase and template APIs. Wiki use cases like generating infoboxes generally do not need the full power of SPARQL, and I guess this allows fine-grained resource usage tracking: we know which page uses which items. Lua modules are programming modules, so except in the case of simple templates nothing is really shielded from Lua wiki coders, or even from template coders, so no, it's not really an attempt to shield anybody from programming, I think. The closest thing to that, imo, is the wikibase parser function access (and the wikibase template, which plays the same role); both can be used quite successfully to extend a classic infobox into a Wikidata one quite simply, actually, maybe surprisingly. You can do a lot with good primitives in simple cases with Wikidata. author  TomT0m / talk page 17:10, 13 January 2024 (UTC)
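    (For context, the existing forward-only access from a Lua module looks roughly like the following sketch; getEntityIdForCurrentPage and getBestStatements are existing mw.wikibase functions, and the properties are only examples.)
    -- Current state: to list editions from the article about a work, the work item must
    -- carry the inverse property "has edition or translation" (P747) explicitly.
    local workId = mw.wikibase.getEntityIdForCurrentPage()
    local editions = mw.wikibase.getBestStatements(workId, 'P747')
    for _, statement in ipairs(editions) do
        mw.log(mw.dumpObject(statement.mainsnak))
    end
    -- The proposed API would instead follow "edition or translation of" (P629) backwards,
    -- so the inverse property would no longer have to be maintained by hand.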
  • There are things that are designed well in this fashion and others that aren't. In this case, I think it's questionable. An API should likely have ways to limit the number of results and/or sort them. Whether you can do both might depend on technical details. ChristianKl 12:53, 14 January 2024 (UTC)
    Sorting is trivial to do afterwards in Lua. The crucial part is data access, which is currently what is blocking. author  TomT0m / talk page 15:43, 14 January 2024 (UTC)
    And I don't think this API should be designed to handle cases where there are very many references, like "who are the instances of human". This should not be part of the requirements. author  TomT0m / talk page 15:47, 14 January 2024 (UTC)
    Part of software design is to design an API to handle all the cases that exist in real-world data.
    Having a default limit on how many values can be returned is one way to do so that saves resources.
    While sorting is easy to do in Lua, having functions that don't return deterministic results can be problematic. If the function does sorting anyway to get deterministic results, it would make sense to be able to pass it how you want the data sorted. ChristianKl 10:04, 16 January 2024 (UTC)
    @ChristianKl Arguably those use cases are not what we are interested in if we are interested in Lua clients and infoboxes only, which could make sense. It could make sense, in the article about humans or cats, to get a sample of instances. But in the human case, whatever criterion you could come up with, it makes no sense to sort everything, and such a sample would be totally arbitrary; better to make a human choice if we really want to show one.
    I think in infoboxes we are really interested in the cases where the number is very small and the data is complete, so an arbitrary sample is not useful. If the list is open, we should in general handle this with queries or categories, not with infoboxes or Lua.
    And really, the main use case is infoboxes, arguably? author  TomT0m / talk page 11:03, 16 January 2024 (UTC)
    To me it feels like your sentiment just shows that it's good to let these decisions be made by the programmers.
    You seem to assume that the people who use an infobox know how many entries there might be. If the infobox programmer assumes that there is a maximum of ten entries, it's good to pass that as a limit into the function. ChristianKl 12:45, 16 January 2024 (UTC)
    @ChristianKl I am trying to motivate to the dev team the need for this feature, i.e. why we would need something like it. So it's better to think of the use cases beforehand. And as discussed elsewhere on this page, retrieving such information might be resource-constrained, or might need deep changes to the Wikidata codebase that could be out of the question right now with the current development resources.
    I think it's not worth designing this feature to handle more than a few loads. There are already hard limits for these: a wiki page cannot load a lot of items, use a high number of template expansions, or make a high number of expensive calls. With that in mind, I think we should not assume that the programmer will think they can use this feature to handle cases where the number of inverse statements is very high, nor that it is some kind of search engine that will find the most relevant results. Maybe, if the community has a strong opinion on what is relevant, we can use ranks to sort things out (which makes me think we should think about rank filtering).
    The use cases for this feature are really, I think, things like finding the editions of a work, not finding all the instances of human. author  TomT0m / talk page 19:29, 16 January 2024 (UTC)
    Just because someone who creates an infobox thinks that in all cases there is a certain limit on the number of items doesn't mean that there aren't cases with a lot more items than expected. An infobox developer might expect that no book has a hundred editions, but that doesn't mean that no counterexample exists. ChristianKl 14:28, 18 January 2024 (UTC)
    @ChristianKl Yes, but I think the solution could be something like "giving up trying to put that in the Wikipedia page" and providing a link to something like Scholia, a query, or SQID (example with the Bible and a list of editions), or something like that. Because, say, in the case of the Bible, there are many editions and many languages. The API would then have to be powerful enough to let us filter by publication date, language, … This is a rabbit hole, and not a basic feature like what we have now: accessing the best statements of some property, or the whole item, and doing the filtering in client code.
    The API could say "many results" in that case, and let the infobox provide a link to a tool that handles those many results, using a more powerful engine like SPARQL or something. author  TomT0m / talk page 14:56, 18 January 2024 (UTC)
    That solution could only be used if the person who develops the template knows about the exception.
    If the template writer just says "LIMIT 10", the template wouldn't fully break for that item. ChristianKl 17:32, 18 January 2024 (UTC)
    @ChristianKl In infoboxes we are generally interested in either the full list of values or a carefully selected sample of values. The carefully selected sample can be taken care of either by manually entering values in the infobox on the wiki or by preferred values/qualifiers, but "limit 10" is not really one of them. Except if you allow for "limit 10 of the most recent", if that is what the community wants, but that kind of needs loading every value, one way or another, which we might not want. author  TomT0m / talk page 18:42, 18 January 2024 (UTC)
    (Except of course if the software that sorts is able to do some smart "top-n" algorithm, or to maintain the list and update it on relevant item modifications, but this entails complexity in selecting the criteria.) author  TomT0m / talk page 18:46, 18 January 2024 (UTC)
  • We need to think about how this can be done and what can be done. Currently (without inverse queries) wiki page rendering only needs information from the SQL database. Are we going to require every Wikibase installation to also install the deprecated Blazegraph? And what about caching and purging? Midleading (talk) 17:05, 14 January 2024 (UTC)
    There is at least the haswbstatement search feature in CirrusSearch, which already indexes the statements. author  TomT0m / talk page 18:01, 14 January 2024 (UTC)
    I guess caching is not really different from what is done now in the EntityUsage mechanism: WikibaseClientEntityUsage already exists.
    There could also be some kind of "query" table which, in the same spirit, stores which page queries which property, to track its usage. When a statement is added or modified, this usage table could be checked to see whether the change adds a potential usage. This would work, I think, if we restrict the reverse lookup to the item of the page only, with no further or recursive lookup. That would be OK for hierarchical properties, but less so for equivalence classes (like getting the brother/sister chain if they are all linked but only indirectly, say always from the older to the younger and not the other way round). author  TomT0m / talk page 18:52, 14 January 2024 (UTC)
    Adding a dependency on Elasticsearch seems easy, but it does not support usage tracking, and it can break. When Elasticsearch is down, as in wikitech:Incidents/2023-06-18 search broken on wikidata and commons, the operations team would have to purge all Wikimedia articles which use this functionality to recover from the incident, which I think is too costly.
    The next idea is to have a SQL table which basically stores "subject", "property" and "value" with appropriate indexes. In this way we would be implementing some kind of graph query language on top of the SQL database. This can theoretically work. The problem is the increased database load. Besides database access, database size might also be a concern; the development team is actively trying to keep the database as small as possible. I'm not sure they have the funding and resources to do this. Midleading (talk) 04:18, 15 January 2024 (UTC)
    Currently we only need to add a few has part(s) (P527) statements. If inverse lookup is supported, then how do we prevent the database from being filled up by "instances of human"? Midleading (talk) 04:30, 15 January 2024 (UTC)
    A few things:
    • We definitely don't want high numbers of statements in infoboxes, so there should be low limits on the number of loadable statements. If the number is too high, the Lua function should return no results and an error, or something like that. This kind of limit already exists, as there is a limit on the number of entities a page can load, so the dev team is not afraid to set limits, hard ones if necessary.
    • We definitely don't need all triples. I think we only need "snaks" (property/value pairs).
      • When a page connected to item1 makes such a Lua call for property1, we put the property1/item1 snak in a SQL table, call it the inverse property dependency table. We call Elasticsearch to populate the matching statements as usage tracking; ideally we do that only once. We get a list of dependency items for item1. This can be handled, I think, by the already existing "usage tracking" code in the codebase, or something similar. If the list cardinality is too high, we store only its size.
      • Updates: when a statement on any item is added, modified or deleted, we check whether it matches a snak in the inverse property dependency table. If it matches a low-cardinality one, we add it to (respectively delete it from) the usage tracking for the corresponding items. If it matches a high-cardinality one, we could update a change counter or something like that. If the change counter becomes of the same order as the cardinality, we might want to call Elasticsearch again to update the cardinality, as the changes might all be deletions and the cardinality might drop low enough to be under the limit. (A rough sketch of this check is given after this comment.)
    In that setup, if we store the dates of the Elasticsearch calls, we might not need to purge everything in case of an outage, but only the things that happened while we think Elasticsearch was inconsistent, I guess.
    But unless you are actually on the dev team, although I enjoy thinking about it, this discussion is somewhat pointless and we should let them worry about what is too expensive and how to deal with it, don't you think? Of course they would be perfectly legitimate in answering something along those lines, but it's for them to judge...
    When it's done, these dependencies can be updated when someone edits the relevant statement on Wikidata. If it's deleted and no other statement in the item has the item value, delete the usage tracking dependency. When adding a statement on an item author  TomT0m / talk page 09:49, 15 January 2024 (UTC)
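    (A purely illustrative, standalone Lua sketch of the dependency-check idea outlined in the bullets above; a real implementation would live in Wikibase's server-side change-dispatching code, and every name, ID and number below is hypothetical.)
    -- Hypothetical table recording which pages asked for inverse lookups of which
    -- property/value snak, plus the result cardinality seen at query time.
    local inverseDependencies = {
        { property = 'P629', value = 'Q111', pages = { 'Some popular work' }, cardinality = 2000 },
        { property = 'P629', value = 'Q222', pages = { 'Some obscure work' }, cardinality = 3 },
    }

    local CARDINALITY_LIMIT = 50

    -- Called (conceptually) whenever a statement is added (delta = 1) or removed (delta = -1).
    local function onStatementChange(property, value, delta)
        for _, dep in ipairs(inverseDependencies) do
            if dep.property == property and dep.value == value then
                dep.cardinality = dep.cardinality + delta
                if dep.cardinality <= CARDINALITY_LIMIT then
                    -- Low-cardinality case: re-render the subscribed pages.
                    for _, page in ipairs(dep.pages) do
                        print('purge ' .. page)
                    end
                end
                -- High-cardinality case: only the counter changes; a full re-count via the
                -- search index would happen later if the counter drops near the limit.
            end
        end
    end

    onStatementChange('P629', 'Q222', 1)  -- a new edition of the obscure work was added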