Wikidata talk:WikiProject Categories

WikiProject Categories
WikiProject to solve any issues regarding categories.

Stubs, templates and category combines topics

Hi everyone! So, on this first day of the WikiProject I want to start discussing an issue that creates a lot of constraint violations.

So, at the moment there are 16465 cases of instance of (P31)  Wikimedia category of stubs (Q24046192) + category combines topics (P971)  Wikipedia:Stub (Q4663261). However, 12511 of them are constraint violations because category combines topics (P971) has only that value, even if it should have multiple values; the remaining items have category combines topics (P971)  Wikipedia:Stub (Q4663261) + other category combines topics (P971) indicating stubs' topics.

  1. My proposal is substituting category combines topics (P971)  Wikipedia:Stub (Q4663261) with category contains (P4224)  Wikipedia:Stub (Q4663261) and substituting the remaining category combines topics (P971) with main subject (P921) used as qualifier of category contains (P4224)  Wikipedia:Stub (Q4663261).
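A count like the one quoted above could be reproduced with a query along these lines. This is only a sketch: it assumes the `wd:` and `wdt:` prefixes predefined on the Wikidata Query Service, it only sees best-rank ("truthy") statements, and the numbers it returns today will differ from the July 2018 figures.

```sparql
# Stub categories whose only category combines topics (P971) value
# is Wikipedia:Stub (Q4663261), i.e. the single-value constraint violations.
SELECT (COUNT(DISTINCT ?cat) AS ?violations) WHERE {
  ?cat wdt:P31 wd:Q24046192 ;      # instance of: Wikimedia category of stubs
       wdt:P971 wd:Q4663261 .      # category combines topics: Wikipedia:Stub
  FILTER NOT EXISTS {
    ?cat wdt:P971 ?other .         # any second P971 value ...
    FILTER (?other != wd:Q4663261) # ... that is not the stub item itself
  }
}
```

Dropping the `FILTER NOT EXISTS` block would give the total number of such items instead.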

Same problem regarding 216 cases of instance of (P31)  Wikimedia templates category (Q23894233) + category combines topics (P971)  Wikimedia template (Q11266439) (56 of which are constraint violations of category combines topics (P971)) and 34442 cases of instance of (P31)  Wikimedia category (Q4167836) + category combines topics (P971)  Wikimedia template (Q11266439) (34162 of which are constraint violations of category combines topics (P971)).

  1. My proposal is substituting instance of (P31)  Wikimedia category (Q4167836) with instance of (P31)  Wikimedia templates category (Q23894233), substituting category combines topics (P971)  Wikimedia template (Q11266439) with category contains (P4224)  Wikimedia template (Q11266439) and substituting the remaining category combines topics (P971) with main subject (P921) used as qualifier of category contains (P4224)  Wikimedia template (Q11266439).
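If the proposed model were adopted, the topics of a templates category would sit in a qualifier rather than in extra P971 values. A minimal sketch of how they could then be read, assuming the standard `p:`/`ps:`/`pq:` statement prefixes of the Wikidata Query Service (the `LIMIT` is an arbitrary choice to keep the result small):

```sparql
# Categories that contain Wikimedia templates, with the optional
# main subject (P921) qualifier proposed above.
SELECT ?cat ?topic WHERE {
  ?cat p:P4224 ?statement .              # category contains (P4224) statement node
  ?statement ps:P4224 wd:Q11266439 .     # value: Wikimedia template
  OPTIONAL { ?statement pq:P921 ?topic } # qualifier: main subject
}
LIMIT 100
```

Replacing `wd:Q11266439` with `wd:Q4663261` would give the equivalent query for stub categories.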

Thank you for your interest in categories! Bye, --Epìdosis 13:04, 8 July 2018 (UTC)


Notified participants of WikiProject Categories Please give your opinion! --Epìdosis 12:29, 11 July 2018 (UTC)

Metacategory vs. Category

We previously had a short chat about that with @Jura1:, but I believe now we can involve a few other interested participants here. It would be very convenient to somehow "mark" metacategories (categories that are intended to include only other categories). Although I understand that we don't want to "mirror the structure of Wikipedia categories with P31/P279 on Wikidata", I see nothing particularly wrong with marking such items as P31:Q30432511 (instead of P31:Q4167836) --Ghuron (talk) 12:26, 11 July 2018 (UTC)

I support instance of (P31)  metacategory in Wikimedia projects (Q30432511) as I support instance of (P31)  Wikimedia templates category (Q23894233) because these categories contain not articles, but respectively other categories and templates. --Epìdosis 12:33, 11 July 2018 (UTC)
  • How would we know what Wikipedia/etc. users put in there? Would we have to monitor them? Redoing P31 periodically based on people's usage? Why complicate things for users who want to check a structure by splitting the topic among several properties? The point of a flat P971 is that it's flat ;)
    --- Jura 12:39, 11 July 2018 (UTC)
    • @Jura1: there is nothing wrong with a bot assigning instance of (P31)  Wikimedia category (Q4167836) to all new categories (including metacategories). But if someone notices that a particular category is indeed a metacategory, she can narrow down the P31 value. I have nothing against P971, and I'm using it extensively. The problem is that there is no universal way to express a "this is a metacategory" statement via P971. Instead I can say category combines topics (P971)  by country (Q19360703) or category combines topics (P971)  by city (Q18683478) or category combines topics (P971)  by genre (Q42903116). And if I want to get "all non-meta-categories", I will have to FILTER NOT EXISTS {?cat wdt:P971/wdt:P31 wd:Q24571886}? It's gonna be slow :( --Ghuron (talk) 13:05, 11 July 2018 (UTC)
      • The problem is that people may not want any category items, so just excluding any P31 with a given value is way easier.
        It's still not clear how you decide which categories are ones that only contain categories.
        --- Jura 13:12, 11 July 2018 (UTC)
        • Well, judging mostly from the title (e.g. I assume that Category:Argentine women by occupation (Q8262772) is intended to contain other categories). More formally, there is Q4048796, which is used quite extensively, at least in en-wiki
          PREFIX mw: <>
          SELECT ?item ?page WHERE {
            hint:Query hint:optimizer "None" .
            SERVICE <> {
              ?page mw:inCategory <> .
            }
            ?page schema:about ?item .
            ?item wdt:P31 wd:Q4167836 .
          }
          --Ghuron (talk) 13:29, 11 July 2018 (UTC)
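The P971-based exclusion mentioned in the thread above can be sketched as a complete query. It assumes the prefixes predefined on the Wikidata Query Service; the `LIMIT` is added here only so that the sketch has a chance of finishing, and no timing is claimed for it.

```sparql
# Category items that are NOT marked as "X by Y" categories through a
# P971 value belonging to meta category criterion (Q24571886).
SELECT ?cat WHERE {
  ?cat wdt:P31 wd:Q4167836 .   # instance of: Wikimedia category
  FILTER NOT EXISTS {
    ?cat wdt:P971/wdt:P31 wd:Q24571886 .
  }
}
LIMIT 100
```

Under the P31-based alternative, metacategories would carry Q30432511 instead of Q4167836, so the first triple pattern alone would already leave them out and the `FILTER NOT EXISTS` block would not be needed.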
What we need perhaps is to focus more narrowly on categories of the type X by Y, the topic of recent discussion at Wikidata:Project_chat#Category:A_[split_up_by]_B.
For the record, I also dislike the idea of introducing lots of subclasses of Wikimedia category (Q4167836). I would prefer to indicate contents or attributes of the category by using additional statements. Jheald (talk) 17:21, 11 July 2018 (UTC)
One idea might be to identify such classes using something like
<category item> category combines topics (P971) "X"
<category item> category combines topics (P971) "Y" object has role (P3831) "Partition class"
Category items with statements of this pattern would be fairly easy to require or to exclude in a query. Jheald (talk) 17:37, 11 July 2018 (UTC)
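A sketch of how the qualifier pattern above could be excluded in a query. No item for "Partition class" is named in this discussion, so the sketch tests for any object has role (P3831) qualifier on a P971 statement; that simplification, the `LIMIT`, and the use of the Wikidata Query Service's predefined prefixes are all assumptions.

```sparql
# Categories with no P971 statement that carries an
# object has role (P3831) qualifier.
SELECT ?cat WHERE {
  ?cat wdt:P31 wd:Q4167836 .
  FILTER NOT EXISTS {
    ?cat p:P971 ?statement .     # any category combines topics statement
    ?statement pq:P3831 ?role .  # qualified with object has role
  }
}
LIMIT 100
```

Changing `FILTER NOT EXISTS` to `FILTER EXISTS` would instead require the pattern, returning only the partitioned categories.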
As an exploration of what we have to deal with, here's a quick query looking at 30,000 categories with en-labels of this form, to see the sort of partitioning classes that may be most relevant. "By country" (4498), "by year" (807), "by nationality" (771), "by state" (755) lead the list. Jheald (talk) 18:01, 11 July 2018 (UTC)
For what it's worth, here are the current uses of category combines topics (P971) on such categories:
Jheald (talk) 18:28, 11 July 2018 (UTC)
The "by ..." items are all members of meta category criterion (Q24571886), created by User:Shinnin. According to Reasonator, there are 38 items currently in this class. It's an interesting approach, but I don't think it scales well -- I think the qualifier I have suggested above would be a more general approach. On the other hand, perhaps it doesn't need to scale very far -- the number of partitioning classes we need to cope with is fairly finite, at least judging by the query above.
Pinging @Shinnin: Are there any particular advantages of your model that you would like to bring into the discussion? Jheald (talk) 18:40, 11 July 2018 (UTC)
@Jheald: I didn't create meta category criterion (Q24571886). "By <something>" type of items existed in Wikidata before I started editing here. I think by year (Q29053180) is my creation, the rest are not. I've used these types of items mainly because they seemed to be the de facto way of modeling these types of categories.
I do think that after category contains (P4224) was created, many of the current use cases could have been changed to use it instead of P971. E.g. Category:Albums by year (Q6695739) category contains (P4224)  album (Q482994) / grouped by year (Q577). However, this approach would only work for categories that group articles based on their type, not the ones that group them based on a common topic (e.g. Category:Geography by country (Q6491485)). All in all, I'm not an advocate of the current system. --Shinnin (talk) 20:10, 11 July 2018 (UTC)
I can see your point now and I believe I need to clarify what I'm trying to do. I want to exclude categories similar to X by Y from the scope of my queries. Although I believe that setting instance of (P31)  metacategory in Wikimedia projects (Q30432511) for them is the most efficient way to achieve that, I'm not against any other way of labeling them that would fulfill my needs. We've been discussing the idea of using P971 in the thread above and I still do not see an efficient way to exclude them (compare this to this) --Ghuron (talk) 18:45, 11 July 2018 (UTC)
@Ghuron: I'm not sure that I would read too much into that comparison, unless that count is literally all you want to do, in which case you can calculate a count excluding a particular subset efficiently simply by subtraction.
The first query is fast because the query engine never has to materialise the items, it can just count the difference between two index positions.
As soon as you are wanting anything more concrete (typically involving a more restricted solution set), the difference in time between the two queries would be a lot smaller. Jheald (talk) 20:49, 11 July 2018 (UTC)
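The count-by-subtraction idea can be sketched as a single query with two subqueries. This is an illustration only, not one of the linked queries: it uses the meta category criterion (Q24571886) marking as the subset to subtract, assumes the Wikidata Query Service's predefined prefixes, and whether it completes within the timeout has not been verified.

```sparql
# Total category items, minus those marked as "X by Y" via P971.
SELECT ?total ?meta (?total - ?meta AS ?nonMeta) WHERE {
  { # plain count over one triple pattern
    SELECT (COUNT(*) AS ?total) WHERE {
      ?cat wdt:P31 wd:Q4167836 .
    }
  }
  { # the (much smaller) subset to subtract
    SELECT (COUNT(DISTINCT ?cat) AS ?meta) WHERE {
      ?cat wdt:P971/wdt:P31 wd:Q24571886 ;
           wdt:P31 wd:Q4167836 .
    }
  }
}
```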
Also worth noting that this query is not particularly happy either. Jheald (talk) 20:56, 11 July 2018 (UTC)
Fair point about count, let's assume I want to get labels for all categories w/o P4224 (for a machine learning experiment). No single query can return 4M records in 60 seconds, so I'm using LIMIT/OFFSET. Let's see how well each of the discussed schemas fits here:
One might argue that my task is rare, but even if we accept this, I still fail to see any reasons against using instance of (P31)  metacategory in Wikimedia projects (Q30432511) except purely aesthetic ones (which are very subjective) --Ghuron (talk) 11:45, 12 July 2018 (UTC)
@Ghuron: It's not a great surprise that putting a LIMIT 50000 on a COUNT query with a one-line answer fails to be particularly effective  :-)
Starting with this (or its OPTIONAL { ... } FILTER (!bound(...)) alternative) might be more interesting. Jheald (talk) 13:03, 12 July 2018 (UTC)
@Jheald: yep, too many query windows open, but I'm still getting a timeout on OFFSET 1000000 LIMIT 1. Couldn't figure out how to use OPTIONAL { ... } FILTER (!bound(...)) here because (unlike P4224) we expect several values on P971, and having a P3831 qualifier on ANY of them should eliminate the category from the output --Ghuron (talk) 13:09, 12 July 2018 (UTC)
@Ghuron: I would do it this way: generate a controlled number of categories first, then start applying tests to them. Jheald (talk) 13:26, 12 July 2018 (UTC)
On a tranche of 500,000 categories, the hash join to exclude p:P971/pq:P3831 is adding about 8 seconds (17 seconds vs 9 seconds). Jheald (talk) 13:31, 12 July 2018 (UTC)
This variant is a bit slower, at 29 seconds. Jheald (talk) 13:36, 12 July 2018 (UTC)
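The "generate a tranche first, then test it" approach might look roughly like the following. It is a sketch rather than the linked query: the tranche size comes from the comment above, the P4224 test comes from the task described earlier, and the Wikidata Query Service's predefined prefixes are assumed. Without an `ORDER BY`, successive `OFFSET` pages are not guaranteed to be stable.

```sparql
SELECT ?cat WHERE {
  { # step 1: a bounded tranche of category items
    SELECT ?cat WHERE {
      ?cat wdt:P31 wd:Q4167836 .
    }
    OFFSET 0 LIMIT 500000
  }
  # step 2: tests applied only to that tranche
  FILTER NOT EXISTS { ?cat p:P971/pq:P3831 ?role }  # no partition-role qualifier
  FILTER NOT EXISTS { ?cat wdt:P4224 ?contains }    # no category contains statement
}
```

Raising `OFFSET` in steps of the tranche size would walk through the rest of the categories one page at a time.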
@Jheald: as long as it fits the 60s timeout, a few seconds slower doesn't matter. Your approach works for both the P3831 qualifier and by country (Q19360703)/by city (Q18683478)/by genre (Q42903116)/etc (see [2]). But my understanding was that if something can be expressed without qualifiers, it should be expressed so. Shouldn't we use the children of meta category criterion (Q24571886)? --Ghuron (talk) 13:47, 12 July 2018 (UTC)
@Ghuron: I would delete that entire class, because I think it just causes difficulty and confusion -- starting with the name, which is deeply opaque. IMO, if something is being used as the partitioning class, it is better to use the regular class-item for that, with a qualifier to say that that is its role, rather than to expect people to create and consistently use parallel different items. Jheald (talk) 14:13, 12 July 2018 (UTC)
@Jheald, Jura1, Epìdosis: So apparently we have 3 or 4 competing proposals here (not counting what was discussed in Wikidata:Project_chat#Category:A_[split_up_by]_B). Since any of them will work for me, I don't really care which one will be selected, but I do want us to select one (so I can start using it). Please advise how we should proceed from here --Ghuron (talk) 17:12, 12 July 2018 (UTC)