Wikidata talk:Statistics/Wikipedia


nlwiki


Please include the Dutch Wikipedia; it has a lot of active users working on it. Sjoerd de Bruin (talk) 13:31, 10 June 2015 (UTC)

@Sjoerddebruin, Multichill: done. --Zolo (talk) 06:22, 12 June 2015 (UTC)
@Zolo: thanks! The taxon overload is clearly visible. I believe we have about 30.000 claimless items, so that means we have about 75.000 items that don't have a P31/P279 but do have some other sort of claim. That would be an interesting set to focus on. Multichill (talk) 16:18, 12 June 2015 (UTC)

Awesome!


Really, really nice work, User:Zolo! Thanks! --Atlasowa (talk) 10:24, 12 June 2015 (UTC)

Notes

Thanks. For urban areas, I was mostly looking for places that were likely to have many articles in many languages (in practice, that mostly means large, fairly ancient European cities). --Zolo (talk) 21:44, 15 June 2015 (UTC)

Moved to main


I have moved this page to Wikidata:Statistics/Wikipedia. Currently, statistical data are scattered across user pages, which makes them a bit difficult to find. I think they deserve a collaborative effort in the main, so I started a new Wikidata:Statistics page that is supposed to look a bit like a statistics homepage. --Zolo (talk) 22:03, 15 June 2015 (UTC)

Sitecontent

(71,611,020)
  •   human: 6,376,879 (8.9%)
  •   taxon: 2,726,046 (3.8%)
  •   administrative territorial entity: 1,943,285 (2.7%)
  •   architectural structure: 3,159,472 (4.4%)
  •   occurrence: 3,898,674 (5.4%)
  •   chemical compound: 1,188,724 (1.7%)
  •   film: 294,370 (0.4%)
  •   thoroughfare: 630,794 (0.9%)
  •   astronomical object: 4,601,733 (6.4%)
  •   Wikimedia list article: 404,454 (0.6%)
  •   Wikimedia disambiguation page: 1,358,230 (1.9%)
  •   Wikinews article: 195,900 (0.3%)
  •   scholarly article: 22,574,314 (31.5%)
  •   other P31/P279: 18,284,676 (25.5%)
  •   no P31/P279: 3,973,469 (5.5%)
Module:Statistical data/by project/classes, 2020-02-16

Between "other P31" and "unknown", there should be about 400,000 to 450,000 WS "sources".

  • 450000 is from link[enwikisource,zhwikisource,dewikisource,frwikisource,arwikisource,ruwikisource,eswikisource,itwikisource,ptwikisource,plwikisource,cswikisource]
  • 400000 is from User:Pasleim/Connectivity#Wikisource: items of the same wikis with 0 sitelinks.

Not sure how to add it. Only 200000 of the 450000 have P31. --- Jura 22:51, 15 June 2015 (UTC)

I am not sure I really understand what you mean. I think items with Wikisource sitelinks should have a P31 like other items. We should possibly add one or a few other classes for the pies, possibly "text" or "work", but we should be careful that classes do not overlap. --Zolo (talk) 07:52, 16 June 2015 (UTC)
Half of the items have P31; the others still need to get them. User:Pasleim/Connectivity#Wikisource also shows links from enwikisource to enwiki. To some extent, these may be links mixing things up. --- Jura 08:15, 16 June 2015 (UTC)


Going through Wikidata:Database_reports/Popular_items, some others we might want to add:

A few others:

  • gene (Q7187) 96559
  • film (Q11424) 162511
  • street (Q79007) 249561

Not sure how to define them.

chemical compound: 26,761 (0.2%) should probably be dropped. -- User:Jura1


@Zolo: I tried to add 4 classes, but I had to remove them as I found two issues:
It seems that this means there is an overlap. In this case, it's between "architectural structure" and "thoroughfare". Both include bridges. --- Jura 16:14, 19 June 2015 (UTC)
@Jura1: I've fixed the template-breaking issue (but I think that when we add new classes, we should try to add them to all projects).
It may be fine to have overlapping classes in Module:Statistical data/classes, as long as they are not used in the same pie chart. In practice, this means that when two classes overlap, only one of them should be used in Module:Show stats.sitecontent. -Zolo (talk) 11:04, 20 June 2015 (UTC)
Agree, but WS/WN classes aren't relevant to most. I will try to add film to enwiki. Street might do, and it could be added as "noclaim[31:79007]" on architectural.
BTW, is there a way to keep the colors the same? Where shall we keep the definitions for the queries? At some point it would be good to have a tool to fetch the numbers. --- Jura 11:08, 20 June 2015 (UTC)
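
For illustration only, the "no overlap within one pie chart" rule discussed above could be checked with something like the following Python sketch. It is not the code used by Module:Statistical data; the function name and the item IDs are hypothetical, and it assumes the set of items matching each class has already been fetched.

```python
from itertools import combinations


def overlapping_pairs(classes):
    """Return {(class_a, class_b): shared_items} for every pair of classes that share items."""
    overlaps = {}
    for (name_a, items_a), (name_b, items_b) in combinations(classes.items(), 2):
        shared = items_a & items_b
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


# Hypothetical item IDs: a bridge matches both classes, as noted in the thread.
classes = {
    "architectural structure": {"bridge_1", "house_1"},
    "thoroughfare": {"bridge_1", "street_1"},
    "film": {"film_1"},
}

overlaps = overlapping_pairs(classes)
for (name_a, name_b), shared in overlaps.items():
    print(f"{name_a} / {name_b}: {len(shared)} shared item(s)")
```

If the result is empty, the class counts can be summed safely in a single pie chart; otherwise one class of each reported pair would have to be left out or narrowed.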

Dense areas diagrams


I think the diagrams should all have the same scaling. With different scalings, it is nearly impossible to compare them visually.--Kopiersperre (talk) 20:42, 21 June 2015 (UTC)

It seems to require some changes to Module:Chart, and I do not fully grasp this module.
I intended to compare different cities for the same language, more than the same city across languages. If we scale all languages on enwiki, small Wikis will have very small bars, and will not be so readable. --Zolo (talk) 06:15, 26 June 2015 (UTC)

Bugs in time value statistics


Great work, and very interesting indeed. However, the time/occupation statistics have a flaw that creates wrong outputs. They should only consider persons whose date of birth is precise at least to the year. There are many people whose date of birth is only known to the decade or century, and in these cases the stored date is rather arbitrary. Often, it is the last date in the century, so that "20th century" births are mapped to year 2000 if you ignore the precision. This is why an incredible 2717 artists in German Wikipedia are claimed to be born after 2000. Note that the effects of this bug may not be the same across time intervals and professions. In particular, nn50-nn99 intervals rarely benefit from imprecise dates. So people with unknown birth dates are moved to the nn00-nn49 intervals. This seems to be much more common for artists than for politicians. --Markus Krötzsch (talk) 20:59, 2 July 2015 (UTC)

@Markus Krötzsch: yes, I have noticed that, but I used user:Magnus Manske's wikidataquery, which ignores precision.
I originally used periods like 1901-1950, which produced more sensible outputs for the 21st century. However, my understanding was that the timestamp is supposed to be the start of the period, i.e. "1900-01-01 precision = century" means between 1900 and 1999. If so, and absent real precision filters, it seems more sensible to have 2000 in a 2000-2049 period than in a 1951-2000 period. It would also mean that most of the "birthdate = 2000, precision = century" values are wrong and should be replaced with birthdate = 1901.
Do you have data about birth dates > 1700 with precision worse than decade? I sense there are sufficiently few that my stats are still useful in an informal context, except for the 21st century, where the issue mentioned above has a disproportionate impact.
Best would probably be to switch from Wikidataquery to another system, but I am afraid I won't have time to learn SPARQL right now. --Zolo (talk) 07:22, 3 July 2015 (UTC)
Good point Markus!
Shall we have the default values changed to 19th century = 1801-01-01? There was some discussion about this in the forum, but I don't think a corresponding request came out of it. --- Jura 12:13, 4 July 2015 (UTC)
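
As a rough illustration of the precision filter suggested above, here is a minimal Python sketch. It is not how the statistics on the page were produced: the function names and the sample data are hypothetical, and it assumes each birth date is available as a (year, precision) pair using the Wikidata time precision codes (7 = century, 8 = decade, 9 = year, 10 = month, 11 = day).

```python
from collections import Counter

YEAR_PRECISION = 9  # Wikidata time precision: 7 = century, 8 = decade, 9 = year


def half_century(year):
    """Return the nn00-nn49 or nn50-nn99 interval that contains `year`."""
    start = year - year % 50
    return f"{start}-{start + 49}"


def count_births(births, min_precision=YEAR_PRECISION):
    """Count births per interval, skipping dates less precise than `min_precision`."""
    counts = Counter()
    for year, precision in births:
        if precision >= min_precision:
            counts[half_century(year)] += 1
    return counts


# Hypothetical sample: (year, precision) pairs. (2000, 7) stands for a
# "20th century" birth stored as the year 2000.
sample = [(1912, 11), (1987, 9), (2000, 7), (1900, 7), (1950, 8), (2003, 11)]

naive = count_births(sample, min_precision=0)  # ignores precision
filtered = count_births(sample)                # year precision or better

print("ignoring precision:", dict(naive))
print("year or better:    ", dict(filtered))
```

In this made-up sample, the century-precision entry stored as 2000 lands in the 2000-2049 interval when precision is ignored and disappears once the filter is applied, which is the effect described for the artists born "after 2000".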

Storage of basic statistics


What do you think of Q20642098? We could probably find a bot to update it regularly. --- Jura 12:13, 4 July 2015 (UTC)

More Wikipedias?


Thank you, this is fantastic!

Can we get the breakdown pie-charts for more Wikipedias? If not absolutely all of them, at least all of them with more than 10K articles? It could be in a sub-page if there's concern it would make the main page too long. Thanks! Ijon (talk) 17:09, 12 July 2015 (UTC)

You can add them to Module:Statistical_data/by_project/classes. --- Jura 17:17, 12 July 2015 (UTC)

Selection of classes


Geographical object

  • where are the lakes, mountains, etc.?
  • "claim[31:618123] and claim[625]" [1] should probably be empty

80.134.89.98 23:13, 19 July 2015 (UTC)

@Zolo: Because if it has a location (P625), it is a concrete physical object and can be better classified than only with "geographical object" (Q618123). TimurKirov (talk) 20:41, 21 July 2015 (UTC)
What do you think about merging administrative territorial entity and thoroughfare with other geographical objects into a new group: entities related to Wikimapia (omitting the concepts that overlap with architectural structure)? Dawid2009 (talk) 11:28, 4 March 2017 (UTC)

Work


@Zolo: Instead of film, please use Q386724. TimurKirov (talk) 17:21, 22 July 2015 (UTC)

Property:P31:Q482994 (music album) could be useful, too. It could be grouped together with Q134556 (single) and ? (song) as musical works. --Papuass (talk) 11:56, 4 August 2015 (UTC)
I think we could have several more meaningful subgroups.
I'd keep films separate, as freebase has quite a large set of content we might eventually include. --- Jura 12:01, 4 August 2015 (UTC)
I suggest changing film and artificial entity into media franchises and inventions. If Human were forked into biographies and anthropology, this concept would be right. Dawid2009 (talk) 08:47, 4 March 2017 (UTC)

Fork of the Human into Biographies and Anthropology

What do you think about including statistics on biographies and on concepts related to anthropology? Dawid2009 (talk) 08:47, 4 March 2017 (UTC)
Hi user:Dawid2009, not sure this is what you mean, but in the charts, "human" only means items that have instance of (P31) human (Q5). Items that are about ethnographic concepts, ethnic groups etc. are in "other P31/P279" and are presumably much less numerous. --Zolo (talk) 14:34, 4 March 2017 (UTC)
OK, I understand your explanation, but I still don't know what exactly is in human. Are there just biographies, or are there also some other concepts in this group? Based on your reply, I see that concepts related to cultural anthropology aren't included in humans (they are included in "other P31/P279"), but what about biological anthropology? Is it included in humans, taxons or "other P31/P279"? If humans includes only biographies, what do you think about changing the title from humans to biographies? Then the title would not be misleading, since humans can be associated with anthropology. Dawid2009 (talk) 14:48, 4 March 2017 (UTC)
Generally, in Wikipedia there are quite a lot of terms related to anthropology (https://commons.wikimedia.org/wiki/File:Wikipedia_content_by_subject.png - stats of ENwiki from 2008). Dawid2009 (talk) 14:53, 4 March 2017 (UTC)

Cawiki


Would it be possible to add the graph for the Catalan language? Thanks. Paucabot (talk) 18:07, 20 July 2015 (UTC)

Thanks. Later I found it. Paucabot (talk) 20:30, 22 July 2015 (UTC)

Maps of geotagged items per Wikipedia (2015)


These maps were made here by Markus Krötzsch, based on the maps made by Denny Vrandečić in 2013. The maps "for Wikidata items with a sitelink from the specified wiki and with a coordinate" show geographical bias, unsurprisingly, and mass article creations. I recommend flipping through them with Mediaviewer (full screen):

See also commons:Category:Wikidata_geocoding. Great stuff, compare with the Wikipedia content analysis! :-) --Atlasowa (talk) 08:39, 23 July 2015 (UTC)


For comparison, see a world map of the Geonames database of places:

Geonames (Q830106) has the GeoNames ID property (P1566) on Wikidata; according to Wikidata:Database_reports/List_of_properties/Top100, 491.951 Wikidata items currently use the GeoNames ID property. --Atlasowa (talk) 09:20, 23 July 2015 (UTC)

GeoNames ID (P1566) is used on 913.204 Wikidata items as of now. --Atlasowa (talk) 20:10, 16 January 2016 (UTC)

Actuality


These statistics should always be kept up to date. They are of great importance for teaching people outside Wikimedia. But with statistics that are 3, 4 or 5 years old this does not work; it becomes ridiculous. So it would be great if some people with the knowledge could take care of this regularly. Marcus Cyron (talk) 03:31, 26 January 2019 (UTC)

Zolo? --Succu (talk) 22:15, 26 January 2019 (UTC)
+1 4nn1l2 (talk) 15:35, 30 January 2019 (UTC)
+1 --Epìdosis 16:09, 30 January 2019 (UTC)
I had originally made it essentially by hand, sending separate requests to the former "Autolist" tool for each number. Something similar can probably still be done, but it is time consuming.
It should certainly be automated, but I don't know how to do it. For smaller Wikis, we can probably use the SPARQL query tool (like http://tinyurl.com/ybhb36c5 for Corsican), but unless there is a way to make the query much, much more efficient, it will break for larger Wikis. It may require working directly with the dumps, but I don't think I will be able to do it in the foreseeable future. --Zolo (talk) 08:54, 4 February 2019 (UTC)
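
To make the SPARQL idea above a bit more concrete, here is a hedged Python sketch that only builds a query string for one Wikipedia. The content of the linked tinyurl query is not reproduced here, so this is an assumption about what such a query could look like: the function name is hypothetical, and the query counts items by their direct P31 value only, without following P279 subclasses as the charts do.

```python
def class_count_query(lang_code):
    """Build a SPARQL query counting a Wikipedia's items per direct P31 class.

    Assumption: sitelinks are matched with schema:about / schema:isPartOf,
    as in the Wikidata Query Service data model.
    """
    site = f"https://{lang_code}.wikipedia.org/"
    return (
        "SELECT ?class (COUNT(DISTINCT ?item) AS ?count) WHERE {\n"
        f"  ?article schema:about ?item ; schema:isPartOf <{site}> .\n"
        "  ?item wdt:P31 ?class .\n"
        "}\n"
        "GROUP BY ?class\n"
        "ORDER BY DESC(?count)"
    )


# "co" is the language code of the Corsican Wikipedia mentioned above.
query = class_count_query("co")
print(query)
```

A query of this shape is likely to time out for the largest Wikipedias, which matches the concern raised above about needing the dumps instead.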

Current statistics


Current statistics might be found at

Content on Wikidata by type (also see Section 'Raw Data')

M2k~dewiki (talk) 15:17, 7 September 2024 (UTC)

Interesting, I didn't know of that wikiscan website, which seems very useful. There are a lot of outdated charts in use on Wikipedia etc. which we can identify and possibly replace with newer ones here. Please replace the contents of the page, and the one you struck, with newer visual stats as well as these links. Prototyperspective (talk) 11:25, 8 September 2024 (UTC)
M2k~dewiki (talk) 11:37, 8 September 2024 (UTC)