Wikidata:Wikidata Concepts Monitor/WDCM Journal

Welcome to the Wikidata Concepts Monitor Journal! Here we present interesting and hopefully useful WDCM-generated insights into the ways Wikidata is used across the many Wikimedia projects, on a weekly basis. We track over 800 Wikimedia projects and collect the item usage statistics. Once we have the monthly update of our databases ready, we run statistical and machine learning analyses over the collected data sets in an attempt to discover as many informative and useful patterns in Wikidata item usage as possible. You can easily access many standardized results based on the latest monthly update from any of the WDCM Dashboards (WDCM Overview, WDCM Usage, WDCM Semantics, WDCM Geo).

If you have any ideas on what WDCM use cases you would like to see published on this page, please let us know via the WDCM User Feedback Page.

January 2018.

§ WDCM Journal, 4. January 2018. The Twisted Geography of Wikidata

Today we will transform the World's geography in accordance with the Wikidata usage of Country items in Wikipedia and other projects. A gargantuan task? Not at all.

With the development of the Wikidata Concepts Monitor (WDCM) in 2017, an analytical system that tracks, analyzes, and visualizes item usage across the Wikimedia projects, we are just beginning to gain interesting insights into the ways our editors use Wikidata. The following illustrative example presents a cartogram of Wikidata country (Q6256) item usage. WDCM collects the Wikidata item usage statistics by tracking the number of unique pages in a particular project that use a particular item. The item usage totals are simply sums of the per-project statistics. In this cartogram, the world map is deformed by a smart {cartogram} GIS algorithm in R until the area of each country becomes proportional to the respective Wikidata item usage statistic in the Wikimedia universe.

So this is what the World would look like if each country were as large as the usage of its Wikidata item:

Wikidata Country (Q6256) Item usage across the Wikimedia. Based on the 16. December 2017 WDCM update.

Fun? Or not: it is not difficult to spot the pronounced North-South divide in Wikidata usage from the cartogram.

Don't forget to check out the WDCM Dashboards, where you can conduct case studies similar to this one and even download various Wikidata usage data sets.

----

Wikidata Concepts Monitor Journal, 4. January 2018. The Twisted Geography of Wikidata. GoranSM (talk) 22:51, 30 December 2017 (UTC)

Visit the WDCM Dashboards for more insights.

Introduction to WDCM: The WDCM Wikidata Project Page

For Techies, Cognitive and Data Scientists: WDCM on Wikitech

Help us make WDCM better: The WDCM User Feedback Page

The most recent Wikidata usage update: WDCM public data sets

WDCM Loves The R Project for Statistical Computing


§ WDCM Journal, 11. January 2018. How much use of Scientific Article, Gene, and Chemical Entity Items Do You Make in Your Project?

Let's start learning from the WDCM data sets.

With WDCM, we can compare the size of a set of Wikidata items (e.g. what one would typically learn from COUNT(?item) or something similar in SPARQL) with the extent of its usage in Wikipedias and other Wikimedia projects. Wikidata is growing on its own: not all created or imported items necessarily end up being used in a project. We thought it would be insightful to provide an analysis of which broad categories of Wikidata items receive high usage across the projects, and which categories (if any) lag in usage.

The WDCM system currently does not track all Wikidata items (more than 42 million at the time of this write-up). Its collection of item usage statistics is guided by a hand-picked WDCM taxonomy that currently encompasses 14 semantic categories: Human (human (Q5)), Wikimedia Internal (encompassing: Wikimedia category (Q4167836), Wikimedia disambiguation page (Q4167410), and Wikimedia template (Q11266439)), Work of Art (work of art (Q838948)), Scientific Article (scholarly article (Q13442814)), Book (book (Q571)), Geographical Object (geographical object (Q618123)), Organization (encompassing company (Q783794), club (Q988108), and organization (Q43229)), Architectural Structure (encompassing monument (Q4989906) and building (Q41176)), Gene (gene (Q7187)), Chemical Entities (encompassing chemical element (Q11344), chemical compound (Q11173), and chemical substance (Q79529)), Astronomical Object (astronomical body (Q6999)), Taxon (taxon (Q16521)), Event (event (Q1656682)), and Thoroughfare (thoroughfare (Q83620)). Taken together, the items in these 14 categories encompass around 83% of all items that are currently found in Wikidata. The choice of these categories was made on an intuitive basis, knowing that they would cover a vast percentage of Wikidata. We are also developing a method that will enable us to meaningfully map the rest.

All 14 categories from the WDCM taxonomy of Wikidata items are presented in the following figure.

The extent of 14 Wikidata item categories vs. their WDCM usage statistics across more than 800 Wikimedia projects. Based on the 16. December 2017. WDCM update.

The horizontal axis represents the respective category size, i.e. the count of Wikidata items in a particular category. The vertical axis represents the respective usage statistic. The WDCM usage statistic for a particular item is a sum of pages that make at least one use of that item in a particular Wikimedia project. For example, if an item is used on three different pages in one, and on four different pages in another project, its total usage statistic is seven. The numbers on the vertical axis are sums of usage statistics across all items encompassed by the respective category. 
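The counting rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the WDCM implementation: the record layout, the function names, and all item, project, and page names are hypothetical.

```python
from collections import defaultdict

# Hypothetical usage records: (item, project, page). Each record says that
# the page in that project uses the item at least once. All values are made up.
records = [
    ("item_x", "project_1", "page_a"),
    ("item_x", "project_1", "page_b"),
    ("item_x", "project_1", "page_c"),
    ("item_x", "project_1", "page_a"),  # same page again: counted only once
    ("item_x", "project_2", "page_a"),
    ("item_x", "project_2", "page_b"),
    ("item_x", "project_2", "page_c"),
    ("item_x", "project_2", "page_d"),
    ("item_y", "project_1", "page_a"),
]


def usage_statistics(records):
    """Return {item: total usage}, summing unique pages per project."""
    pages = defaultdict(set)
    for item, project, page in records:
        pages[(item, project)].add(page)
    totals = defaultdict(int)
    for (item, _project), page_set in pages.items():
        totals[item] += len(page_set)
    return dict(totals)


def category_usage(totals, category_items):
    """Sum the usage statistics of all items that belong to one category."""
    return sum(totals.get(item, 0) for item in category_items)


totals = usage_statistics(records)
# item_x is used on 3 pages in project_1 and 4 pages in project_2: 3 + 4 = 7
print(totals["item_x"])
```

Keeping a set of pages per (item, project) pair is what makes repeated uses of an item on the same page count only once.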

The size of the bubbles that represent the categories in the plot is proportional to the Usage Ratio, which is simply the ratio of the value on the vertical axis (Wikidata Item Usage) to the value on the horizontal axis (Wikidata Item Count). The numbers in the parentheses below the bubbles stand for the respective total item usage statistics.

Let's elaborate upon the Usage Ratio for a moment. The logic behind it is as follows: the more Item Usage surpasses Item Count in a particular category, the more popular that category is in the articles and pages of Wikipedia and its sister projects. If the Usage Ratio is less than one, the category certainly encompasses some items that were never used at all. A special case is the Usage Ratio of exactly 1, where we do not know whether every item was used exactly once, or some items were not used while others were used more than once. On the other hand, a Usage Ratio larger than one implies that at least one item from the category was used more than once across the Wikimedia projects. The blue bubbles represent the categories with Usage Ratio ≤ 1, while the orange bubbles stand for those with Usage Ratio > 1; the former categories are designated as "Critical" (because, the boundary case of exactly 1 aside, they certainly encompass items that were never used in any project).
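A minimal Python sketch of the Usage Ratio reasoning follows. It takes the ratio to be Item Usage divided by Item Count; the function names and the example numbers are hypothetical and do not come from the WDCM data sets.

```python
def usage_ratio(item_usage, item_count):
    """Usage Ratio of a category: total item usage divided by item count."""
    if item_count <= 0:
        raise ValueError("item_count must be positive")
    return item_usage / item_count


def interpret(item_usage, item_count):
    """Say what the Usage Ratio lets us conclude about a category."""
    ratio = usage_ratio(item_usage, item_count)
    if ratio < 1:
        # Fewer uses than items: some item must have zero uses.
        return "some items were certainly never used"
    if ratio == 1:
        # Either every item was used once, or unused and reused items balance out.
        return "ambiguous"
    # More uses than items: some item must have been used more than once.
    return "at least one item was used more than once"


# Hypothetical categories: (total item usage, number of items).
print(interpret(500, 2000))
print(interpret(2000, 2000))
print(interpret(9000, 2000))
```

Both one-sided conclusions follow from a simple counting argument, which is why only the case of exactly 1 stays undecided.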

The following figure reproduces the first in log-log coordinates to avoid the overcrowding of the categories in the lower left corner:

The extent of 14 Wikidata item categories vs. their WDCM usage statistics across more than 800 Wikimedia projects on a log-log scale. Based on the 16. December 2017. WDCM update.

We can see that scientific articles, genes, and chemical entities (encompassing three related categories of items) are found to be of critically low usage by our Usage Ratio criterion. Scientific articles are especially interesting, not only because of their importance in grounding the knowledge and facts, but also because of their prevalence in Wikidata: at the moment of writing, they accounted for 26.6% (!) of all Wikidata items, while being found at the lower end of the item usage scale at the same time!

However, 2017 was a year of massive imports of scientific articles to Wikidata, so our editor communities certainly deserve more time before they start rolling out all these new items in their articles.

Finally, we provide an overview of the Wikimedia projects that make the most use of the categories recognized as critical in this Wikidata usage study:

Wikimedia projects that make most use of the categories recognized as critical in this Wikidata usage study.

As you can see, a huge proportion of item usage in the Scientific Article and Gene categories happens in only two projects (the wikidatawiki itself and arwiki for Scientific Articles, and ukwiki and enwiki for Genes). The situation seems to be better in the category of Chemical Entities where the usage distribution is more equally spread across the projects. In case you need an insight into how these items are typically used in order to develop a strategy of their introduction to your favorite projects, now you know where to learn from.

----

Wikidata Concepts Monitor Journal, 11. January 2018. How much use of Scientific Article, Gene, and Chemical Entity Items Do You Make in Your Project? GoranSM (talk) 01:04, 31 December 2017 (UTC)



§ WDCM Journal, 18. January 2018. Big City Lights

A WDCM homage to Adam Shorland's classic Wikidata maps, and, of course, the Wikidata World Maps June 2015. WDCM goes beyond geo-coordinates in that it maps item usage across the Wikimedia projects too.

This map shows the locations of the top 10,000 most frequently used city (Q515) items. Larger and lighter markers indicate higher item usage.

WDCM Wikidata City Items Usage - December 2017

----

Wikidata Concepts Monitor Journal, 18. January 2018. Big City Lights. GoranSM (talk) 01:17, 31 December 2017 (UTC)



§ WDCM Journal, 25. January 2018. Wikidata and Wikistats: Who Makes Use of How Much Wikidata?

We have combined some Wikistats with some WDCM data sets in order to illustrate which Wikipedias make use of Wikidata the most.

Wikidata usage December 2017 - w. Wikistats indicators per project

The vertical axis in the Figure represents the total Wikidata item usage for each Wikipedia, while the horizontal axis represents the number of articles in the respective projects. The size of the bubbles is proportional to the number of active users, while the color scale maps the ratio of the number of edits to the number of articles. Only the top 25 Wikipedias with respect to Wikidata usage are labeled. The Wikistats data for this Figure were scraped from the List of Wikipedias page using the {rvest} R package; the reader is advised to check the exact definitions of the Wikistats used here on that source page.

The following figure plots the same variables in logarithmic coordinates.

Wikidata usage December 2017 - w. Wikistats indicators per project, log-log plot.

It seems that the number of articles in a given project nicely predicts the respective Wikidata usage. In fact, log(Number of Articles + 1) and log(Wikidata Usage + 1) have an R² of .96, meaning that they have approximately 96% of variance in common. However, the linear relationship is somewhat misleading in this case, due to the presence of influential cases that were detected on both ends of the Wikidata usage scale (putting aside the fact that simple linear regression is a priori not the adequate model for count data...).
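As a rough illustration of how such an R² can be obtained, the following Python sketch computes the squared Pearson correlation of the two log-transformed variables, which equals the R² of a simple linear regression. The function name and the numbers are hypothetical; this is not the analysis code behind the figure.

```python
import math


def r_squared_loglog(x, y):
    """R² of a simple linear regression of log(y + 1) on log(x + 1)."""
    lx = [math.log(v + 1) for v in x]
    ly = [math.log(v + 1) for v in y]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    sxx = sum((a - mean_x) ** 2 for a in lx)
    syy = sum((b - mean_y) ** 2 for b in ly)
    if sxx == 0 or syy == 0:
        raise ValueError("both variables must vary")
    # For one predictor, R² is the squared Pearson correlation.
    return sxy ** 2 / (sxx * syy)


# Hypothetical article counts and Wikidata usage totals for five projects.
articles = [1_200, 45_000, 310_000, 1_900_000, 5_500_000]
usage = [9_000, 600_000, 2_500_000, 30_000_000, 41_000_000]
print(round(r_squared_loglog(articles, usage), 3))
```

A high R² on the log scale does not rule out influential cases, so the caveat in the paragraph above still applies to any such fit.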

As can be seen from the first figure, the English Wikipedia has a larger community of active editors, as well as a somewhat better edits per article ratio, than its Russian and Chinese (zh) sisters, yet it still struggles to match the Russian Wikipedia and falls short of challenging the Chinese (zh) Wikipedia in Wikidata usage.

In the next edition of the WDCM Journal we will take a closer look at the structure and diversity of Wikidata usage in the Wikipedia.

----

Wikidata Concepts Monitor Journal, 25. January 2018. Wikidata and Wikistats: Who Makes Use of How Much Wikidata? GoranSM (talk) 01:17, 31 December 2017 (UTC)


February 2018.

§ WDCM Journal, 1. February 2018. Diversity in Wikidata Usage

What if Wikidata item categories were biological species, and we asked: how diverse is their presence across the ecosystem of Wikipedia? We present the classification of the top 100 Wikipedias with respect to the ways they use Wikidata and encourage you to read the 29. January 2018. edition of the WDCM Journal to learn about some important results on the diversity in their usage of Wikidata. It's a bit of a longer read, but we hope that you might discover some very interesting things about Wikidata along the way.

WDCM currently tracks Wikidata item usage in 14 semantic categories: Human, Wikimedia Internal, Work of Art, Scientific Article, Book, Geographical Object, Organization, Architectural Structure, Gene, Chemical Entities, Astronomical Object, Taxon, Event, and Thoroughfare. We have looked at the distribution of WDCM item usage statistics in these semantic categories (excluding Wikimedia Internals such as templates) across the 100 Wikipedias that make use of Wikidata the most. The data encompass item usage counts from the respective categories in each of the 100 Wikipedias under consideration. It is a simple table in which the columns represent the semantic categories, while every row stands for a particular Wikipedia project. The cells of the table are populated by item usage statistics: how many different pages in a particular Wikipedia (rows) make use of the items from a particular category (columns). The data set was submitted to cluster analysis with {mclust} in R. Essentially, what clustering does with this data set is identify groups of Wikipedias with similar patterns of Wikidata item usage. The probabilistic nature of the algorithm then enabled us to rely on the information-theoretic concepts used in ecology to quantify the diversity of Wikidata usage.

The {mclust} algorithm has determined that the optimal clustering solution for 100 Wikipedia projects encompasses six clusters. The median profiles of Wikidata usage for the Wikipedias under consideration are presented in the following figure. Each panel represents the median Wikidata usage across the categories (horizontal axis, labeled at the bottom of the Figure) in a different cluster of Wikipedias. The language codes in the panel titles are the projects encompassed by the respective cluster.

Wikidata Usage Clusters - December 2017 - top 100 Wikipedias in respect to the WDCM Wikidata Item Usage statistics.

The similarity of the median profiles of Wikidata usage across the six clusters of Wikipedias is only apparent. A common feature is the prevalence of the usage of geographical object (Q618123) and human (Q5) items, except in the sixth cluster (lower right panel), where the usage of the taxon (Q16521) category of items surpasses by far the usage of human (Q5) items. Differences in scale (pay attention to the vertical axes!) as well as the detailed differences in item usage, as we demonstrate next, determine the true nature of diversity in Wikidata usage.

The {mclust} clustering algorithm has also provided a probability of being categorized into any of the six clusters for each Wikipedia. Thus, the Wikidata item usage in any Wikipedia can be represented as a probability distribution. We have used the Hellinger metric to represent the pairwise similarity in Wikidata usage among the Wikipedias. The metric is used to represent the similarity of two probability distributions as a distance, so that one can think of Wikipedias as being represented by points in space, arranged so that those found in proximity of each other in space are at the same time similar in respect to the way they use Wikidata. We have then searched that representational space for the two nearest neighbors to each Wikipedia.
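To make the distance-based view concrete, here is a small Python sketch of the Hellinger distance between two cluster-membership distributions, followed by a two-nearest-neighbor search. The project names and probabilities are hypothetical, and the original analysis was done in R, so this is an illustration of the idea rather than the code that was used.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2)


def two_nearest(name, memberships):
    """Return the two projects whose distributions are closest to `name`."""
    distances = [
        (hellinger(memberships[name], dist), other)
        for other, dist in memberships.items()
        if other != name
    ]
    return [other for _distance, other in sorted(distances)[:2]]


# Hypothetical membership probabilities over six clusters (each row sums to 1).
memberships = {
    "wiki_a": [0.90, 0.05, 0.05, 0.00, 0.00, 0.00],
    "wiki_b": [0.85, 0.10, 0.05, 0.00, 0.00, 0.00],
    "wiki_c": [0.10, 0.80, 0.10, 0.00, 0.00, 0.00],
    "wiki_d": [0.00, 0.00, 0.05, 0.05, 0.10, 0.80],
}
print(two_nearest("wiki_a", memberships))
```

The first returned neighbor corresponds to a black arrow and the second to a grey arrow in a similarity graph of the kind described in this section.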

The nodes in the following graph represent all 100 Wikipedias under consideration. Each Wikipedia node points towards the node representing the Wikipedia that is most similar to it (black arrows) and then to its next most similar neighbor (grey arrows). The node color represents cluster membership. Observe how even the inclusion of links towards the second most proximal neighbor of each node did not give rise to any cross-cluster linkage (i.e. all projects are connected only to those that belong to the same cluster): the separation of clusters obtained from {mclust} is quite satisfactory.

Wikidata Usage Similarity Network in top 100 Wikipedias - based on 16. December 2017 WDCM update

Finally, let's quantify the diversity of Wikidata usage across the top 100 Wikipedias. As already explained, the probabilistic clustering algorithm has discovered six clusters of Wikipedias from the item usage data, and then described each Wikipedia as a probability distribution over the six clusters. These probability distributions are represented by distinct arrays of six numbers, with each array summing up to one (because they are probabilities, of course), and each number in each array telling us how probable it is for a particular Wikipedia to be found in the particular (1 - 6) cluster. The English Wikipedia offers the following understanding of the Diversity index in the context of ecology:

"A diversity index is a quantitative measure that reflects how many different types (such as species) there are in a dataset (a community), and simultaneously takes into account how evenly the basic entities (such as individuals) are distributed among those types."

[Wikipedia, The Free Encyclopedia, s.v. "Diversity index" (accessed December 21, 2017), https://en.wikipedia.org/wiki/Diversity_index]

We have computed the Shannon entropy, which is a commonly used diversity index in ecology, for each probability distribution obtained from the {mclust} algorithm and describing a particular Wikipedia. The entropy was re-scaled to [0, 1] by dividing it by log(6), the maximum entropy of a discrete distribution over six outcomes. The horizontal axis in the following Figure represents Wikidata Usage Diversity as quantified by this relative entropy index, while the vertical axis stands for total Wikidata Usage. The colors represent cluster membership again, while the bubble sizes are proportional to the number of active users in the respective project as obtained from Wikistats. The labeled projects are found among the top 10 most diverse Wikipedias and/or among the top 10 largest with respect to Wikidata usage.
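The re-scaled entropy can be written down in a few lines. The Python sketch below is an illustration with hypothetical membership probabilities; the function name is an assumption and the original computation was carried out in R.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of a discrete distribution, divided by its maximum, log(k)."""
    k = len(probs)
    if k < 2:
        raise ValueError("need at least two outcomes")
    # Terms with p = 0 contribute nothing (the limit of p * log(p) is 0).
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(k)


# Hypothetical cluster-membership probabilities for three Wikipedias.
settled = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]      # firmly in one cluster
mixed = [0.5, 0.3, 0.2, 0.0, 0.0, 0.0]        # spread over three clusters
uniform = [1 / 6] * 6                         # equally likely in all six

for dist in (settled, mixed, uniform):
    print(round(normalized_entropy(dist), 3))
```

A value near 0 means the project sits firmly in one cluster, while a value near 1 means its membership is spread almost evenly over all six.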

Wikidata Usage Diversity: Shannon Diversity Index vs total Wikidata Usage for top 100 Wikipedias. Based on 16. December 2017. WDCM update.

The closer the value of diversity is to 1, the more uniformly distributed is the probability of the respective Wikipedia belonging to any of the six discovered clusters. This means that the current state of Wikidata usage on a Wikipedia of high diversity is also highly uncertain: the usage of different items there is highly variable, and it is rather difficult to predict what the Wikidata usage in that Wikipedia will look like in the future. In other words, the Wikipedias with high Wikidata usage diversity are more dynamic and have probably not yet settled on any characteristic common "strategy" or "interest" in their approach to Wikidata usage.

From the figure, we can observe several Wikipedias belonging to the same cluster (dark orange colored, approximately spread out horizontally above the Wikidata Usage Diversity axis), showing high item usage diversity: euwiki, idwiki, nowiki, plwiki, jawiki, and minwiki. This cluster of Wikipedias shows the most promising Wikidata usage in terms of its future dynamics, namely: it is difficult to predict what will happen there (which items, from which categories, will start receiving a higher or lower usage in these projects).

On the other hand, lower entropy corresponds to lower diversity: Wikipedias of lower diversity are currently settled in their ways of using different Wikidata items. The Wikipedias that now make use of Wikidata the most are found among these less diverse projects: observe the almost perfect vertical alignment of large projects on the lower end of the Wikidata Usage Diversity scale. Thus, where Wikidata is used the most, it seems to be used in some relatively settled, constant way.

It is important to note that the most diverse projects are found neither (a) among those with the lowest nor (b) among those with the highest usage of Wikidata. On the contrary, we can safely conclude that the most diverse projects are those in a state of rather dynamic development now: they have not yet reached the item usage levels of the most developed projects, but their high diversity holds a promise that they might develop in ways different from the ones already established by large projects. Celebrate the differences!

N.B. Every massive import or automated bot tagging in some Wikipedia will represent a (white, luckily) statistical Black Swan for our data sets, so making any firm predictions from the situation as depicted here is not well advised. It is better to adopt the exemplified methodology as a way of thinking about the diversity present in the world of Wikidata usage than to plan any attempts at building real predictive models - because they will certainly fail. The study of Wikidata usage is a par excellence example of an attempt at understanding the pragmatics of a highly complex socio-technical system, and it is too complicated in itself, encompassing too many unobservable, potentially causal factors, to provide more than useful insights and help generate ideas and inspiration for future actions.

----

Wikidata Concepts Monitor Journal, 1. February 2018. Diversity in Wikidata Usage. GoranSM (talk) 16:34, 31 December 2017 (UTC)


§ WDCM Journal, 8. February 2018. What is Love (Q316)? Accessing Wikidata P279 and P31 paths from WDCM

In order to determine which Wikidata classes we need to submit to the Wikidata Concepts Monitor (WDCM)'s analytical procedures, sometimes we need to dive deep into the structure of item and class membership relations. We have started building a visual navigator - the WDCM Structure Dashboard - that can show us the subclass of (P279) and instance of (P31) paths "upward" from the desired Wikidata item, in order to help us find the potentially large classes that we are currently missing in the selection that undergoes WDCM analysis. While building the WDCM Structure Dashboard, we came to the conclusion that this method of Wikidata visualization could be potentially useful to others as well - even those not primarily interested in Wikidata usage statistics. So we have included an option for anyone to provide from one to at most five Wikidata items and create a directed graph of subclass of (P279) and instance of (P31) paths that connect them to entity (Q35120) in Wikidata (note: only the subclass of (P279) paths are formally constrained to have entity (Q35120) as their final destination). Here are some (rather philosophical) examples:

Knowledge (Q9081)
Love (Q316)
Meaning (Q183046)
Human (Q5), Universe (Q1), Life (Q3), Death (Q4)

In all these examples, the dashed links represent the instance of (P31) relations, while the solid links stand for the subclass of (P279) relations. So, anytime you need a visualization of basic membership relations stemming from any Wikidata item that you might be interested in, visit the WDCM Structure Dashboard, navigate to the Make Your Own Network tab, enter up to five Wikidata items and click Create Network. The Dashboard will contact the Wikidata Query Service to fetch the data for you and provide an interactive graph that you can save by simply right-clicking it (hint: zoom in or out first, until you adjust the scaling appropriately - especially for highly complex graphs) and then going for "Save image as".
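The query that the Dashboard actually sends is not described here. Purely as an assumption, a request for the upward P31 and P279 edges could be phrased with SPARQL 1.1 property paths, as in the following Python sketch; the function name and the query shape are hypothetical, and the sketch only builds the query text without contacting any service.

```python
import re

MAX_ITEMS = 5  # the Dashboard accepts from one to five items


def build_paths_query(item_ids):
    """Build a SPARQL query for upward P31/P279 edges from the given items.

    Hypothetical helper: it returns query text only and performs no request.
    """
    if not 1 <= len(item_ids) <= MAX_ITEMS:
        raise ValueError("provide between one and five Wikidata items")
    for item_id in item_ids:
        if not re.fullmatch(r"Q[1-9]\d*", item_id):
            raise ValueError(f"not a Wikidata item id: {item_id!r}")
    starts = " ".join(f"wd:{item_id}" for item_id in item_ids)
    return (
        "SELECT DISTINCT ?child ?property ?parent WHERE {\n"
        f"  VALUES ?start {{ {starts} }}\n"
        "  VALUES ?property { wdt:P31 wdt:P279 }\n"
        # Every node reachable upward from a start item...
        "  ?start (wdt:P31|wdt:P279)* ?child .\n"
        # ...together with its own outgoing P31 / P279 edges.
        "  ?child ?property ?parent .\n"
        "}"
    )


print(build_paths_query(["Q316"]))
```

Each result row would describe one edge of the graph, with the property telling whether it should be drawn as a dashed (P31) or a solid (P279) link. A query of this shape can be expensive for items with many ancestors, so a real implementation may need limits or a different traversal.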

Wikidata Concepts Monitor Journal, 8. February 2018. Accessing Wikidata P279 and P31 paths from WDCM. GoranSM (talk) 19:56, 8 February 2018 (UTC)



March 2018.

§ WDCM Journal, 8. March 2018. Gender equity in Wikidata Usage

We already know that the representation of knowledge and information on different genders is unequal. In Wikidata, for example, approximately 83% of human (Q5) items have sex or gender (P21) of male (Q6581097) and 17% have sex or gender (P21) of female (Q6581072). From the Wikidata Concepts Monitor we can also learn about the statistics of usage of these items across Wikipedia and its sister projects, and it turns out that the proportions are about the same, implying a ratio of 4.89 : 1 in favor of male human (Q5) items. In other words: almost five mentions of a male item are made for every single mention of a female item. We are currently working hard to finish the development of the WDCM Biases Dashboard that will present this and many other sources of inequity in knowledge representation in our projects. In the meantime, and appropriately on the occasion of 8. March, we share some of the most interesting preliminary findings on Gender Bias and the North-South Divide here with you.

A slightly optimistic result is related to the shape of the inequality distribution within male and within female Wikidata item usage. Think of Wikidata usage as a value that can be distributed among a number of individuals in an economy, drawing the following analogy: Wikidata items are taken to be individuals, and the sum of their total usage across our projects is taken to represent the total wealth. So, each Wikidata item's "worth" is measured by its Wikidata usage: the number of pages that make a mention of that item in our projects. Then rank all the items according to their usage in the Wikimedia universe and divide them into a number of equal-sized groups. For each group of items compute its share in the total Wikidata usage ("wealth"), and plot the cumulative percentage or proportion of Wikidata items covered by each successive group against their share (also expressed as cumulative percentage or proportion). The result is the Lorenz curve, a concept in economics widely used to express the distribution of economic inequality in a particular society. The following figure presents two Lorenz curves, for male and female Wikidata human (Q5) item usage.

The Lorenz curves based on the Wikidata usage statistics of male and female human (Q5) items in Wikidata and sister projects.

The Lorenz curves in the figure were derived from 569,418 observations of female and 2,786,000 observations of male item usage. The diagonal represents the line of equality: a category of items found to have a straight, diagonal Lorenz function would be one where all items are mentioned exactly the same number of times. The empirical Lorenz curves for the female (red) and male (blue) Wikidata item usage are found far away from the line of equality, which is neither strange nor unexpected. The finding is similar to, for example, the well-known facts about word usage frequency distributions (see: Zipf's Law) in any language, where a small fraction of words is used predominantly in comparison to the rest of the words, which are used rarely. The closer an empirical curve gets to the line of equality, the more equal the distribution of wealth - Wikidata usage, in this case - it represents. From the figure we can see that the usage of female items is slightly more equally distributed than the usage of male items, which is also witnessed by the smaller value of the Gini coefficient - another important measure of inequality in the distribution - for female items (.62) in comparison to male items (.65).
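For readers who want to try the construction themselves, here is a minimal Python sketch of a Lorenz curve and the Gini coefficient derived from it. It treats every item as its own group, which is the finest possible version of the equal-sized groups described above; the function names and usage counts are hypothetical and do not reproduce the reported values.

```python
def lorenz_curve(usage):
    """Return Lorenz curve points (cumulative share of items, cumulative share of usage)."""
    values = sorted(usage)  # rank items from least to most used
    total = sum(values)
    if not values or total <= 0:
        raise ValueError("need at least one item with positive usage")
    n = len(values)
    points = [(0.0, 0.0)]
    running = 0
    for rank, value in enumerate(values, start=1):
        running += value
        points.append((rank / n, running / total))
    return points


def gini(usage):
    """Gini coefficient: 1 minus twice the area under the Lorenz curve."""
    points = lorenz_curve(usage)
    # Trapezoid rule over consecutive Lorenz points.
    area = sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
    return 1 - 2 * area


# Hypothetical usage counts for two small sets of items.
equal_usage = [5, 5, 5, 5]
skewed_usage = [0, 0, 0, 10]
print(gini(equal_usage), gini(skewed_usage))
```

With this finite-sample definition, a set of n items in which a single item receives all the usage has a Gini coefficient of (n - 1) / n rather than exactly 1.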

Another interesting finding from this preliminary study of gender bias refers to its relation to another bias in knowledge representation: the North-South Divide. The following two maps represent the place of birth (P19) of all male and female items under consideration in this study; the top map (red) stands for female and the bottom (blue) for male items. The size of the circular markers in both maps corresponds to the total Wikidata usage of all human (Q5) items that represent persons born in the respective place. For both male and female items, the usage of items that represent people born in places situated above the equator hugely exceeds the usage of items for people born below the equator. The estimates follow under the maps.

Female Q5 items birthplaces by Wikidata usage.
Male Q5 items birthplaces by Wikidata usage.

The preliminary results show the following: Wikidata usage according to the person's birthplace is split approximately 95% to 5% between North and South; the split varies only slightly between the traditional male and female gender categories. The gender bias, although large, is thus, perhaps unexpectedly, less pronounced than this prevailing characteristic of the distribution of knowledge in reference to geographical entities.

April 2018.