Wikidata:Wikidata Concepts Monitor


[NOTE] The User Feedback collection for this system takes place here. Visit the WDCM Journal for latest results on Wikidata usage across the Wikimedia projects.

The Wikidata Concepts Monitor (WDCM) (aka: Q42376073) is an analytical tool that enables you to browse and build an understanding of the way Wikidata is used across the Wikimedia projects.

This page provides a non-technical overview of WDCM. Technical details are found on the corresponding Wikitech page.

This page is not intended merely as a project presentation. An interested reader will find all the essential facts about the project listed here, true. However, the real intention is to show you how to use the Wikidata Concepts Monitor statistical system to discover the wonderful, immensely complex universe of Wikidata usage across some 800 client projects in the Wikimedia ecosystem. The WDCM is designed to become a path towards discovery: by following the examples listed here you can not only learn to work with a system that might improve your understanding of Wikidata, but you could also find yourself involved in adventurous attempts to learn and discover more of it. The page includes several examples of WDCM usage and provides two more elaborate use cases. We recommend that you follow the examples and the use cases at the beginning of the page before starting your own research with the WDCM.

For those interested in technical details. Visit the WDCM Wikitech Page: it should provide enough information about how WDCM works. In a nutshell, the current version of the WDCM system is developed in R and Python, and supported by Apache Hive, Apache Spark, and Apache Sqoop to enable Big Data processing of the wbc_entity_usage tables that provide client-side tracking of Wikidata usage over sister projects. MariaDB provides the back-end support for the WDCM dashboards, while the dashboards themselves are built in the RStudio Shiny framework and hosted by an open source version of the RStudio Shiny Server. The WDCM Engine scripts perform many data pre-processing procedures before the machine learning phase takes over to deliver the results to the front-end, utilizing Latent Dirichlet Allocation and t-SNE among other algorithms. The front-end data visualizations are developed primarily in {ggplot2}, {visNetwork}, and {rBokeh}.

WDCM loves R (Q206904), the lingua franca of data science (Q2374463) and the main development language of WDCM.

Usage: the outer life of Wikidata


On November 14, 2017, in the early CET hours, the following SPARQL (Q54871) query

SELECT (count(*) AS ?count) WHERE { ?item wdt:P31/wdt:P279* wd:Q13442814. }

run from our Wikidata Query Service resulted in a count of 9,301,454 items under scholarly article (Q13442814). However, the Wikidata Concepts Monitor (Q42376073), which is a piece of statistical software developed in R, reports from its Usage Dashboard that the semantic category - or a concept - of a scientific article is used on only 1,762 distinct pages across all editions of Wikipedia.
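For readers who prefer to run the count from a script, the following is a minimal sketch rather than the method used by WDCM itself. It assumes the public Wikidata Query Service endpoint at https://query.wikidata.org/sparql and the standard SPARQL JSON result layout; the function names and the User-Agent string are illustrative placeholders. The count returned today will differ from the November 2017 figure, and a query this heavy may time out.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata Query Service endpoint (assumed here; not named on this page).
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# The query quoted in the text above, unchanged.
COUNT_QUERY = (
    "SELECT (count(*) AS ?count) WHERE "
    "{ ?item wdt:P31/wdt:P279* wd:Q13442814. }"
)


def build_request(query: str) -> urllib.request.Request:
    """Prepare a GET request that asks the endpoint for SPARQL JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    headers = {
        "Accept": "application/sparql-results+json",
        # Placeholder identification string; replace with your own contact info.
        "User-Agent": "wdcm-doc-example/0.1 (illustrative placeholder)",
    }
    return urllib.request.Request(WDQS_ENDPOINT + "?" + params, headers=headers)


def parse_count(payload: str) -> int:
    """Extract the ?count binding from a SPARQL 1.1 JSON result document."""
    bindings = json.loads(payload)["results"]["bindings"]
    return int(bindings[0]["count"]["value"])


def fetch_count(query: str = COUNT_QUERY, timeout: float = 60.0) -> int:
    """Run the query over the network and return the item count."""
    with urllib.request.urlopen(build_request(query), timeout=timeout) as resp:
        return parse_count(resp.read().decode("utf-8"))
```

Calling `fetch_count()` performs the network request; `build_request` and `parse_count` are kept separate so that the request construction and the result parsing can be checked offline.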

The concept of a scholarly article (Q13442814) has a life of its own: it is a part of (P361) peer reviewed proceedings (Q16735857), scientific journal (Q5633421), and scientific publication (Q591041); it has cause (P828) of academic writing (Q4119870); it is the equivalent class (P1709) to ScholarlyArticle from; it is different from (P1889) an academic journal article (Q18918145). The description of what it means to be a scientific article in Wikidata constitutes its inner life as a Wikidata item. On the other hand, the fact that it is used on only 1,762 distinct pages across Wikipedia tells us something about its outer life: simply, what happens to a scientific article when it goes out and starts playing a role in the pages of Wikipedia and its sister projects.

WDCM Example 1. In order to reproduce this result from the WDCM, (1) visit the WDCM Usage Dashboard, (2) select Tabs/Crosstabs while at the Dashboard tab, select _Wikipedia in the Search projects field, (3) select Scientific Article in the Search categories field, (4) click Apply Selection, and (5) download the .csv file (that you can open directly in Libre Office Calc, Microsoft Excel, and similar applications) by clicking the Data (csv) button beneath the Categories chart to the left.
WDCM Usage Dashboard: what Wikipedia projects make use of Scientific Articles (Q13442814) the most?

On cats and mice


There is an individual, Brunhilda, and a category, house cat (Q146), of which Brunhilda is an instance of (P31). In Wikidata, both are items. In cognitive science and cognitive psychology, one would typically say that there is a concept of a cat - a mental representation of a class of furry animals - of which Brunhilda is an exemplar (being a concept itself, of course). Concepts are considered to be extremely complicated mental entities that form the building blocks of human thought. They are said to be stored in our individual semantic memories. However, what exactly it is that we store about concepts - what forms the basis of psychological semantics - and how concepts are used is critically determined by discourse: how concepts relate to the empirical world that surrounds us (a question partly answered by semantics) and how we use them to exchange with others what we know about the World and what we want to do with it (pragmatics).

Wikidata is an attempt to build an abstract formal world of entities and relations that is powerful enough to express many possible truths about the Universe (Q1) in a rather flexible way. It is populated by new pieces of knowledge every day, evolving on its own. The usage of Wikidata in encyclopedic articles on Wikipedia (or any other pages of its sister projects) is a different story altogether. Tom Cat (Q1839152) and Jerry (Q1962394) are a cat and a mouse, certainly related by their belonging to the very abstract category of animal (Q729) ("kingdom of multicellular eukaryotic organisms"). At the same time, Tom is a fictional cat (Q27303676), while Jerry is a fictional mouse or rat (Q24668268). Imagine that we were able to discover whether the writers of Wikipedia use Tom and Jerry more often together (a) in the discussions of animated films, children, or entertainment industry, or (b) in the context of multicellular eukaryotic organisms and theoretical biology in general?

Well, that is exactly what WDCM is meant for. While we could, at least in theory, read a potentially large number of Wikipedia articles in a search for Tom and Jerry, doing qualitative content analysis along the way, and decide what the most typical context of usage is for these two fictional mammals, WDCM runs its statistical machinery over big data (Q858810) to learn about many contexts of usage for many semantic concepts in order to help us answer questions like this one. In its operation, WDCM is fully dependent upon the nature of the data that we provide to it: it counts how often and where we use particular Wikidata items to feed its statistical engine, which means that in the end it provides a reflection of the interests, strategy, spontaneous associations, and reasoned usage of concepts on the part of the large community of editors on Wikimedia projects.

Dive into usage context


Let's use WDCM to discover more on how people make use of human (Q5) across the Wikimedia projects.

WDCM Example 2. For example, you can now visit WDCM Semantics Dashboard, and then (1) select the category Human in the Select Semantic Category field on the Semantic Models tab, (2) Select Semantic Topic: Topic 1, and discover that there is one potentially interesting context of usage of the Wikidata items in the category Q5 (human) that encompasses mostly "important historical figures and politicians". Now, if you only change Topic 1 to Topic 2 in the Select Semantic Topic field, you will discover another interesting context of usage that can be loosely described as "celebrities", where you will find many popular singers and actors, among others. Take a look at Topic 3: what context of Q5 usage does it represent?
WDCM Semantics Dashboard, Category: Human, Topic: 3.
What have we learned? The WDCM has discovered these different contexts of Q5 item usage by inspecting the way their usage distributes across many Wikimedia pages and projects. We now know that there must be some projects and communities that were interested in writing about historical events and politics, on the one hand, and some others that focused their attention more on fun, entertainment, and arts, on the other hand: Topics 1 and 2 of Q5 usage. Take a look at Topic 3: you will find many scholars and writers there, and most of them French (at least among the most important items in this semantic topic). Now, scroll down to the bottom of the dashboard page to discover which Wikimedia projects seem to be the most prominent ones in this context of Q5 item usage (and that would be French and Breton Wikisource), and think: does it make sense?
WDCM Semantics Dashboard. The most important projects in Category: Human, Topic: 3.

There are important lessons to learn about Wikidata usage here already. While the prominence of the French Wikisource makes sense in respect to the usage of Q5 items in this context, as well as the prominence of the Breton (Q12107) (a severely endangered language spoken in Brittany, France) Wikisource, the prominence of the Czech (ranked third), Estonian, or Russian Wikisource projects does not necessarily make sense prima facie. This finding tells us the following: there must be a community of editors out there that have shown a particular interest in French culture on the respective projects. If you go to the Project Semantics tab on this WDCM dashboard, enter only rowikisource in the Select Projects field, and click Apply Selection, a chart will be produced from which you can learn that the dominance of this context of Q5 usage on rowikisource is some 7.85%, compared to Topic 1 that reaches almost 87%. First lesson: on many occasions, a particular semantic context of Wikidata usage is really strong in only a few projects. Second lesson: Wikidata usage is determined not only by what one would expect to be logical in some sense - be that logic of a purely formal-semantic or of a cultural nature - but by what happens in the contributing communities as well. One consistent editor who is interested in a particular topic, not to mention a group of them, can change the semantic context of a project significantly.

WDCM Example 3. Go to the WDCM Semantics Dashboard, and (1) Select Category: Geographical Object from the Semantic Models tab, Select Semantic Topic: Topic 6. The first chart produced by the dashboard will tell you that items located in China play a significant role in this context of Wikidata usage of the semantic category of geographical objects. Now (2) scroll down to the bottom of the dashboard page and have a look at the Wikimedia projects in which this context of usage is important: zhwiki, zhwikisource, iiwiki... Surprised? Sometimes, the results that WDCM brings back receive a straightforward, direct interpretation, like this one. A context of Wikidata usage that is characterized by geographical entities in China is found, and mainly projects in languages that are characteristic of the Chinese culture make use of Wikidata items in that context. However, most of the time the situation is much more complicated and a myriad of factors influencing the nature of the context and its distribution across the projects must be taken into account to understand its origin and make sense of it.

The pragmatics of Wikidata


To put it simply: while the inner life of Wikidata is all about the structure of its

  1. data model, or its ontology (Q324254), and
  2. the introduction of new items, properties, statements, qualifiers, references, labels... alongside the debate on how the latter (should) relate to the former and what possible instantiation of the permissive Wikidata structure reflects the empirical world in a most desirable way,

the outer life of Wikidata is all about the way our communities use all this collaboratively developed, structured knowledge across approximately 800 projects that currently track its usage on their pages.

Wikidata is a symbolic system, and as such its definition must encompass both semantics and syntax. However, there is a third component of any natural symbolic system: its pragmatics. In an analogy to the study of natural language, where pragmatics is defined as "... a subfield of linguistics and semiotics that studies the ways in which context contributes to meaning...", WDCM is meant to become our method to study how the editors map the content and the formal structure of Wikidata to the page content of Wikimedia projects. As a consequence of having such a method at our disposal, we can begin to learn how Wikidata is used, i.e. how the meaning of the knowledge it stores gets altered by its contextual usage across the pages and projects - the usage that is mediated through the minds of Wikimedia contributors.

Very important things about this system


If you think you could make use of the WDCM and enjoy learning about Wikidata usage in the Wikimedia universe, you probably need to prepare yourself to encounter a rather complex world of findings and reports. We hope for the WDCM system to become a path towards discovery. However, the path is not straightforward. WDCM is the first step towards building an understanding of the highly complicated structure of Wikidata usage. This system can help you discover which Wikidata client projects are similar and in what respect, which semantic categories of items are used more or less frequently across the projects, how items connect in respect to how similarly they are used by our communities, what the most popular items per project are, and many more (hopefully) interesting things. If used properly and with understanding, it can be your navigation tool in the immensely interesting and complex field of Wikidata user behavior.

In general, what you should always take into your consideration while browsing the WDCM dashboards, is the following:

WDCM does not study all of Wikidata

  • The current version of the WDCM does not encompass all Wikidata items. This fact is not due to technical constraints as much as it is related to methodological constraints, some of which will be discussed below. Further reading: elsewhere on this page we discuss the WDCM Taxonomy, a principle to select the items that are tracked for their usage across our projects and that undergo WDCM analyses.

WDCM is agnostic in respect to the structure of Wikidata


What influences the nature of the semantic contexts discovered by the WDCM


Well, this is rather important if you plan to understand what WDCM can do for you.

  • The core algorithm. In order to discover the semantic topics (i.e. contexts of Wikidata usage) across some 800 Wikimedia projects and a selection of semantic categories of items from Wikidata, the WDCM employs a standard algorithm used in text-mining and Natural Language Processing known as the Latent Dirichlet Allocation (LDA). While understanding the mathematical and computational details of the way LDA works is not essential for a WDCM user, reading through the less technical Wikipedia page on Topic Models - a general class of mathematical models used in text categorization - might prove to be helpful. For those who do the reading: the only difference is that we don't use the classic term-frequency matrix, but a project-item usage frequency matrix instead. The nature of this algorithm, of course, heavily influences what semantic contexts of Wikidata usage will be discovered.
  • The nature of the Universe. Of course, discovering that projects written in the languages of China are ranked highly in a semantic context characterized by items characteristic of the Chinese culture is exactly what one would expect to happen. WDCM tends to group similarly used things together. Only from time to time will its results match your everyday expectations about the Universe quite precisely. However, WDCM will do an even more important thing for you by showing you what information you are missing in order to fully understand the world of Wikidata (if that is possible at all).
  • Idiosyncratic phenomena. Let's study the following example for a while. It introduces a rather unusual situation from which we can learn how the nature of the WDCM system in itself introduces additional constraints in the interpretation of its results. Note: this is a very complicated example, but be prepared to encounter many similar things during your journey into Wikidata usage with the WDCM, so it is highly recommended to study it.
WDCM Example 4. Go to the WDCM Semantics Dashboard, and (1) Select Category: Event from the Semantic Models tab, Select Semantic Topic: Topic 4. The first chart produced by the dashboard will tell you that the item 2014 Indian general election in Tamil Nadu (Q15894105) plays the most prominent role in this context, followed by a list of items mostly about Giro d'Italia (Q33861) (?!!) whose importance in this semantic topic (look at the x-axis!) is far, far less than that of the first ranked item. What in the world do the elections in India have to do with the Giro d'Italia - a cycling road race held in Italy? Scroll down to the bottom of the dashboard page to learn about the Wikimedia projects in which this context of usage is important, and you will see that only tawiki and arwiki (again, take a look at the x-axis of the plot) are significantly interested in this topic, followed by a list of projects that barely make use of it (itwiki and trwiki ranked among the highest). So, the first thing that we learn is that this context presents something rather specific. We have inspected Wikidata and found out that we can explain why itwiki and trwiki are highly ranked in this semantic context: they consistently make use of many of the Giro d'Italia items from Wikidata. However, it remains unclear what brings together 2014 Indian general election in Tamil Nadu (Q15894105), Gulf War (Q37643), and many Giro d'Italia items. From the WDCM Usage Dashboard, we have used the Project Report section on the Usage tab to find out that the top projects in this semantic context are indeed found among the Wikimedia projects that make use of Wikidata the most (tawiki ranked 16th, arwiki holding the 301st place - not bad, almost in the upper third of projects in respect to Wikidata usage, itwiki in the 13th position, and trwiki in the 39th place, all in respect to the total usage of Wikidata per project). Thus, the finding is not a consequence of merely having sparse data.
Finally, we were able to understand the nature of this semantic context only by looking at the WDCM under the hood to find out on how many distinct pages on tawiki the item 2014 Indian general election in Tamil Nadu (Q15894105) was used, and the number is: 9950, a rather high usage statistic. For some reason, the community around tawiki was very focused on this event at some point in time. The WDCM has discovered this fact by means of statistical learning and separated this context of Wikidata usage into a semantic topic per se, in order to mark that something very specific but highly representative of the tawiki project has happened: in fact, the item under discussion is the 5th ranked Wikidata item in respect to its usage on this project. Now, the question: why haven't the first four most frequently used items on tawiki influenced the result in this way? We first visited the WDCM Semantics Dashboard: on the Project Semantics tab, select tawiki, and you will learn that it scores 100% in this very semantic context. Next, we again went under the hood of the WDCM and inspected the source data to find out that the item 2014 Indian general election in Tamil Nadu (Q15894105) is the only Wikidata item on tawiki with any significant usage in the semantic category of Events at all. And that is the message that the WDCM had for us: there is a very specific context (Topic 4 in Events) in which a single item (Q15894105) from a particular category (Event) is almost exclusively used in a particular project (tawiki). Again, question: what do the Gulf War and Giro d'Italia have to do with all this? The answer is: probably nothing. Under the theoretical model employed in WDCM, the one upon which the LDA algorithm is based, all items from a particular semantic category play some role in each of the discovered semantic topics (i.e. contexts of usage).
In other words, no matter how specific a particular semantic context is, and the one under discussion is quite specific, all items must fit into it and be represented by some importance score in it (actually, it's the probability of them being used in this context). The various Giro d'Italia items and the Gulf War simply turned out to be at the top of the procession of a large number of very, very small item importance scores following the importance of 2014 Indian general election in Tamil Nadu (Q15894105) in this highly specific context. Indeed, the conclusion is the following: (a) there is a highly specific context describing the dominant usage of one single item from the Event category on tawiki, and (b) the rest of the information in this context can be treated as a statistical artifact with not too much importance in the interpretation of the finding.
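The point that every item receives some probability in every topic can be illustrated with a toy calculation. This is only a sketch: it assumes the textbook smoothed estimate used with LDA, in which a symmetric Dirichlet prior `beta` is added to each item's count within a topic, and it does not reproduce the actual WDCM estimation procedure. The count of 9950 is taken from the example above; all other counts, the item labels "Giro item A" and "Giro item B", and the value of `beta` are hypothetical.

```python
def topic_item_probabilities(counts, beta=0.01):
    """Smoothed topic-item distribution: (n_i + beta) / (N + V * beta).

    counts maps item -> count attributed to one topic; beta is an assumed
    symmetric Dirichlet prior. Because beta > 0, no item ever gets
    probability zero, however rarely it is used in this context.
    """
    total = sum(counts.values()) + beta * len(counts)
    return {item: (n + beta) / total for item, n in counts.items()}


# One very dominant item followed by a tail of barely used items.
counts = {
    "Q15894105": 9950,   # 2014 Indian general election in Tamil Nadu (from the text)
    "Q37643": 3,         # Gulf War (hypothetical count)
    "Giro item A": 2,    # hypothetical Giro d'Italia item and count
    "Giro item B": 0,    # hypothetical item never observed in this topic
}

probs = topic_item_probabilities(counts)
ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)

for item, p in ranked:
    print(f"{item:12s} {p:.6f}")
```

The dominant item takes almost all of the probability mass, while the remaining items still appear in the ranking with tiny but non-zero scores. Whichever of them happen to come next in the ordering say little about the context itself.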

Complicated? Well, Wikidata usage in itself is a behavioral phenomenon of immense complexity. The WDCM can help you reduce that complexity a bit and navigate through it, but it won't do the research and thinking part on your behalf. Do not expect that this system will explain the patterns of Wikidata usage in any way. It was built as a methodological tool, a measurement instrument, a portal to access the data and categorize them in the statistically most convenient way before they are presented to you. The Hubble Space Telescope (Q2513) helps us to observe the Universe, but the results that we obtain from its observations undergo careful and painful processing and discussions on the part of the scientific community in order to build theories and hypotheses about the physical world. Ask yourself: what is more complicated, the physical universe, guided by the laws of physics, or the semantic universe, guided by the interaction of billions of human beings online with all their different cultural backgrounds, education, cognitive styles, information that they can access, points of view, and interests? The WDCM can make observations of this immense complexity and provide some means to help you reduce it to a (hopefully) manageable proportion. However, you still need to view it and use it as an instrument only while doing the research part on your own. We hope that this call is challenging enough.

To summarize: the system will produce a finding based on whatever data on Wikidata usage it has, and you have to inspect the results carefully to understand whether they make sense, how specific they are, or if they simply present a "statistical artifact". The specifics of particular projects, as illustrated in the previous WDCM Example, do not end here. For the same tawiki project, go visit the WDCM Semantics Dashboard, and select Semantic Category: Human and Topic: Topic 8. You will discover another semantic topic that is of practical importance for the understanding of tawiki only. The moral of the story: Wikidata usage is not about what you expect that the editors will do from the perspective of your own conceptual organization of the Universe, but about what different individuals and communities do with Wikidata across the Wikimedia projects. The WDCM can bring you back many interesting results on the latter, and very little on the former - only up to some degree of match between the mind of a semanticist or a formal ontologist and what people actually do with Wikidata out there.

  • The characteristics that shape how our editors and communities make use of Wikidata. Obviously, this is what the game is all about. It is probably impossible to list all the factors that influence the patterns of Wikidata usage across some 800 projects, including some among the most dynamic places online at all.
    • Historical influences, like whether a particular event has shaped the culture or the educational system of a particular socio-linguistic community (that dominantly manages one or more projects) to conceptually organize their knowledge in a specific form, which is then reflected in a specific pattern of Wikidata usage.
    • The interests and the motivations of a particular editor, of course: if there is an editor whose interests in golf and Italian literature are consistent in time, and given that the editor shows a certain degree of persistence in their usage of Wikidata, there's no end to what they can do, including such changes in the pattern of Wikidata usage that will reflect in the discovery of new semantic topics (i.e. contexts of Wikidata usage).
    • Access to knowledge and local/cultural variations in the organization of knowledge: community A thinks that all real-world phenomena x of some class X should be related to certain Wikidata items, while community B is equally persistent in making links towards another set of Wikidata items while writing about the same phenomena. As a result, such communities' Wikidata usage patterns can lead WDCM to discover semantic contexts that cannot be interpreted in a straightforward manner, but must be construed as a mixture of two, potentially opposing, interpretations of some particular domain of knowledge.

    • Automated inputs: again, there's no end to what a good bot can do; its pattern of Wikidata usage will reflect the structure of the underlying algorithm, which will in turn reflect the knowledge and beliefs of its authors, a particularly tricky situation to investigate.
    • Access, ease of use, and the availability of sources: if a community A, working in a particular language or a set of languages, has access to sources in some other particular language L, it will probably use Wikidata in a way that reflects this fact, even though its knowledge might not match completely with what is present in the sources available to them. This list is certainly not exhaustive.
  • The definition of key WDCM statistics. The current Wikidata item usage statistic definition is the count of the number of distinct pages in a particular MediaWiki project where the respective Wikidata item is used. This definition is motivated by the currently present constraints in Wikidata usage tracking across the client projects (see Wikibase/Schema/wbc entity usage). With more mature Wikidata usage tracking systems, the definition will become subject to change. However, as the definition of the key metric changes, the results of the WDCM statistical learning procedures will necessarily change too.
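The usage statistic defined above, and the project-item usage frequency matrix mentioned under the core algorithm, can be sketched in a few lines. This is an illustration under stated assumptions, not the WDCM Engine code: the rows are hypothetical, their layout (project, entity ID, usage aspect, page ID) is only loosely modeled on the wbc_entity_usage table, and the function names are made up for this example.

```python
from collections import defaultdict

# Hypothetical usage-tracking rows: (project, entity_id, aspect, page_id).
rows = [
    ("tawiki", "Q15894105", "S", 101),
    ("tawiki", "Q15894105", "L", 101),  # same page, other aspect: counted once
    ("tawiki", "Q15894105", "S", 102),
    ("tawiki", "Q5", "X", 101),
    ("itwiki", "Q33861", "S", 7),
    ("itwiki", "Q33861", "T", 8),
    ("itwiki", "Q15894105", "L", 9),
]


def usage_statistics(rows):
    """Count distinct pages per (project, item), ignoring the usage aspect."""
    pages = defaultdict(set)
    for project, entity, _aspect, page in rows:
        pages[(project, entity)].add(page)
    return {key: len(page_set) for key, page_set in pages.items()}


def project_item_matrix(stats):
    """Arrange the statistics as a projects x items matrix of counts."""
    projects = sorted({project for project, _ in stats})
    items = sorted({item for _, item in stats})
    matrix = [[stats.get((p, i), 0) for i in items] for p in projects]
    return projects, items, matrix


stats = usage_statistics(rows)
projects, items, matrix = project_item_matrix(stats)
```

Each row of `matrix` plays the role that a document's term counts play in a classic topic model: a project is described by how many of its pages use each item.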

The WDCM system


The WDCM system encompasses two components, of which the second one is meant for its users to interact with: (1) The WDCM Engine, which encompasses a set of R/HiveQL/SQL scripts that collect the data while providing ETL and machine learning until they are ready to feed the WDCM Dashboards databases, and (2) The WDCM Dashboards, a set of (hopefully) user-friendly dashboards where data and the results of their statistical modeling can be visualized and downloaded. This page is about the second (2) component of the system. If you are interested in learning about the WDCM Engine, the Wikitech page should be telling enough.

The WDCM system is developed by Goran S. Milovanović, Data Scientist, Wikimedia Deutschland, with the help of many people who prepared complex ETL procedures and productionized the system, such as Dan Florin Andreescu, Software Engineer, Wikimedia Foundation, and Adam Shorland, Software Developer, Wikimedia Deutschland. Lydia Pintscher, Product Manager of Wikidata, Wikimedia Deutschland, supervised the development of the system and contributed the currently used WDCM Semantic Taxonomy that the system relies on. The software development of the WDCM system is supervised by Tobias Gritschacher, Engineering Manager, Wikimedia Deutschland, while Jan Dittrich, UX Design / Research, Wikimedia Deutschland, supervises the UI/UX aspects. The write-ups by Mikhail Popov and the team that built our Discovery Dashboards on their previous experience in managing Shiny Dashboards were very helpful in the development of the WDCM Dashboards. Enlightening discussions with Aaron Halfaker, Research Scientist, Wikimedia Foundation, and his team were, of course, helpful as well.

In order to be able to use the WDCM system in the way it was meant and designed to be used, i.e. with a clear understanding of what it is built for and why it was built that way, you probably need to learn about some important WDCM definitions (and the constraints that dictated them) first. You can do that by reading through the Definitions section of the WDCM Wikitech Technical Documentation. Do not panic, please: it is written in a language that a non-technical person who does not necessarily care about Data or Cognitive Science can understand.

Obviously, the current version of the WDCM system focuses on Wikidata item usage. The current version of the system neither tracks nor analyzes the usage of properties, qualifiers, etc.

Any ideas and contributions are, of course, welcome. If you have anything on your mind please visit the talk page and edit.

The WDCM Definitions


The following terms are used frequently on the WDCM Dashboards and have a specific meaning in the context of this system:

  • The current Wikidata item usage statistic definition is the count of the number of pages in a particular client project where the respective Wikidata item is used. Thus, the current definition ignores the usage aspects (L, S, X, O, T) completely. This definition is motivated by the currently present constraints in Wikidata usage tracking across the client projects (see Wikibase/Schema/wbc entity usage). With more mature Wikidata usage tracking systems, the definition will become a subject of change.
  • The term Wikidata usage volume is reserved for total Wikidata usage (i.e. the sum of usage statistics) in a particular client project, group of client projects, or semantic categories.
  • By a Wikidata semantic category we mean a selection of Wikidata items that is operationally defined by a respective SPARQL query returning a selection of items that intuitively match a human, natural semantic category. The structure of Wikidata does not necessarily match any intuitive human semantics. In WDCM, an effort is made to select the semantic categories so as to match the intuitive, everyday semantics as much as possible, in order to assist anyone involved in analytical work with this system. However, the choice of semantic categories in WDCM is not necessarily exhaustive (i.e. they do not necessarily cover all Wikidata items), nor are the categories necessarily mutually exclusive. The Wikidata ontology is very complex and a product of the work of many people, so there is an optimization price to be paid in every attempt to adapt or simplify its present structure to the needs of a statistical analytical system such as WDCM. The current set of WDCM semantic categories is thus not normative in any sense and can become subject to change at any moment, depending upon the analytical needs of the community. The currently used WDCM Taxonomy of Wikidata items encompasses the following 14 semantic categories: geographical feature (Q618123), organization (Q43229), architectural structure (Q811979), human (Q5), Wikimedia Internal which encompasses Wikimedia category (Q4167836), Wikimedia disambiguation page (Q4167410), and Wikimedia template (Q11266439), work of art (Q838948), book (Q571), gene (Q7187), scholarly article (Q13442814), Chemical Entities that encompass chemical element (Q11344), chemical compound (Q11173), and chemical substance (Q79529), astronomical object (Q6999), thoroughfare (Q83620), event (Q1656682), and taxon (Q16521). All respective SPARQL queries used to fetch the item IDs from Wikidata in the respective categories have the same form: wdt:P31/wdt:P279*. In other words, they look for all the instances of a particular class of items, and search the whole data structure through sub-class relations until the most abstract, target level of categorization is reached.
WDCM Overview Dashboard: the 14 semantic categories of Wikidata items that are encompassed by the current version of the WDCM Taxonomy. Each bubble represents a Wikidata semantic category. These categories represent one possible way of categorizing the Wikidata items. The size of the bubble reflects the volume of Wikidata usage from the respective category. If two categories are found in proximity, that means that the projects that tend to use the one also tend to use the other, and vice versa.
  • By project type we mean: Wikipedia, Commons, Wikivoyage, Wiktionary, Wikiquote, etc.
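The common query form mentioned above, wdt:P31/wdt:P279*, can be illustrated with a minimal Python sketch. The helper name category_query, the SELECT ?item wrapper, and the optional LIMIT clause are assumptions made for illustration only; the queries actually used by WDCM may differ in detail.

```python
def category_query(class_qid, limit=None):
    """Build a SPARQL query for all items that are instances (P31) of the
    given class, or of any class reachable from it via subclass-of (P279*)."""
    if not (class_qid.startswith("Q") and class_qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata item ID: {class_qid!r}")
    # Doubled braces produce literal braces inside an f-string.
    query = f"SELECT ?item WHERE {{ ?item wdt:P31/wdt:P279* wd:{class_qid} . }}"
    if limit is not None:
        query += f" LIMIT {int(limit)}"
    return query


# Query for the semantic category human (Q5).
print(category_query("Q5"))
```

The property path P31/P279* is what lets a single query reach items whose class is only indirectly a sub-class of the target category.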

WDCM Use cases


While the Overview Dashboard presents - as the name suggests - only the most robust, top-level patterns of Wikidata usage, and is meant as a sort of "big picture" presentation of current Wikidata usage, the Usage and the Semantics Dashboards are built with the needs of a particular user in mind, one who is interested in some specific semantic categories and projects. The WDCM Items Dashboard is a planned component of the system that will enable the user to access the statistics and structural properties of Wikidata usage for particular items. The following two use cases illustrate the ways in which WDCM could be used to learn about some specifics of Wikidata usage from the viewpoint of a fictional but motivated user. Both use cases rely on the functions enabled by the Usage and the Semantics Dashboards. All WDCM dashboards have a Navigate WDCM tab, from which any component of the system can be reached. They also all have a Description tab, where a detailed explanation of the dashboard's functionality is found.

Use Case A: Compare large encyclopedias


In this use case, we want to compare the English, French, German, and Russian Wikipedias in respect to their Wikidata usage. We divide the whole journey into several steps and analyses across the WDCM Dashboards.

WDCM Example 5, Step 1. Our first destination is the WDCM Usage Dashboard. On the dashboard's landing page (the Usage Tab under the Dashboard Tab), the right column is dedicated to the study of particular projects. Under Project Report, select enwiki in the Search projects: field. The dashboard will start generating the results; please be patient. Now we can scroll down and inspect, one by one, the results reported on Wikidata usage in the English Wikipedia. The first generated report gives us an overview of Wikidata usage in the particular project. The bar plot to the right represents the volume of Wikidata usage in each of the semantic categories that are currently encompassed by the WDCM analyses. We can see that the items from the categories of Geographical Object, Human, Organization, Taxon, Work of Art, and Wikimedia are predominantly used in the English Wikipedia. The summary text to the left of the bar plot says: "enwiki has a total Wikidata usage volume of 6335820 items (4.4% of total Wikidata usage across the client projects). In terms of Wikidata usage, it is ranked 5/789 among all client projects, and 4/301 in its Project Type (Wikipedia)." Let's see what it means to have a Wikidata usage volume of 6335820 items in the context of WDCM. The current definition of the WDCM item usage statistic is the following: the count of the number of pages in a particular client project where the respective Wikidata item is used. That means that WDCM has counted, for each item of interest, the number of distinct pages in this project that use (one or many times) a particular item, and then summed up these numbers to obtain 6,335,820.
This sum accounts for approximately 4.4% of the total sum that one would obtain from all Wikimedia projects under consideration, and makes the English Wikipedia the fifth most prominent project in terms of Wikidata usage among all Wikimedia projects, as well as fourth among 301 Wikipedia projects - where Wikipedia is its project type, of course. The chart immediately below provides context for this ranking of the project under discussion. By repeating this for the French, Russian, and German encyclopedias, we discover that the French Wikipedia accounts for some 2.69% of total usage and is ranked sixth among all the Wikimedia projects, the German Wikipedia accounts for 1.85% of total usage volume and is ranked twelfth, while the Russian Wikipedia accounts for 6.26% and is ranked third.
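The definition of the usage statistic given above (distinct pages per item, summed over items) can be sketched in a few lines of Python. The input format, a list of (page, item) pairs, and the function name usage_volume are assumptions for illustration; the toy rows are invented and do not come from any real project.

```python
from collections import defaultdict


def usage_volume(usage_rows):
    """usage_rows: iterable of (page_id, item_id) pairs, one per recorded use.

    Returns (per-item usage, total usage volume), where per-item usage is
    the number of distinct pages that use the item at least once.
    """
    pages_per_item = defaultdict(set)
    for page_id, item_id in usage_rows:
        # A set keeps each page once, so repeated use on one page counts once.
        pages_per_item[item_id].add(page_id)
    item_usage = {item: len(pages) for item, pages in pages_per_item.items()}
    return item_usage, sum(item_usage.values())


# Toy data: page 1 uses Q30 twice, which still counts as one page for Q30.
rows = [(1, "Q30"), (1, "Q30"), (2, "Q30"), (2, "Q5"), (3, "Q5"), (3, "Q183")]
item_usage, volume = usage_volume(rows)
print(item_usage)  # {'Q30': 2, 'Q5': 2, 'Q183': 1}
print(volume)      # 5
```

A project's percentage share is then its volume divided by the sum of the volumes of all client projects under consideration.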
WDCM Usage Dashboard: Wikidata usage pattern across 14 semantic categories for German Wikipedia (dewiki)

In this first step we have obtained elementary information and rankings of four projects from the WDCM. A careful analyst might have spotted additional important differences between these four projects by inspecting the first bar plot, where semantic categories of Wikidata items are compared for their usage. For example, the Russian Wikipedia tends to use more items from the Architectural Structure category than the English Wikipedia, and far fewer items from the Wikimedia category. While the English Wikipedia uses more items from the category Human than from Geographical Objects or Organizations (similar to its German counterpart, dewiki), its French sister shows exactly the opposite pattern of usage.

WDCM Usage Dashboard: the ranking of English Wikipedia in respect to Wikidata usage.

An interested analyst might already have discovered two additional visualizations in the right column of the dashboard page: the Semantic Neighborhood interactive network and the top 30 Wikidata items plot. While the former will be discussed at a later point, the latter is rather straightforward: it reports the 30 most frequently used Wikidata items for the currently selected project. In the English Wikipedia, the top 5 are: United States of America (Q30), house mouse (Q83310), brown rat (Q184224), Danio rerio (Q169444), and Drosophila melanogaster (Q130888) - a Wikidata user community with a huge interest in biology indeed - while in the German Wikipedia we find: United States of America (Q30), Germany (Q183), United Kingdom (Q145), France (Q142), and black-and-white (Q838368) (monochrome form in visual arts).

WDCM Usage Dashboard: the most popular Wikidata items on French Wikipedia

However, comparing the projects in this way on the Usage Dashboard is tiresome. Let's find out whether the WDCM Usage Dashboard can provide better means for comparing across projects.

WDCM Example 5, Step 2. We are at the WDCM Usage Dashboard again, but this time we visit the Tabs/Crosstabs Tab and carefully read the introductory instructions for its usage (provided at the top of the Tabs/Crosstabs dashboard tab). In the Search Projects: field we enter ruwiki, enwiki, dewiki, and frwiki, and select all categories in the Select categories: field; click Apply Selection. After some computation and plot rendering, the Dashboard will provide a new set of charts. The first two provide the total Wikidata usage volume per project and per selected category. The third one, immediately below, is the least interesting: we have selected four projects from the same project type (Wikipedia), so all we can learn from it is that the total Wikidata usage volume in the current project selection is around 21.9 million distinct item/page combinations. However, the next chart, the large Project x Category cross-tabulation chart, is very informative: it provides an overview of Wikidata usage volume (y-axis) for each project (x-axis) in each semantic category (every sub-panel represents one category).
WDCM Usage Dashboard: Project x Semantic Category Cross-Tabulation
WDCM Example 5, Step 2 (continued). We could have just selected _Wikipedia (note: mind the underscore) in the Select projects: field to retrieve the complete Wikidata usage statistics for all projects of the project type Wikipedia. Since the number of selected projects here is high, the WDCM Usage Dashboard will visualize only the results for the top 30 projects in respect to the total Wikidata usage per project. However, each chart on the Tabs/Cross-tabs tab is accompanied by a Data (csv) button: click the button to download the full dataset for the selection irrespective of what is visualized. So, if you are interested in the big picture of Wikidata usage across the Wikipedia projects that make the most use of it, here it is:
WDCM Usage Dashboard: Wikidata usage in 14 semantic categories for the top 30 Wikipedia projects in respect to their total volume of Wikidata usage.

We have now learned how to begin to work with the WDCM Usage Dashboard and compare projects in respect to their total volume of Wikidata usage, or their Wikidata usage in particular semantic categories. We have not yet said anything about the semantic contexts of Wikidata usage that were discussed in the introductory examples on this page. Let's see.

Use Case B: Connect the communities


The following example focuses on the Semantics Dashboard, the main WDCM tool to study the contexts of Wikidata usage in various semantic categories. We will have to dive a bit deeper into the underlying logic of the WDCM in order to understand the way it discovers the semantic context of Wikidata usage.

WDCM Example 6, Step 1. Visit the WDCM Semantics Dashboard, and navigate to the Similarity Maps Tab under Dashboard. In the field Select semantic category pick Human. An interactive plot will be generated, presenting a semantic map. Each bubble in the map represents one Wikidata client project (i.e. one Wikimedia project). Projects of different types (Wikipedia, Commons, Wiktionary, Wikiquote, etc.) have different colors, with the color legend provided to the right of the map, alongside the tools to interact with it. Hovering over any bubble will reveal the respective project name and Wikidata usage details.
WDCM Semantics Dashboard: the semantic map of the category Human (Q5). Each bubble represents a project. The closer the two projects are found, the more similarly they tend to use the items from this semantic category. The size of the bubble represents the total volume of Wikidata usage in the respective project.

The semantic map that we have just generated from the WDCM Semantics Dashboard serves to inspect the similarity structures in respect to Wikidata usage in particular categories. The similarity of projects in such maps is represented by distance in a 2D plane: the closer the projects are to each other, the more similar their usage of items from a particular category (human (Q5), in this case). In order to understand what the WDCM Semantics Dashboard does, we need to provide at least a quick insight into the inner workings of the WDCM system.

How does the WDCM discover the similarity structures in Wikidata usage? As already explained, the elementary Wikidata item usage statistic in WDCM is the count of distinct pages in a particular project that make use of the respective item (once or more than once in a page). It follows that the pattern of Wikidata usage in a particular project can be described by an array of numbers (a vector), with each number representing the usage count for a particular item. In WDCM, each of the considered semantic categories is analyzed separately. We first select only the items from a particular category (for example human (Q5), represented on the map above), then produce their usage counts for each Wikimedia project under consideration, and obtain a matrix in which the rows are indexed by Wikimedia projects (i.e. each row represents one project), and the columns by the Wikidata items from the selected semantic category (i.e. each column represents an item). The cells of the matrix are filled with item usage counts. Such matrices can be modeled by Latent Dirichlet Allocation (LDA), a standard unsupervised learning algorithm in text mining and Natural Language Processing, which essentially amounts to the following:

  • Assume that the matrix of counts is produced in the following way:
    • there is a set of semantic topics, each topic representing the probabilities by which the considered Wikidata items can be used when the respective topic is used itself;
    • each project is represented by a mixture of all semantic topics, i.e. each project is characterized by the importance that each of the hypothesized semantic topics has in it (note: technically speaking, a project is thus described by a probability distribution of semantic topics; in turn, each semantic topic is a probability distribution over the items);
    • the hypothesized process that generates Wikidata usage in a particular project is the following: (1) randomly pick a semantic topic (say, Celebrities, from human (Q5)) according to the probability of the respective topic being selected in a given project, (2) from the selected semantic topic, randomly pick an item, according to the probability of the respective item to be selected in that semantic topic, (3) "use" the item in the project under consideration.
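The three-step generative process described in the list above can be simulated directly. The following Python sketch is only an illustration of that hypothesized process, not of the LDA estimation itself; the function name simulate_usage, the topic names, the item labels, and all probabilities are invented for the example.

```python
import random


def simulate_usage(project_topics, topic_items, n_uses, seed=0):
    """Simulate the hypothesized generative process for one project.

    project_topics: {topic: probability of the topic in this project}
    topic_items:    {topic: {item: probability of the item in that topic}}
    Returns {item: simulated usage count}.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    topics = list(project_topics)
    topic_probs = [project_topics[t] for t in topics]
    counts = {}
    for _ in range(n_uses):
        # (1) pick a semantic topic according to its weight in the project
        topic = rng.choices(topics, weights=topic_probs)[0]
        # (2) pick an item according to its probability in that topic
        items = list(topic_items[topic])
        item_probs = [topic_items[topic][i] for i in items]
        item = rng.choices(items, weights=item_probs)[0]
        # (3) "use" the item in the project
        counts[item] = counts.get(item, 0) + 1
    return counts


# Hypothetical topics and item labels, for illustration only.
topic_items = {
    "Cyclists": {"cyclist_A": 0.6, "cyclist_B": 0.4},
    "Composers": {"composer_A": 0.9, "cyclist_A": 0.1},
}
project_topics = {"Cyclists": 0.8, "Composers": 0.2}
counts = simulate_usage(project_topics, topic_items, n_uses=1000)
print(counts)
```

Note that the same item can receive probability from more than one topic, which is why a topic is described as a distribution over items rather than a disjoint group of items.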

What the LDA algorithm attempts is to reverse engineer this hypothesized generative process that populates the projects x items matrix for a given number of semantic topics. WDCM runs LDA many times for each semantic category, inspecting solutions across a wide range of semantic topics, until it finds the most satisfactory one according to some quite complex criteria of statistical learning (see: Bayes factor). Once the optimal solution is selected, the algorithm returns two matrices that we use in all further WDCM analyses and visualizations:

  • the semantic topic x project matrix, in which each semantic topic has a weight (i.e. a probability) in each project, and
  • the item x semantic topic matrix, in which each Wikidata item has a weight (i.e. a probability) in each semantic topic.

Now each semantic topic is represented by a vector of probabilities (i.e. a probability distribution) across all present items from a particular Wikidata semantic category - representing how likely it is that a particular item will be used in a project when that semantic topic is active in the hypothesized generative process - and each project by a vector of probabilities (i.e. a probability distribution) across the semantic topics. Caveat: WDCM does not model all items from any of the Wikidata semantic categories under consideration. We select a large number of the most frequently used items from a category simply because modeling items that are rarely used would not improve the quality of the LDA solutions in any respect.

Given that semantic topics and projects are represented by probability distributions, we can apply distance or divergence measures to them (such as the Hellinger distance, or the Kullback–Leibler divergence), providing the basis for their visualizations. However, the obtained metric hyperspaces first need to undergo dimensionality reduction in order to be represented in a 2D or 3D space. The above semantic map, for example, is obtained from the t-distributed stochastic neighbor embedding (t-SNE) dimensionality reduction to 2D from the projects x topics hyperspace in the Wikidata category of human (Q5), following the LDA modeling of the category as described. This dimensionality reduction method is very good at conserving the local similarity structures found in the original hyperspaces, and we can see how bubbles representing Wikimedia projects tend to cluster according to how similarly they use Wikidata items in this semantic category.
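The two measures named above can be written out for discrete probability distributions in a few lines of Python. This is a sketch using the standard textbook definitions; the topic distributions for the three projects are invented, and the t-SNE step that would follow is not shown.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 to 1)."""
    return math.sqrt(
        sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    ) / math.sqrt(2)


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q), in nats.

    Terms with p_i == 0 contribute nothing; q_i must be positive
    wherever p_i is positive.
    """
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


# Invented topic distributions for three projects (each sums to 1).
topic_dist = {
    "project_A": [0.70, 0.20, 0.10],
    "project_B": [0.65, 0.25, 0.10],
    "project_C": [0.10, 0.10, 0.80],
}

names = sorted(topic_dist)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        d = hellinger(topic_dist[a], topic_dist[b])
        print(f"{a} vs {b}: Hellinger = {d:.3f}")
```

The Hellinger distance is symmetric and bounded, whereas the Kullback–Leibler divergence is not symmetric, which matters when a pairwise distance matrix is passed on to a dimensionality reduction method.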

WDCM Example 6, Step 2. Use the zoom tool from the toolkit found to the right of the semantic map of the category Human, and select the cluster of projects found in the upper left corner of the map.
WDCM Semantics Dashboard: a close-up of a cluster of projects grouped in respect to their similarity of Wikidata item usage from Q5 (Human).

By inspecting the projects in this cluster, we find itwiki, nlwiki, plwiki, enwiki, and ptwiki, among others.

WDCM Example 6, Step 3. Change the dashboard tab to Project Semantics and select itwiki, nlwiki, plwiki, enwiki, and ptwiki; hit Apply Selection and wait until the plot updates. In the resulting chart, find the panel that represents the semantic category Human. You should be able to learn from there that these projects receive significant influence from topics 1, 2, 3, and 8, out of the total of eight topics in this category of Wikidata items. Topic 3 seems to be the most influential topic in this selection of projects.
WDCM Semantics Dashboard: the relative contribution of semantic topics from 14 LDA models (one model for each of the 14 semantic categories currently encompassed by the WDCM) to Wikidata usage in enwiki, plwiki, ptwiki, itwiki, and nlwiki.

Note: the LDA models for particular semantic categories do not necessarily encompass the same number of semantic topics; on the contrary, it would be a coincidence. The x-axis on this plot always represents the number of topics found in the LDA model of the semantic category that encompasses the highest number of topics, for reasons of consistent data visualization only. For example, we can see that the LDA model of the category Wikimedia encompasses only four topics, in contrast to the LDA model of the category Geographical Object that encompasses ten.

WDCM Example 6, Step 4. Now we know that the most important semantic topic in the category Human for enwiki, plwiki, itwiki, ptwiki, and nlwiki is Topic 3 (43.56% relative importance). Change the dashboard tab to Semantic models, select Human in the Select Semantic Category field, and Topic 3 in the Select Semantic Topic field. The first chart on the dashboard page will provide an insight into the 50 most important Wikidata items in this topic of the category Human. The second visualization (scroll down) is an interactive semantic network: each of these top 50 items is represented by a node and points towards the Wikidata item that is most similarly used to it in the respective category of items. The semantic network can help in the interpretation of the topic under consideration. Again: the similarity derived from the WDCM is the similarity in respect to the pattern of Wikidata usage, not necessarily in respect to your expectations based on the meaning of the respective Wikidata items.
WDCM Semantics Dashboard: the semantic network of top 50 items from Topic 3 of the LDA topic model in the category Human (Q5).

Interesting things happen in Topic 3: a lot of professional cyclists, then Johann Sebastian Bach (Q1339), and Nikolai Chernykh (Q318611) - a Soviet and Ukrainian astronomer - among others. This is another opportunity to remind ourselves how complex the pragmatics of Wikidata are: neither errors on behalf of Wikipedia editors and Wikidata users nor the shadow of doubt over WDCM being broken explain the semantics of Topic 3 in the category Human. The only plausible hypothesis is that the community of editors interested in cycling is doing a great job on several projects - in fact, they might represent exactly a group of editors from whom other Wikidata users could potentially learn a lot.

WDCM Example 6, Step 5. Finally, the last plot on the dashboard page represents 50 Wikimedia projects that are most heavily influenced by this topic in the semantic category Human (Q5).
WDCM Semantics Dashboard: the 50 most prominent Wikimedia projects in Topic 3 of the LDA model of Human (Q5).

And now we have learned something about a specific tendency of Wikidata usage in the category human (Q5) for five Wikimedia projects (itwiki, plwiki, ptwiki, nlwiki, enwiki) from the WDCM Semantics Dashboard.

Q: What do we do with these findings?

A: What WDCM is meant for:

  • We have just identified a group of projects where the editors seem to share similar interests in a particular category;
  • why not ask whether these editors are connected, and whether it would be possible for them to cooperate and share their knowledge and experiences;
  • we have also identified a group of projects where this same semantic context is important, beyond the five projects we were initially interested in;
  • ask how much they use Wikidata in the respective semantic category, and connect the editors from underdeveloped and more developed projects to cooperate and learn;
  • focus on solving the Wikidata cycling conspiracy.

WDCM Dashboards


This section provides a concise description of all WDCM Dashboards that are currently online. The same information can be found on the Description tab of every respective dashboard.

WDCM Overview Dashboard




The WDCM Overview Dashboard presents the big picture of Wikidata usage; other WDCM dashboards go into more detail. The Overview Dashboard provides insights into (1) the similarities between the client projects in respect to their use of Wikidata, (2) the volume of Wikidata usage in every client project, (3) Wikidata usage tendencies, described by the volume of Wikidata usage in each of the semantic categories of items that are encompassed by the current WDCM edition, (4) the similarities between the Wikidata semantic categories of items in respect to their usage across the client projects, (5) the ranking of client projects in respect to their Wikidata usage volume, and (6) the Wikidata usage breakdown across the types of client projects and Wikidata semantic categories.

Wikidata Usage Overview


The similarity structure in Wikidata usage across the client projects is presented. Each bubble represents a client project. The size of the bubble reflects the volume of Wikidata usage in the respective project. Projects similar in respect to the semantics of Wikidata usage are grouped together.

The bubble chart is produced by performing a t-SNE dimensionality reduction of the client project pairwise Euclidean distances derived from the Projects x Categories contingency table. Given that the original higher-dimensional space from which the 2D map is derived is rather constrained by the choice of a small number of semantic categories, the similarity mapping is somewhat imprecise and should be taken as an attempt at an approximate big picture of the client projects similarity structure only. More precise 2D maps of the similarity structures in client projects are found on the WDCM Semantics Dashboard, where each semantic category first receives an LDA Topic Model, and the similarity structure between the client projects is then derived from project topical distributions.
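The first step of this procedure, pairwise Euclidean distances between the rows of a Projects x Categories table, can be sketched as follows. The function name pairwise_euclidean and the toy counts are assumptions for illustration; the subsequent t-SNE step is not shown.

```python
import math


def pairwise_euclidean(table):
    """table: {project: [usage volume per semantic category]}.

    Returns {(project_a, project_b): Euclidean distance}, with each
    unordered pair listed once in alphabetical order.
    """
    names = sorted(table)
    distances = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            distances[(a, b)] = math.sqrt(
                sum((x - y) ** 2 for x, y in zip(table[a], table[b]))
            )
    return distances


# Toy contingency table: three projects, two semantic categories.
table = {"proj_a": [0, 0], "proj_b": [3, 4], "proj_c": [3, 0]}
for pair, d in pairwise_euclidean(table).items():
    print(pair, d)
```

Applying the same function to the transposed table (categories as keys, one value per project) gives the pairwise distances between semantic categories instead.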

While the Explore tab presents a dynamic {Rbokeh} visualization alongside the tools to explore it in detail, the Highlights tab shows a static {ggplot2} plot with the most important client projects marked (NOTE. Only the top five projects (of each project type) in respect to Wikidata usage volume are labeled).

Wikidata Usage Tendency


The similarity structure in Wikidata usage across the semantic categories is presented. Each bubble represents a Wikidata semantic category. The size of the bubble reflects the volume of Wikidata usage from the respective category. If two categories are found in proximity, that means that the projects that tend to use the one also tend to use the other, and vice versa. Similarly to the Usage Overview, the 2D mapping is obtained by performing a t-SNE dimensionality reduction of the pairwise category Euclidean distances derived from the Projects x Categories contingency table.

Wikidata Usage Distribution


The plots help to build an understanding of the relative range of Wikidata usage across the client projects. In the Project Usage Rank-Frequency plot, each point represents a client project; Wikidata usage is represented on the vertical axis and the project usage rank on the horizontal axis, while only the top projects (per project type) are labeled. The highly skewed, asymmetrical distribution reveals that only a small fraction of client projects accounts for a huge proportion of Wikidata usage.

In the Project Usage log(Rank)-log(Frequency) plot, the logarithms of both variables are represented. A power-law relationship holds true if this plot is linear. The plot includes the best linear fit; however, no attempts to estimate the underlying probability distribution were made.
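A linear fit in log-log coordinates can be sketched with ordinary least squares. The function name loglog_fit is an assumption for illustration, and the input is a synthetic, exact power law rather than real usage data, so the fitted slope recovers the exponent.

```python
import math


def loglog_fit(usages):
    """Least-squares line through (log rank, log usage).

    usages: positive usage volumes, one per project (at least two).
    Returns (slope, intercept) of log(usage) = slope * log(rank) + intercept.
    """
    ranked = sorted(usages, reverse=True)  # rank 1 = highest usage
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(u) for u in ranked]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic power law: usage = 1000 / rank, i.e. exponent -1.
synthetic = [1000 / rank for rank in range(1, 51)]
slope, intercept = loglog_fit(synthetic)
print(round(slope, 6), round(math.exp(intercept), 3))
```

As the text above notes, a good linear fit here is only suggestive; it is not an estimate of the underlying probability distribution.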

Client Project Types


Project types are provided in the rows of this chart, while the semantic categories are given on the horizontal axis. The height of the respective bar indicates Wikidata usage volume from the respective semantic category in a particular client project.

Client Projects Usage Volume


Use the slider to select the percentile rank range of the Wikidata usage volume distribution across the client projects to show. The chart will automatically adjust to present the selected projects in increasing order of Wikidata usage, showing at most the top 30 projects from the selection. NOTE. The percentile rank of a score is the percentage of scores in its frequency distribution that are equal to or lower than it. For example, a client project that has a Wikidata usage volume greater than or equal to that of 75% of all client projects under consideration is said to be at the 75th percentile, where 75 is the percentile rank.

In effect, you can browse the whole distribution of Wikidata usage across the client projects by selecting the lower and upper limits in terms of usage percentile rank.
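The percentile rank definition given in the note above, and the selection it drives, can be sketched as follows. The function names and the toy usage volumes are assumptions for illustration; the dashboard's own implementation may handle ties and limits differently.

```python
def percentile_rank(score, scores):
    """Percentage of scores that are equal to or lower than the given score."""
    scores = list(scores)
    return 100.0 * sum(1 for s in scores if s <= score) / len(scores)


def select_by_percentile(usage, lower, upper, max_shown=30):
    """usage: {project: usage volume}.

    Returns the projects whose percentile rank lies in [lower, upper],
    in increasing order of usage, keeping at most the top max_shown.
    """
    values = list(usage.values())
    selected = [
        project for project, volume in usage.items()
        if lower <= percentile_rank(volume, values) <= upper
    ]
    selected.sort(key=usage.get)
    return selected[-max_shown:]


# Toy usage volumes for four hypothetical projects.
usage = {"proj_a": 10, "proj_b": 20, "proj_c": 30, "proj_d": 40}
print(percentile_rank(30, usage.values()))   # 75.0
print(select_by_percentile(usage, 50, 100))  # ['proj_b', 'proj_c', 'proj_d']
```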

Wikidata Usage Browser


A breakdown of Wikidata usage statistics across client projects and semantic categories. To the left, a table presents a Client Project vs. Semantic Category cross-tabulation. The Usage column in this table is the Wikidata usage statistic for a particular Semantic Category x Client Project combination (e.g. the Wikidata usage in the category "Human" in the dewiki project). To the right, the total Wikidata usage per client project is presented (i.e. the sum of Wikidata usage across all semantic categories for a particular client project; e.g. the total Wikidata usage volume of enwiki).
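The relationship between the two tables described above (a project x category cross-tabulation and the per-project totals derived from it) can be sketched in Python. The record format, the function name crosstab, and the numbers are assumptions made for illustration only.

```python
from collections import defaultdict


def crosstab(records):
    """records: iterable of (project, category, usage) triples.

    Returns (table, totals), where table[project][category] is the usage
    for that combination and totals[project] sums over all categories.
    """
    table = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for project, category, usage in records:
        table[project][category] += usage
        totals[project] += usage
    # Convert to plain dicts so the result prints and compares cleanly.
    return {p: dict(cats) for p, cats in table.items()}, dict(totals)


# Invented usage figures, for illustration only.
records = [
    ("dewiki", "Human", 120),
    ("dewiki", "Taxon", 30),
    ("enwiki", "Human", 200),
    ("enwiki", "Taxon", 90),
]
table, totals = crosstab(records)
print(table["dewiki"]["Human"])  # 120
print(totals["enwiki"])          # 290
```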

WDCM Usage Dashboard




The WDCM Usage Dashboard focuses on providing the detailed statistics on Wikidata usage in particular sister projects or the selected subsets of them. Three pages that present analytical results in this Dashboard receive a description here: (1) Usage, (2) Tabs/Crosstabs, and (3) Tables.



The Usage tab provides elementary statistics on Wikidata usage across the semantic categories (left column) and sister projects (right column).

To the left, we first encounter a general overview of Basic Facts: the number of Wikidata items that are encompassed by the current WDCM taxonomy (in effect, this is the number of items that are encompassed by all WDCM analyses), the number of sister projects that have client-side Wikidata usage tracking enabled (currently, that means that the Wikibase/Schema/wbc entity usage table is present there), the number of semantic categories in the current version of the WDCM Taxonomy, and the number of different sister project types (e.g. Wikipedia, Wikinews, etc.).

The Category Report subsection allows you to select a specific semantic category and generate two charts beneath the selection: (a) the category top 30 projects chart, and (b) the category top 30 Wikidata items chart. The first chart will display 30 sister projects that use Wikidata items from this semantic category the most, with the usage data represented on the horizontal axis, and the project labels on the vertical axis. The percentages next to the data points in this chart refer to the proportion of total category usage that takes place in the respective project. The next chart will display the 30 most popular items from the selected semantic category: item usage is again placed on the horizontal axis, item labels are on the vertical axis, and item IDs are placed next to the data points themselves. 

The Categories General Overview subsection is static and allows no selection; it introduces two concise overviews of Wikidata usage across the semantic categories of Wikidata items. The Wikidata Usage per Semantic Category chart provides semantic categories on the vertical and item usage statistics on the horizontal axis; the percentages tell us about the proportion of total Wikidata usage that the respective semantic category carries. Beneath, the Wikidata item usage per semantic category in each project type chart provides a cross-tabulation of semantic categories vs. sister project types. The categories are color-coded and represented on the horizontal axes, while each chart represents one project type. The usage scale, represented on the vertical axes, is logarithmic to ease the comparison and enable practical data visualization.

To the right, an opportunity to inspect Wikidata usage in a single Wikimedia project is provided. The Project Report section allows you to select a single Wikimedia project and obtain results on it. The first section that will be generated upon making a selection provides a concise narrative summary of Wikidata usage in the selected project alongside a chart presenting an overview of Wikidata usage per semantic category. The next chart, Wikidata usage rank, shows the rank position of the selected project among other sister projects in respect to the Wikidata usage volume. Beneath, a more complex structure, Semantic Neighbourhood, is given. In this network, or directed graph if you prefer, each project points towards the one most similar to it. The selected project has a different color. The results are relevant only in the context of the current selection: only the selected project and its 20 nearest semantic neighbors are presented. Once again: each project points to the one which utilizes Wikidata in a way most similar to it. The top 30 Wikidata items chart presents the top 30 Wikidata items in the selected project: item labels are given on the vertical axis, Wikidata usage on the horizontal axis, and the item IDs are labeled close to the data points themselves.
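The structure of such a neighbourhood graph can be sketched in Python: take the selected project and its k nearest neighbours, then let every node point to the node most similar to it within that set. The source does not say which distance the dashboard uses here, so the Euclidean default, the function names, and the toy profiles are all assumptions for illustration.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def semantic_neighbourhood(profiles, selected, k=20, distance=euclidean):
    """profiles: {project: numeric usage profile}.

    Returns {project: most similar project} for the selected project and
    its k nearest neighbours, i.e. the edges of a directed graph.
    """
    others = [p for p in profiles if p != selected]
    others.sort(key=lambda p: distance(profiles[selected], profiles[p]))
    nodes = [selected] + others[:k]
    edges = {}
    for p in nodes:
        candidates = [q for q in nodes if q != p]
        if candidates:  # a lone node has nothing to point to
            edges[p] = min(
                candidates, key=lambda q: distance(profiles[p], profiles[q])
            )
    return edges


# Toy profiles for four hypothetical projects.
profiles = {"A": [0, 0], "B": [1, 0], "C": [5, 0], "D": [6, 0]}
print(semantic_neighbourhood(profiles, "A", k=2))
# {'A': 'B', 'B': 'A', 'C': 'B'}
```

Because nearest neighbours are computed only within the displayed set, the edges depend on the current selection, which is why the results are relevant only in that context.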



The Tabs/Crosstabs tab offers the most direct opportunity to study the Wikidata usage statistics across the sister projects. A selection of projects and semantic categories will be intersected and only results in the scope of the intersection will be returned. The charts should be self-explanatory: the usage statistic is always represented by the vertical axis, while the horizontal axis and sub-panels play various roles depending on whether a category vs. project or a category vs. project type cross-tabulation is provided. Data points are labeled in millions (M) or thousands (K) of pages (see the Wikidata usage definition above). While charts can display only a limited number of data points, relative to the size of the selection, each of them is accompanied by a Data (csv) button that will initiate a download of the full respective data set as a comma-separated file.



The Tables tab presents searchable and sortable tables and cross-tabulations with self-explanatory semantics. Access full WDCM usage datasets from here.

WDCM Semantics Dashboard




The WDCM Semantics Dashboard is probably the central and the analytically most complicated of all WDCM Dashboards. Here we provide only the necessary basics of distributional semantics needed in order to understand the results of semantic topic modeling presented on this WDCM dashboard. A user who needs to dive deep into the similarity structures between the Wikimedia sister projects in respect to their Wikidata usage patterns will most probably have to do some additional reading first. However, the Dashboard simplifies the presentation of the results as much as possible to make them accessible to any Wikidata user or Wikipedia editor who is not necessarily involved in Data or Cognitive Science. Reading through the WDCM Semantic Topic Models section in this page is highly advised to anyone who has never met semantic topic models or distributional semantics before.

WDCM Semantic Topic Models

Suggested Readings
  • Distributional Semantics. In Wikipedia. Retrieved October 24, 2017.
  • Topic model. In Wikipedia. Retrieved October 24, 2017.
  • Latent Dirichlet allocation. In Wikipedia. Retrieved October 24, 2017.
  • Dimensionality reduction. In Wikipedia. Retrieved October 24, 2017.

While Wikidata itself is a semantic ontology with pre-defined and evolving normative rules of description and inference, Wikidata usage is essentially a social, behavioral phenomenon, suitable for study by means of machine learning in the field of distributional semantics: the analysis and modeling of statistical patterns of occurrence and co-occurrence of Wikidata item and property usage across the client projects (e.g. enwiki, frwiki, ruwiki, etc.). WDCM thus employs various statistical approaches in an attempt to describe and provide insights from the observable Wikidata usage statistics (e.g. topic modeling, clustering, and dimensionality reduction, all beyond providing elementary descriptive statistics of Wikidata usage, of course).

Wikidata Usage Patterns. The “golden line” that connects the reasoning behind all WDCM functions can be non-technically described in the following way. Imagine observing the number of times each item in a set of N particular Wikidata items was used across some project (enwiki, for example). Imagine having the same data for other projects as well: for example, if 200 projects are under analysis, then we have 200 counts for each of the N items in the set, and the data can be described by an N x 200 matrix (items x projects). Each column of counts, representing the frequency of occurrence of all Wikidata entities under consideration in one of the 200 projects, is a vector that represents a particular Wikidata usage pattern. By inspecting and statistically modeling the usage pattern matrix (the matrix that encompasses all such usage patterns across the projects) or the covariance/correlation matrix derived from it, many insights into the similarities between Wikimedia projects (or, more precisely, the similarities between their usage patterns) can be found.
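To make the idea of a usage pattern matrix more concrete, here is a minimal Python sketch that builds a tiny items x projects matrix and derives the correlations between the project columns. The item names, the counts, and the helper functions are hypothetical and chosen only for illustration; the actual WDCM computations run on far larger data and are not reproduced here.

```python
import math

# Hypothetical usage counts: rows are Wikidata items, columns are projects.
PROJECTS = ["enwiki", "frwiki", "ruwiki"]
ITEMS = ["item_1", "item_2", "item_3", "item_4"]
USAGE = [
    [120, 60, 5],
    [80, 40, 10],
    [10, 5, 90],
    [40, 20, 30],
]


def column(matrix, j):
    """Return column j: the usage pattern (vector) of one project."""
    return [row[j] for row in matrix]


def pearson(x, y):
    """Pearson correlation between two usage patterns of equal length."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def project_correlations(matrix, projects):
    """Correlate every pair of project columns in the usage pattern matrix."""
    patterns = [column(matrix, j) for j in range(len(projects))]
    return {
        (p, q): pearson(patterns[i], patterns[j])
        for i, p in enumerate(projects)
        for j, q in enumerate(projects)
        if i < j
    }


correlations = project_correlations(USAGE, PROJECTS)
for pair, r in correlations.items():
    print(pair, round(r, 3))
```

In this toy data the frwiki column is exactly half of the enwiki column, so the two usage patterns correlate perfectly, while ruwiki uses a different mix of items and correlates negatively with both.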

In essence, the technology and mathematics behind WDCM rely on the same set of practical tools and ideas that support the development of semantic search engines and recommendation systems, only applied to a specific dataset that encompasses the usage patterns for tens of millions of Wikidata entities across its client projects.

Dashboard: Semantic Models


Each of the 14 currently used semantic categories in the WDCM Taxonomy of Wikidata items receives a separate topic model. Each topic model encompasses two or more topics, or semantic themes. Here you can select a semantic category (e.g. "Geographical Object", "Human") and a particular topic from its model. The page will produce three outputs: (1) the Top 50 items in this topic chart, which presents the 50 most important items in the selected topic of the selected category's topic model, (2) the Topic similarity network, which presents the similarity structure among the 50 most important items in the selected topic, and (3) the Top 50 projects in this topic chart, which presents the 50 Wikimedia projects in which the selected topic plays the most prominent role within the selected semantic category.

Dashboard: Project Semantics


Make a selection of Wikimedia projects here and hit Apply Selection. The Dashboard will produce a series of charts, one for each Wikidata semantic category that is present in your selection of projects, and compute the relative importance (%) of each topic in the given selection for each semantic category. Do not forget that category-specific semantic models do not necessarily encompass the same number of topics (in fact, they rarely do); also, Topic n in one category is not the same thing as Topic n in some other category.
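The exact formula behind the relative importance (%) is not given on this page, so the following Python sketch only illustrates one plausible reading: topic weights are summed over the selected projects and rescaled to percentages. The weights, the project names, and the function topic_importance are hypothetical assumptions, not the WDCM implementation.

```python
# Hypothetical topic weights per project for ONE semantic category.
# Each list holds the weights of Topic 1, Topic 2 and Topic 3 in that project.
TOPIC_WEIGHTS = {
    "enwiki": [0.6, 0.3, 0.1],
    "frwiki": [0.2, 0.5, 0.3],
    "ruwiki": [0.1, 0.1, 0.8],
}


def topic_importance(weights, selection):
    """Relative importance (%) of each topic across the selected projects.

    Assumption: importance is the topic's summed weight over the selection,
    divided by the total weight of all topics in the selection.
    """
    chosen = [weights[p] for p in selection if p in weights]
    if not chosen:
        return []
    totals = [sum(topic_column) for topic_column in zip(*chosen)]
    grand_total = sum(totals)
    return [100.0 * total / grand_total for total in totals]


shares = topic_importance(TOPIC_WEIGHTS, ["enwiki", "frwiki"])
for n, share in enumerate(shares, start=1):
    print(f"Topic {n}: {share:.1f}%")
```

Such a computation would be repeated separately for every semantic category, since each category has its own topic model with its own number of topics.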

Dashboard: Similarity Maps


Upon selection of a semantic category, the Dashboard will present a 2D map that represents the similarities between the Wikimedia projects, computed from the selected category's semantic model only. Here you can learn how similar or dissimilar the sister projects are in respect to their usage of Wikidata items from a single semantic category.
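A similarity map of this kind starts from pairwise distances between projects. The Python sketch below computes such distances from per-project topic distributions using the Hellinger distance. Both the choice of distance and the numbers are assumptions made for illustration, since this page does not state which measure WDCM uses; projecting the distances onto a 2D map would additionally require a dimensionality reduction step, which is not shown here.

```python
import math

# Hypothetical topic distributions per project for one semantic category.
# Each list sums to 1 and gives the share of each topic in that project.
TOPIC_MIX = {
    "enwiki": [0.6, 0.3, 0.1],
    "frwiki": [0.5, 0.4, 0.1],
    "ruwiki": [0.1, 0.1, 0.8],
}


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical)."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)


def distance_matrix(mix):
    """Pairwise distances between all projects, keyed by (project, project)."""
    names = sorted(mix)
    return {(a, b): hellinger(mix[a], mix[b]) for a in names for b in names}


distances = distance_matrix(TOPIC_MIX)
for (a, b), d in distances.items():
    if a < b:
        print(a, b, round(d, 3))
```

Projects with a similar topic mix (enwiki and frwiki in this toy data) end up close to each other, while a project with a different mix (ruwiki) ends up far from both.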

User feedback

  • Any feedback on the WDCM usage is welcome and will be highly appreciated.
  • If you wish to discuss or help improve the technical aspects of the WDCM System, leaving a comment on the project Wikitech page is the way to go.

How to contribute

  • Providing user feedback is essential for the development of analytical systems such as WDCM. Sharing your experiences with the WDCM is crucial in that respect. Think: project Talk Page.
  • The most useful contribution that we can imagine at the moment is to share your experiences in interpreting the WDCM results obtained for any analytical purposes that you might have had.
  • In case you have an idea of how you would like to contribute to the WDCM system that is not listed here, contact the system developer or simply leave a comment here on the project Talk Page.