Help:Modeling causes

From Wikidata

Describing the cause of things is a fundamental task of human knowledge. Whether it's rainbows or plane crashes, we want to know: "why?" Wikidata can provide structure for basic answers to these questions.

A common way to talk about causes is as a chain of events. Some initial event or condition results in a sequence of events, each one causing the next, that eventually gives rise to some effect. This model can oversimplify things, but for many cases it works well enough. Importantly, it is easy for humans to understand.

The two most important parts of a chain of events are the first cause and the last cause. The first cause -- the initial cause that put the others in motion -- is called the underlying cause. The last cause -- the final cause that resulted in some effect -- is called the immediate cause. Underlying causes and immediate causes are also known as "ultimate" causes and "proximate" causes, respectively. An effect is often also significantly influenced by conditions or events that do not directly cause it. Such a condition or event is called a contributing factor. The difference between a cause and a contributing factor is that removing a cause would have prevented the effect, while removing a contributing factor would not have.
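
The chain-of-events model can be sketched in a few lines of Python. This is only an illustration, assuming a chain is given as a list of causes in temporal order with the effect itself left out; the names `CausalRoles` and `roles_in_chain` are hypothetical and not part of Wikidata.

```python
from typing import NamedTuple, Sequence


class CausalRoles(NamedTuple):
    underlying: str  # first cause in the chain
    immediate: str   # last cause in the chain


def roles_in_chain(chain: Sequence[str]) -> CausalRoles:
    """Pick the underlying and immediate cause from an ordered chain.

    Assumption: `chain` lists causes earliest first and excludes the effect.
    """
    if not chain:
        raise ValueError("a causal chain needs at least one cause")
    return CausalRoles(underlying=chain[0], immediate=chain[-1])


# Placeholder labels; a real chain would name actual events or conditions.
roles = roles_in_chain(["initial event", "intermediate event", "final event"])
print(roles.underlying)
print(roles.immediate)
```

When the chain has a single link, the same cause is both underlying and immediate. Contributing factors sit outside the chain and are not represented in this sketch.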

When people talk simply about the "cause" of something, they tend to mean the underlying cause. For that reason, and for simplicity, "has cause" is the preferred label for the main property about causes; it represents the underlying cause of something.

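In summary, the three generic properties for representing causes are:

has cause: the underlying cause of something
has immediate cause: the immediate cause of something
has contributing factor: a contributing factor of something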

The absence of a cause can be modeled with:


Fields for underlying cause, immediate cause, and contributing factor in the US Standard Certificate of Death

The three-tier approach for modeling causes in terms of underlying causes, immediate causes and contributing factors comes from the US Standard Certificate of Death, following guidelines from the World Health Organization for structuring data about cause of death. Statements about a person's cause of death on this certificate are the basis for much work and research in public health and demographics. A quick overview of the certificate is available at Documenting Death; a comprehensive reference is in the Physicians' Handbook on Medical Certification of Death by the CDC.

Of course, the study of causes goes far beyond vital records. Cause talk is ubiquitous in everyday life and academic disciplines. Modern historical works like Guns, Germs, and Steel examine why the Spanish conquistador Pizarro conquered the Inca of Peru, and not vice versa, in terms of proximate causes and ultimate causes: guns, germs, and steel on the one side and agriculture and geography on the other. Causation also holds a central place in philosophy. Classical thinkers like Aristotle said that things result from four kinds of causes. Modern philosophy has investigated causes from the perspectives of metaphysics, processes and counterfactuals. In business, Ishikawa diagrams (also called fishbone diagrams) represent causes and effects graphically to help analyze product defects and engineering accidents.

Representing knowledge about causes has also been a focus in artificial intelligence (AI) and the Semantic Web. In Processes and Causality, John Sowa reviews approaches to modeling causation by Judea Pearl, John McCarthy and other luminaries. Many AI representations model causes in a directed acyclic graph (DAG). Some models weight the link between each node with a probability to capture uncertainty; these are called causal networks.
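
As a rough illustration of the DAG idea, the sketch below orders a small causal graph so that every cause comes before its effects, and rejects graphs that contain a cycle. It uses `graphlib` from the Python standard library (Python 3.9 or later). The function name `causal_order` and the input layout (each node mapped to the set of nodes that directly cause it) are assumptions made for this example.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set


def causal_order(direct_causes: Dict[str, Set[str]]) -> List[str]:
    """Return the nodes ordered so that each cause precedes its effects.

    `direct_causes` maps a node to the set of nodes that directly cause it.
    Raises ValueError if the graph is not acyclic.
    """
    try:
        return list(TopologicalSorter(direct_causes).static_order())
    except CycleError as err:
        raise ValueError("causal graph contains a cycle") from err


# Placeholder node labels for a simple three-link chain.
graph = {
    "effect": {"last cause"},
    "last cause": {"first cause"},
}
print(causal_order(graph))
```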

Distinguishing causes and contributing factors

What is the difference between a "cause" and a "contributing factor"? In public health, law, and the social and natural sciences, the difference is that an effect would not have happened but for a cause, whereas it would still have happened but for a contributing factor. In other words, if you take away a contributing factor, the effect still happens; if you take away a cause, it does not. This test derives from the concept of sine qua non, meaning "without which not". It is also known as the "counterfactual" basis for establishing something as a cause.

Here is a test for assigning something as a cause or a contributing factor:

If the thing were removed, would that have prevented the effect?

Yes: the thing is a cause
No, but it significantly influenced the effect: the thing is a contributing factor
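
The test above can be written as a small decision function. This is a minimal sketch: the names `Role` and `classify` are hypothetical, and the `NEITHER` outcome is an assumption added here for inputs the test does not cover, namely things that neither prevent the effect when removed nor significantly influence it.

```python
from enum import Enum


class Role(Enum):
    CAUSE = "cause"
    CONTRIBUTING_FACTOR = "contributing factor"
    NEITHER = "neither"  # assumption: not covered by the test in the text


def classify(removal_prevents_effect: bool, significantly_influenced: bool) -> Role:
    """Apply the counterfactual test to one candidate thing."""
    if removal_prevents_effect:
        # Removing the thing would have prevented the effect.
        return Role.CAUSE
    if significantly_influenced:
        # The effect would still have happened, but the thing mattered.
        return Role.CONTRIBUTING_FACTOR
    return Role.NEITHER


print(classify(removal_prevents_effect=True, significantly_influenced=True).value)
print(classify(removal_prevents_effect=False, significantly_influenced=True).value)
```

The two boolean answers still have to come from sources; the function only records how the answers map to a role.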


Challenger disaster

Challenger explosion

The space shuttle Challenger exploded soon after liftoff in 1986. Why? The Rogers Commission Report, the authoritative analysis of the causes of the accident, included distinct chapters on the causes and the contributing causes of the accident.

Here is how the causes would be modeled on Wikidata:

Space Shuttle Challenger disaster (Q858145)


Darwin's finches

What causes evolution -- that is, changing traits in biological populations across generations?

This case illustrates how, sometimes, it's not obvious how to break causes into underlying causes and immediate causes. The key is to ask: which cause comes before the other? With that in mind, the way to model the causes of evolution becomes clearer:

evolution (Q1063)

In addition to describing the causes of evolution writ large, we can also use this approach to describe the evolution of specific biological groups (i.e. taxa), organs or traits. Which mutations and genetic drifts underlay it? Which natural or artificial pressures immediately caused the selection of fitter populations?


Disease burden of malaria

Malaria is a mosquito-borne disease caused by microbial parasites called Plasmodium. It is a crushing health burden in Africa, where each year it kills hundreds of thousands of people and results in billions of dollars of economic losses. It has shaped facets of human evolution. The disease is amplified by poverty, which impedes distribution of preventative measures like mosquito nets, insect repellent and preventative and therapeutic medicines.

malaria (Q12156)

Cretaceous–Paleogene extinction event

Why did the dinosaurs die? See below.

Cretaceous–Paleogene extinction event (Q55811)

American Civil War

Soldiers in the American Civil War

The American Civil War is the deadliest war in United States history, resulting in the deaths of over 750,000 people. The causes of the war have been debated by historians for over 150 years. A central disagreement is whether slavery or states' rights was the underlying cause of the war. The consensus of modern historians is that slavery was the primary cause of the war, with states' rights being a pretext. A notable counter-narrative among some historians asserts roughly the opposite.

These conflicting claims among sources -- and their relative weight -- can be modeled with statement ranks as shown below. The war also showcases the benefit of being able to model causation at different scopes (immediate vs. underlying) and degrees (cause vs. contributing factor).

American Civil War (Q8676)

Things to avoid

Very causally distant statements

Consider the statement "universe (Q1) has part (P527) Neptune (Q332)". This statement is correct, but it is not a good idea to add it to 'universe'. The universe has many, many parts. Rather than state each of those parts explicitly in 'universe', we should let those "has part" statements follow implicitly from chains like universe (Q1) has part A_1, A_1 has part A_2, ..., A_n has part Solar System (Q544), Solar System (Q544) has part Neptune (Q332).

A similar idea applies to has cause.

For example, while it might be correct to say "Alps (Q1286) has cause formation and evolution of the Solar System (Q3535)", that's not quite appropriate. Better to say something like "Alps (Q1286) has cause Alpine orogeny (Q661478)", then, if you're feeling ambitious, connect the links of the causal chain back to formation and evolution of the Solar System (Q3535). In other words, do not use has cause to connect an effect to a cause that is exceedingly far away on the relevant chain of events. Use common sense.
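
The idea that distant causes should follow from a chain of nearby "has cause" statements can be sketched as a graph walk. This is an illustration only: the dictionary stands in for direct "has cause" statements, `upstream_causes` is a hypothetical helper, and `A_1` is a placeholder for whatever intermediate links an editor would actually add.

```python
from collections import deque
from typing import Dict, List


def upstream_causes(effect: str, direct_causes: Dict[str, List[str]]) -> List[str]:
    """Collect every cause reachable from `effect`, nearest first.

    `direct_causes` maps an item to the values of its direct "has cause"
    statements. Breadth-first search keeps nearer causes earlier in the result.
    """
    found: List[str] = []
    queue = deque(direct_causes.get(effect, []))
    while queue:
        cause = queue.popleft()
        if cause in found:
            continue  # already visited; also guards against loops in the data
        found.append(cause)
        queue.extend(direct_causes.get(cause, []))
    return found


# Only the first link comes from the text; "A_1" is a placeholder.
direct_causes = {
    "Alps": ["Alpine orogeny"],
    "Alpine orogeny": ["A_1"],
    "A_1": ["formation and evolution of the Solar System"],
}
print(upstream_causes("Alps", direct_causes))
```

Each item stores only its nearest cause, yet the distant cause remains discoverable by following the links.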

Bias about causes, especially in controversial events

The War in Donbas (Q16335075), also known as the war in Eastern Ukraine, is at the time of this writing an ongoing, controversial event. Reliable sources often disagree on the causes of that war and many other conflicts.

Sources aligned with each belligerent in a conflict will often make statements about underlying causes, immediate causes and contributing factors that contradict each other. Reflecting those statements in cause properties is OK -- Wikidata is designed to be able to handle knowledge diversity with ranks and granular, robust references. Just don't mark any controversial statements as "preferred" rank, ensure proper sourcing, and make avid use of qualifiers like statement disputed by (P1310). Don't give undue weight to one side or another.
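
These guidelines can be expressed as a simple review check. The sketch below is hypothetical: `CauseStatement` and `review_flags` are not part of any Wikidata tool, the `disputed_by` field stands in for statement disputed by (P1310) qualifiers, and treating "has a disputed-by qualifier" as the marker of a controversial statement is an assumption of this example. The value and source strings are placeholders.

```python
from dataclasses import dataclass, field
from typing import List

RANKS = ("preferred", "normal", "deprecated")  # the three Wikidata statement ranks


@dataclass
class CauseStatement:
    subject: str
    prop: str
    value: str
    rank: str = "normal"
    references: List[str] = field(default_factory=list)
    disputed_by: List[str] = field(default_factory=list)  # stands in for P1310


def review_flags(statement: CauseStatement) -> List[str]:
    """Return human-readable problems with one cause statement."""
    flags = []
    if statement.rank not in RANKS:
        flags.append("unknown rank")
    if statement.disputed_by and statement.rank == "preferred":
        flags.append("disputed statement marked preferred")
    if not statement.references:
        flags.append("no reference")
    return flags


claim = CauseStatement(
    subject="Q16335075",  # War in Donbas
    prop="P828",          # has cause
    value="<cause asserted by source A>",
    rank="preferred",
    disputed_by=["<source B>"],
)
print(review_flags(claim))
```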

Contributing factors as underlying causes

In the early days of the American Civil War, Abraham Lincoln is said to have declared upon meeting anti-slavery author Harriet Beecher Stowe: "So this is the little lady who started this great war." This does not mean you should state "American Civil War (Q8676) has cause (P828) Harriet Beecher Stowe (Q102513), source: Abraham Lincoln".

Stowe did write the influential novel Uncle Tom's Cabin, which did inflame passions in the Northern United States, but neither Stowe nor her book was an underlying cause of the war. Larger forces were. Given the various factors' relative significance, it would be much more appropriate to state "American Civil War (Q8676) has contributing factor (P1479) Uncle Tom's Cabin (Q2222)".

Also note here that Stowe's work, not Stowe herself, is set as the contributing factor above, because the book was the thing that more precisely influenced the onset of the war. Humans can intuit and machines can infer that the author had a causal role, but that does not need to be explicitly stated here.


Not watertight

None of the examples above are perfect. In the Challenger disaster, sources differ on whether communication failure was an underlying cause or a contributing factor. In evolution, populations produced by modern genetic engineering could be said to have biological selection as their underlying (i.e. first) cause and mutation as their immediate (i.e. last) cause -- the reverse of what the example states. And so on.

Not completely precise

The examples are also not exactingly precise. The American Civil War was primarily caused by disagreement over slavery, not just slavery in itself. The dinosaurs died because of the impact of the asteroid, not the asteroid itself. The immediate cause of malaria is the bite of a mosquito, not the mosquito itself. There is a school of thought which maintains that we can only speak of causes and effects in terms of events. Construing "mosquito" as "a mosquito existing", and therefore as an event, would presumably not satisfy students of that school of thought.

The three-tier approach is also a less robust way to model causation than, say, a directed graph in which each vertex is an influencing factor and each edge has a weight representing the correlation or approximate risk of one factor resulting in the other.
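
For comparison, a weighted directed graph of this kind might look like the sketch below. It is only an illustration of the alternative, not something Wikidata implements; the type alias `CausalGraph`, the helper `incoming_influences`, the factor labels and the weights are all made up for the example.

```python
from typing import Dict, List, Tuple

# Maps each factor to the factors it influences, with a weight on every edge.
CausalGraph = Dict[str, Dict[str, float]]


def incoming_influences(graph: CausalGraph, effect: str) -> List[Tuple[str, float]]:
    """List the factors with an edge into `effect`, strongest weight first."""
    found = [
        (factor, targets[effect])
        for factor, targets in graph.items()
        if effect in targets
    ]
    return sorted(found, key=lambda pair: pair[1], reverse=True)


# Placeholder factors and illustrative weights.
graph: CausalGraph = {
    "factor A": {"effect": 0.5},
    "factor B": {"factor A": 0.75, "effect": 0.25},
}
print(incoming_influences(graph, "effect"))
```

Such a graph carries more information than three properties can, at the cost of requiring a weight for every link.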

A comment on trade-offs

These issues are part of the trade-off between expressiveness, rigor and simplicity. There are ways to mitigate each of these deficiencies within the three-tier model, e.g. by making extensive use of ranks, qualifiers, and constraint-checking bots. In doing so we should ensure that we do not transform Wikidata into a logically consistent but stilted and impenetrable network of axioms. It is important that Wikidata be intuitive for humans (especially non-expert humans) to understand at a glance.

Related properties

The properties has cause, has immediate cause and has contributing factor are related to other properties:

  • causally_upstream_of: an equivalent property of has cause in the Basic Formal Ontology, an upper ontology widely used in the sciences

Further reading

The Metaphysics of Causation
Causal Processes
Counterfactual Theories of Causation