User:Magnus Manske/Women in properties

From Wikidata

This is an analysis of Wikidata properties relating to biographies, and their gender imbalance.

You are invited to fix, update, amend, and improve this page.

Data

  • The properties used are of property type External ID and are an instance of Wikidata property for authority control for people (Q19595382) (SPARQL query).
  • The data was collected in February 2020. 1323 properties were assessed.
  • The data was stored on Labs, database s51434__mixnmatch_p (publicly readable with a Labs account), table auth_control_gender.
  • The PHP code used to generate the data, and store it in the database, is here.
  • The snapshot of the data used for this analysis is here.
  • The "unknown" and "other" numbers for property VIAF ID (P214) are estimates, as the underlying SPARQL query times out. All remaining candidates were classified as "unknown", and none as "other". Neither of those values is used in the following analysis.
  • All this is the fault of Magnus Manske (talk).

Analysis


In the following plots, women are shown as a fraction (0-1.0) of the total number of (human (Q5)) items using that property. All mentions of "items" refer to "items with instance of (P31)=human (Q5)".
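As a rough illustration of how such a fraction is computed, here is a minimal R sketch. The `female` count column, the property numbers, and all values are made up for this example; only `property`, `total` and `p_female` appear as column names in the R code on this page.

```r
# Hypothetical per-property counts (illustrative values, not real data).
counts <- data.frame(
  property = c(9001, 9002, 9003),  # made-up property numbers
  female   = c(250, 40, 900),      # assumed column: items with a female gender value
  total    = c(1000, 400, 1200)    # all human (Q5) items using the property
)

# Fraction of women per property, always between 0 and 1.
counts$p_female <- counts$female / counts$total
print(counts)
```

Items counted as "unknown" or "other" still contribute to `total` in this sketch; whether the snapshot handles them the same way is an assumption.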

Women vs. total number of items


This plot shows the distribution of all 1323 properties, fraction of women vs. total number (log10) of items with the property. Properties with >= 100,000 items are labelled.

Women vs. property completeness


This plot shows the 163 properties that have a number of records (P4876) value. Shown is the fraction of women vs. completeness (number of items with that property divided by number of records (P4876)). Properties with >= 100,000 items are labelled. Dots are colored by number of items (black=low number of items, light blue=high number of items).
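The completeness ratio described here can be sketched in R as follows. The `records` column, the property numbers, and all values are assumptions for illustration; `total` and `p_completed` are the column names used in the R code on this page.

```r
# Hypothetical input (illustrative values, not real data).
props <- data.frame(
  property = c(9001, 9002, 9003),  # made-up property numbers
  total    = c(500, 1200, 80),     # items using the property
  records  = c(2000, NA, 100)      # assumed column: number of records (P4876), NA if not set
)

# Completeness: items with the property divided by number of records.
props$p_completed <- props$total / props$records

# Only properties with a known number of records can be plotted.
with_records <- props[!is.na(props$p_completed), ]
print(with_records)
```

Note that this ratio can exceed 1 when Wikidata has more items than the stated number of records; the plotting code on this page limits the x axis to the range 0 to 1.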

Histogram


This plot shows 488 properties used in >= 1,000 items, as a histogram of the fraction of women.

Ordered view


This plot shows 488 properties used in >= 1,000 items, ordered by increasing fraction of women.


R code


The following code was used to generate the above plots in R:

library(ggplot2)

# Load the snapshot; keep only the properties that have a completeness value
m <- read.delim("mixnmatch.tsv")
m_complete <- m[!is.na(m$p_completed),]

# Women vs. property completeness (properties with p_completed); label properties with >= 100,000 items
ggplot(m_complete,aes(x=p_completed,y=p_female)) + geom_point(aes(colour=log10(total))) + xlim(0,1) + ylim(0,1) + xlab("Completed") + ylab("Women") + geom_text(data=subset(m_complete,total>=100000),aes(label=paste("P",property,sep='')),hjust=-0.1,vjust=0,size=3)

# p_female vs total (all properties, sorted by fraction of women)
mf <- m[order(m$p_female),]
ggplot(mf,aes(y=p_female,x=log10(total))) + geom_point() + ylim(0,1) + xlab("log10(total)") + ylab("Women") + geom_text(data=subset(mf,total>=100000),aes(label=paste("P",property,sep='')),hjust=-0.1,vjust=0,size=2)

# Histogram (properties with >= 1,000 items)
mf2 <- mf[which(mf$total>=1000),]
ggplot(mf2,aes(p_female)) + geom_histogram() + xlab("Women") + ylab("Properties (>=1000 items)")

# Women, ordered (mf2 is already sorted by p_female, so the row index gives the rank)
ggplot(mf2,aes(x=seq_len(nrow(mf2)),y=p_female)) + geom_point(aes(colour=log10(total))) + ylim(0,1) + ylab("Women") + xlab("Properties with >=1000 items, ordered by fraction of women")