User:Magnus Manske/Women in properties

This is an analysis of Wikidata properties relating to biographies, and their gender imbalance.

You are invited to fix, update, amend, and improve this page.

Data

Properties used are of property type External ID, and are an instance of Wikidata property for authority control for people (Q19595382) (SPARQL query).
The data was collected in February 2020. 1323 properties were assessed.
The data was stored on Labs, database s51434__mixnmatch_p (publicly readable with a Labs account), table auth_control_gender.
The PHP code used to generate the data, and store it in the database, is here.
The snapshot of the data used for this analysis is here.
The "unknown" and "other" numbers for property VIAF ID (P214) are estimates, as the underlying SPARQL query times out. All remaining candidates were classified as "unknown", and none as "other". Neither of those values is used in the following analysis.
All this is the fault of Magnus Manske (talk).

Analysis

In the following plots, women are shown as a fraction (0-1.0) of the total number of human (Q5) items using that property. All mentions of "items" refer to "items with instance of (P31)=human (Q5)".
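As a minimal sketch of how this fraction is defined, assuming hypothetical per-property counts (the data frame `counts`, its columns `female` and `total_humans`, and all numbers below are made up for illustration and are not taken from the snapshot):

```r
# Hypothetical counts per property; names and values are illustrative only.
counts <- data.frame(
  property = c("A", "B", "C"),
  female = c(20, 150, 0),
  total_humans = c(100, 200, 50)
)

# Fraction of women (0-1.0) among all human (Q5) items using the property
counts$p_female <- counts$female / counts$total_humans

print(counts)
```

The resulting `p_female` column plays the same role as the column of that name plotted in the R code further down.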

Women vs. total number of items

This plot shows the distribution of all 1323 properties, fraction of women vs. total number (log10) of items with the property. Properties with >= 100,000 items are labelled.

Women vs. property completeness

This plot shows 163 properties where there is a number of records (P4876). Shown is the fraction of women vs. completeness (number of items with that property divided by number of records (P4876)). Properties with >= 100,000 items are labelled. Dots are colored by number of items (black=low number of items, light blue=high number of items).
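The completeness value described above can be sketched as follows. This is an illustration under assumptions, not the code used to build the snapshot: the data frame `props`, the column `records` (standing in for the number of records (P4876) value), and all numbers are hypothetical.

```r
# Hypothetical example data; `records` stands in for number of records (P4876).
props <- data.frame(
  property = c("A", "B", "C"),
  total = c(500, 120000, 3000),
  records = c(1000, 480000, NA)  # property C has no number of records
)

# Completeness = items with the property / number of records
props$p_completed <- props$total / props$records

# Keep only properties for which completeness can be computed
with_records <- props[!is.na(props$p_completed), ]

print(with_records)
```

Properties without a number of records drop out at the last step, which is why this plot covers only a subset of the 1323 properties.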

Histogram

This plot shows 488 properties used in >= 1,000 items, as a histogram of the fraction of women.

Ordered view

This plot shows 488 properties used in >= 1,000 items. It shows the (increasing) fraction of women in that property.

R code

The following code was used to generate the above plots in R:

library(ggplot2)
m <- read.delim("mixnmatch.tsv")
# Rows that have a completeness value (p_completed is not NA)
m_complete <- m[!is.na(m$p_completed),]

# Catalogs with p_completed
p <- ggplot(m_complete,aes(x=p_completed,y=p_female)) + geom_point(aes(colour=log10(total))) + xlim(0,1) + ylim(0,1) + xlab("Completed") + ylab("Women") + geom_text(aes(label=paste("P",property,sep='')),hjust=-0.1, vjust=0,size=3,data=subset(m_complete,total>=100000))
print(p)

# p_female vs total
mf <- m[order(m$p_female),]
ggplot(mf,aes(y=p_female,x=log10(total))) + geom_point() + ylim(0,1) + xlab("log10(total)") + ylab("Women") + geom_text(aes(label=paste("P",property,sep='')),hjust=-0.1, vjust=0,size=2,data=subset(mf,total>=100000))

# Histogram
mf2 <- mf[which(mf$total>=1000),]
ggplot(mf2,aes(p_female)) + geom_histogram() + xlab("Women") + ylab("Properties (>=1000 items)")

# Women, ordered
ggplot(mf2,aes(x=seq_len(nrow(mf2)),y=p_female)) + geom_point(aes(colour=log10(total))) + ylim(0,1) + ylab("Women") + xlab("Properties with >=1000 items, ordered by fraction of women")
