User:Envlh/Denelezh/Schema

From Wikidata
Jump to navigation Jump to search

This page is a draft of the physical data model for data persistence in Denelezh, a mixed version between the current state of the model and what could be done in the future.

Process[edit]

Quick overview of the data processing:

  1. From Wikidata JSON dump:
    1. Download last available Wikidata JSON dump.
    2. Parse dump with Wikidata Toolkit and generate CSV files.
    3. Load CSV files into MySQL database.
  2. From Wikidata Query Service (SPARQL):
    1. Fetch list of Wikimedia projects.
  3. Generate reports.

Tables[edit]

Technical tables[edit]

dump[edit]

List dumps loaded into Denelezh.

Field Description Example Type
date Date of the dump. 2017-01-02 DATE
step Step of the loading.
Values: 0 = download in progress; 1 = parsing in progress; 2 = reports generation in progress; 3 = done.
0 TINYINT

Raw data[edit]

These tables are partitioned by dump (one partition per dump). They all have the following column as part of their primary key:

Field Description Example Type
dump Date of the dump. 2017-01-02 DATE

human[edit]

Base table for humans.

Field Description Example Type
dump Date of the dump. 2017-01-02 DATE
id Wikidata Qid of the human (without initial Q). 2689088 BIGINT UNSIGNED
gender Wikidata Qid of the gender (without initial Q), if known. 6581072 BIGINT UNSIGNED
year_of_birth Year of birth, if known. 1950 SMALLINT
year_of_death Year of death, if known. 2015 SMALLINT

human_country[edit]

List countries of citizenship for each human.

Field Description Example Type
dump Date of the dump. 2017-01-02 DATE
id Wikidata Qid of the human (without initial Q). 2689088 BIGINT UNSIGNED
country Wikidata Qid of the country (without initial Q). 142 BIGINT UNSIGNED

human_occupation[edit]

List occupations for each human.

Field Description Example Type
dump Date of the dump. 2017-01-02 DATE
id Wikidata Qid of the human (without initial Q). 2689088 BIGINT UNSIGNED
occupation Wikidata Qid of the occupation (without initial Q). 82955 BIGINT UNSIGNED

human_property[edit]

A line of this table shows that a property is used at least one time at normal or preferred rank in an item representing a human in Wikidata.

Field Description Example Type
dump Date of the dump. 2017-01-02 DATE
id Wikidata Qid of the human (without initial Q). 2689088 BIGINT UNSIGNED
property Wikidata Pid of the property (without initial P). 39 BIGINT UNSIGNED

human_sitelink[edit]

Existing sitelinks for a human.

Field Description Example Type
dump Date of the dump. 2017-01-02 DATE
id Wikidata Qid of the human (without initial Q). 2689088 BIGINT UNSIGNED
sitelink Code of the project. frwiki VARCHAR(32)

occupation_parent[edit]

Stores the hierarchy of occupations. For instance, Q182436 (librarian) is a sublcass (P279) of information professional (Q1662485).

Field Description Example Type
dump Date of the dump. 2017-01-02 DATE
occupation Wikidata Qid (without initial Q). 182436 BIGINT UNSIGNED
parent Wikidata Qid (without initial Q). 1662485 BIGINT UNSIGNED

label[edit]

This table stores all labels.

This table needs refining to handle internationalization.

Field Description Example Type
dump Date of the dump. 2017-01-02 DATE
qid Wikidata Qid (without initial Q). 182436 BIGINT UNSIGNED
label Label. Librarian VARCHAR(500)

Reports[edit]

kpi[edit]

Statistics generated from the previous tables. Each row represents a subset of humans in Wikidata, grouped by gender, year of birth, country of citizenship, occupation, and Wikimedia project. A value of 0 on these columns means any value (for instance, a value of 0 in the column country means any country of citizenship, even none known).

This table needs refining to handle several reports (number of humans in Wikidata, number of humans in Wikidata with at least one sitelink, number of sitelinks for humans in Wikidata, ...).

Field Description Example Type
dump Date of the dump. 2017-01-02 DATE
gender Values: 0 = any; 1 = female; 2 = male; 3 = other. 1 TINYINT UNSIGNED
year_of_birth Id from the table year, 0 for any. 20 SMALLINT UNSIGNED
country Id from the table country, 0 for any. 30 SMALLINT UNSIGNED
occupation Id from the table occupation, 0 for any. 40 SMALLINT UNSIGNED
project Id from the table project, 0 for any. 50 SMALLINT UNSIGNED
total Number of humans in this set. 6478 INT UNSIGNED
total_with_gender Number of humans with a known gender in this set. 6156 INT UNSIGNED
total_with_year_of_birth Number of humans with a known year of birth in this set. 5246 INT UNSIGNED
total_with_year_of_death Number of humans with a known year of death in this set. 3425 INT UNSIGNED
total_with_place_of_birth Number of humans with a known place of birth in this set. 4123 INT UNSIGNED
total_with_place_of_death Number of humans with a known place of death in this set. 1245 INT UNSIGNED
total_with_firstname Number of humans with a known first name in this set. 4236 INT UNSIGNED
total_with_lastname Number of humans with a known last name in this set. 4152 INT UNSIGNED
total_with_country Number of humans with a known country of citizenship in this set. 4263 INT UNSIGNED
total_with_occupation Number of humans with a known occupation in this set. 5236 INT UNSIGNED
total_with_image Number of humans with an image in this set. 2354 INT UNSIGNED

discarded_country[edit]

Countries used less than 200 times are removed from reports and put in this table.

Field Description Example Type
dump Date of the dump. 2017-01-02 DATE
country Wikidata Qid (without initial Q). 142 BIGINT UNSIGNED
count Number of humans with this country of citizenship. 17 SMALLINT UNSIGNED

discarded_occupation[edit]

Occupations used less than 1,000 times are removed from reports and put in this table.

Field Description Example Type
dump Date of the dump. 2017-01-02 DATE
occupation Wikidata Qid (without initial Q). 41546637 BIGINT UNSIGNED
count Number of humans with this occupation. 17 SMALLINT UNSIGNED

Reports reference tables[edit]

These tables are shared across all reports.

year[edit]

Years that are used in at least one report.

Field Description Example Type
id Technical id. 7 SMALLINT UNSIGNED
year Year. 1578 INT

country[edit]

Countries that are used in at least one report (when not discarded).

Field Description Example Type
id Technical id. 1 SMALLINT UNSIGNED
qid Wikidata Qid (without initial Q). 142 BIGINT UNSIGNED

occupation[edit]

Occupations that are used in at least one report (when not discarded).

Field Description Example Type
id Technical id. 1 SMALLINT UNSIGNED
qid Wikidata Qid (without initial Q). 41546637 BIGINT UNSIGNED

project[edit]

Projects that are used in at least one report.

Field Description Example Type
id Technical id. 1 SMALLINT UNSIGNED
qid Wikidata Qid (without initial Q). 41546637 BIGINT UNSIGNED
code Project code. frwiki VARCHAR(32)
type Type of project.
Values: commons, incubator, mediawiki, meta, wikispecies, wikidata, wikimania, wikibooks, wikinews, wikipedia, wikiquote, wikisource, wikiversity, wikivoyage, wiktionary.
wikipedia VARCHAR(32)
url URL of the project. https://fr.wikipedia.org/ VARCHAR(128)

Glossary[edit]

See also[edit]