User:Einebillion/Manual


Wikidata Training Notes

My Wikidata Training notes for Beginners

Setting up a Project in Wikidata

Documented Project Workflows

What do I want to record

Scope

In this step I need to determine the scope of my project. Wikidata and linked open data are a nightmare for the type of person who must complete everything, so pick your scope.

I do this by determining what I want to get out of the project. For example, for my photographers project I want an authoritative entry for every commercial pre-1950 photographer who has items in the collection of a GLAM (gallery, library, archive, or museum). By preference I will build on those working in New Zealand first.

I draw a physical mind map of what I think a perfect linked open data map looks like and use this to refine what I want to record. I keep refining until I'm working with a scope that's achievable.

Wikidata Properties for the Project

Then I look for, and make a list of, the properties that meet the scope of the mind map. If I already have a dataset, I include its fields in my property search. I also look for sources to use as references.

For the NZ photographers project the properties I am interested in are:

New Zealand Photographers

Property | Property title | Value Q number | Q title
P31 | Instance of | Q5 | Human
P18 | Image | Example | Example
P21 | Sex or gender | Example | Example
P27 | Country of citizenship | Q664 | New Zealand
P735 | Given name | Example | Example
P734 | Family name | Example | Example
P1317 | Floruit (active; value needs to be a year) | Example | Example
P569 | Date of birth | Example | Example
P19 | Place of birth | Example | Example
P570 | Date of death | Example | Example
P20 | Place of death | Example | Example
P119 | Place of burial | Example | Example
P106 | Occupation | Q33231 | Photographer
P937 | Work location | Example | Example
P1830 | Owner of | Example | Example
P6379 | Has works in the collection(s) | Example | Example
P3373 | Sibling | Example | Example
P463 | Member of | Example | Example
P1066 | Student of | Example | Example

New Zealand Photographic Studios

Property | Property title | Value Q number | Q title
P31 | Instance of | Q672070 | Photographic studio
P31 | Instance of | Q43229 | Organization
P17 | Country | Q664 | New Zealand
P18 | Image | Example | Example
P159 | Headquarters location (City) | Example | Example
P669 | Located on street | Example | Example
P276 | Location (City) | Example | Example
P571 | Inception | Example | Example
P576 | Dissolved, abolished or demolished | Example | Example
P127 | Owned by | Example | Example
P1037 | Director / manager | Example | Example
P155 | Follows | Example | Example
P156 | Followed by | Example | Example
P6379 | Has works in the collection(s) | Example | Example
P463 | Member of | Example | Example
  • Other properties that might be handy are "described by source" (P1343) and, if the source is listed as a Q item, "main subject" (P921).

References

For the pre-1950 New Zealand photographers, the references I can use for research include:

Data Sources

Then I determine the set of data sources I will work with. These could be Mix'n'Match sets, external databases, or (once I learn how to work with an API) a dataset arising from an API call.

  • For a how-to on making an API call to the DigitalNZ API, see https://web.archive.org/web/20200527074525/https://natlib.govt.nz/researchers/guides/making-api-calls-using-openrefine and https://web.archive.org/save/https://natlib.govt.nz/researchers/guides/making-api-calls
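As a rough idea of what an API call involves, here is a minimal Python sketch that only builds a search URL and does not send it. The endpoint and the parameter names (api_key, text, per_page, page) are my assumptions about the DigitalNZ records API and should be checked against the guides linked above. The function name and the search text are made up for illustration.

```python
from urllib.parse import urlencode

# Assumed endpoint for DigitalNZ record searches; verify against the guides above.
DIGITALNZ_RECORDS_URL = "https://api.digitalnz.org/v3/records.json"


def build_search_url(api_key, text, per_page=20, page=1):
    """Build (but do not send) a search URL for the assumed records endpoint."""
    params = {
        "api_key": api_key,    # your personal DigitalNZ API key
        "text": text,          # free-text search terms
        "per_page": per_page,  # assumed paging parameter
        "page": page,          # assumed paging parameter
    }
    return DIGITALNZ_RECORDS_URL + "?" + urlencode(params)


# Hypothetical search; paste the resulting URL into a browser or OpenRefine.
url = build_search_url("YOUR_API_KEY", "photographer Wellington")
print(url)
```

The same URL pattern is what you would paste into Open Refine's "Add column by fetching URLs" once you have a key.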

Wikidata Checking and Clean Up of Issues

Manually sync Mix'n'Match catalogues

The Action drop-down lets you do this. After the sync you'll see lists of values that appear in Wikidata but not in Mix'n'Match, and vice versa. Resolve those, as resolving them will also flag any duplicates.

Constraint Violation Reports in Wikidata

The best way to check your data, and to see whether uploading it created any issues, is to review the Constraint Violation Report. To get to this report:

  1. Go to the Wikidata property for the identifier, e.g. Alexander Turnbull Library ID https://www.wikidata.org/wiki/Property:P6683.
  2. Go to the Discussion page tab.
  3. In the Documentation section there is a section called Lists. In that section is a link called Database reports/Constraint violations. Click on it.
  4. Use the information from that report to clean up the issues.

If there isn't much in the report you also have the option of adding new constraints to the Property entry, and the report will then pick up any violations of them automatically.
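If you check several identifier properties regularly, the report address can be built straight from the property number. This small Python sketch assumes the reports live under Wikidata:Database_reports/Constraint_violations/ followed by the property ID, which is where the link on the Discussion page leads. The function name is my own.

```python
REPORT_BASE = "https://www.wikidata.org/wiki/Wikidata:Database_reports/Constraint_violations/"


def constraint_report_url(property_id):
    """Return the (assumed) constraint violation report URL for a property such as P6683."""
    pid = property_id.strip().upper()
    # A property ID is the letter P followed by digits only.
    if not (pid.startswith("P") and pid[1:].isdigit()):
        raise ValueError("Not a property ID: " + repr(property_id))
    return REPORT_BASE + pid


print(constraint_report_url("P6683"))  # Alexander Turnbull Library ID
print(constraint_report_url("P3544"))  # Te Papa agent ID
```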

Cleaning up Constraint Violation Reports in Wikidata and in Mix'n'Match

Single value violations appear to arise where there are duplicate values for the same entity in the Mix'n'Match dataset, which likely also existed in the original database. An example would be two different Te Papa Agent IDs for the same person. Well-tended original databases change over time as their administrators regularly clean and update the data, so a Mix'n'Match dataset may become out of date, depending on when it was uploaded to Mix'n'Match and how much data cleaning has since occurred in the original database from which it was harvested. As the Mix'n'Match set gets added into Wikidata, the duplicate IDs can migrate into Wikidata and thus appear in the Constraint Violation report.

An example of this is the Te Papa Artists Mix'n'Match import https://mix-n-match.toolforge.org/#/catalog/362, which feeds into the Wikidata property P3544 Te Papa Agent ID. An example from the report as at 15 September 2020 includes:

By opening the links to both Te Papa database entries for Fredrick Hermann Otto Finsch you can see that while the first is live, the second returns Page not found. This indicates that Te Papa has since identified the duplicate entry and merged the two records, leaving Te Papa Agent ID 11983 as the live entry.
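To make the idea of a single value violation concrete, here is a small Python sketch that finds items carrying more than one ID for the same identifier property. It works on a plain list of (Q number, ID) pairs, such as you might export from a spreadsheet. The Q numbers and the second ID are hypothetical placeholders; only 11983 comes from the example above.

```python
from collections import defaultdict


def find_single_value_violations(statements):
    """Return {item: [ids]} for every item that has more than one distinct ID."""
    ids_by_item = defaultdict(set)
    for qid, external_id in statements:
        ids_by_item[qid].add(external_id)
    return {
        qid: sorted(ids)
        for qid, ids in ids_by_item.items()
        if len(ids) > 1
    }


# Hypothetical data: Q_EXAMPLE_1 carries a live ID and a stale duplicate.
statements = [
    ("Q_EXAMPLE_1", "11983"),
    ("Q_EXAMPLE_1", "99999"),  # hypothetical merged / dead ID
    ("Q_EXAMPLE_2", "12345"),
]
violations = find_single_value_violations(statements)
print(violations)
```

Each item returned is one to check by hand: open both IDs in the original database and see which one is still live.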

First, clean this up in Mix'n'Match. The Page not found entry needs to be unlinked from the Q number and changed to N/A. To do this work you need the Wikidata tool that shows Mix'n'Match entries on Wikidata items. You add a piece of code to your common.js page, and that code pulls up every Mix'n'Match value matched to that Q item in Wikidata for you to check. It's useful here because it lets you go straight to the relevant Mix'n'Match value and clean it.

To make all the other possible Mix'n'Match authority control ID suggestions appear at the top of a Wikidata record / item, add this line to your common.js page:

importScript( 'User:Magnus_Manske/mixnmatch_gadget.js' );

  • Then Save.
  • Then refresh the page or clear your browser's cache.

Examples to test that you've successfully added the code to your common.js page

Once the code is running successfully, you can see all the Mix'n'Match entries for this item, including the duplicates from the same database.

If you click on the hyperlink on the word entry it will pull up the Mix'n'Match entry for that ID. Double-check that it's the one that returns Page not found in the original database. If it is, then in the Mix'n'Match entry go to the Matched To section and click on Remove. This removes the Mix'n'Match link. Then click on N/A to correct the Mix'n'Match data.

To correct Wikidata, go to the item for the entity. Under the list of Identifiers the property should show both IDs. Edit the statement and remove the ID that is no longer valid.

The report hasn't been updated for a while. I'm not sure how the refresh occurs; it looks like it might be a bot that runs daily.

Open Refine

This is my current area of research - these notes are rough and are under development.

Top Level Workflow

  1. Upload the dataset of people and their details into a project.
  2. Set up a reference Q number for the data source and add it as a column; alternatively, check whether the data source is a Property and the IRNs are loaded into the project.
  3. Split names into family name, given name and ordinal number columns, etc.
  4. Create a description column with text.
  5. Create an Instance of column.
  6. Reconcile sex or gender.
  7. Create an also known as column with given names as initials plus family name.
  8. Create a Label column by concatenating the name columns.
  9. Reconcile off the Label column.
  10. Check the reconciliation using various facets to make sure it is accurate.
  11. Use the reconciled column to create a new column for the Wikidata QID.
  12. Reconcile the family name column.
  13. Reconcile the given name columns.
  14. Reconcile birth location.
  15. Reconcile death location.
  16. Reconcile year dates in both the birth date and death date columns, then use Edit cells → Common transforms → To date on those values that are exact birth / death dates.
  17. For Te Papa database entries, check whether the IRN from Te Papa matches the Te Papa agent ID from Wikidata.
  18. Other columns that might be needed / require reconciliation are P166 Award received and Member of.
  19. Family name / given names - new values: use the schema to create a basic Quick Statements schema for immediate upload.
  20. Schema: item = the column (drag and drop), Label en = the column, description = Family name, native label (multiple languages) = the column name, instance of = family name. Use "Stated in" as the reference.
  21. Use Reconcile → Facets → By judgement to facet on New.
  22. Once the schema and Quick Statements have been processed, reconcile again. This happens automatically if you use the Upload edits to Wikidata extension within Open Refine.
  23. For non-new entries, pull values from Wikidata into a new column and filter to work out who needs what loaded, column by column.
  24. For statements with multiple values, e.g. occupation, use Quick Statements to prevent overwriting. This also works for first names and series ordinals.

Cleaning

Cleaning dates: see https://docs.tibco.com/pub/clarity/3.1.0/doc/html/GUID-E50BB9BA-BBCB-4013-BB9A-CE10680A4461.html

Creating new column based on

To fetch URLs based on the Internal Record Number column, use the GREL expression "https://collections.tepapa.govt.nz/agent/"+value

Sorting dates

  • Create a new column using Edit column → Add column based on this column, e.g. Test birth / death date.

To ensure every year-only (YYYY) value is transformed to YYYY-00-00, use: value.toDate('yyyy').toString('yyyy-00-00')

  • Then in the Test birth / death date column use the drop-down menu to select Edit cells → Transform
  • In the ‘Expression’ box type the GREL expression value.toDate("dd/MM/yyyy") and press OK.
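For checking the logic outside Open Refine, the two date rules above can be sketched in Python. This is only an illustration of the same idea, not what GREL does internally. The function name is my own, and it assumes that every value is either a four-digit year or a dd/MM/yyyy date.

```python
from datetime import datetime


def normalise_date(raw):
    """Turn 'YYYY' into 'YYYY-00-00' and 'dd/MM/yyyy' into 'yyyy-MM-dd'."""
    raw = raw.strip()
    # Year-only values: keep the year and mark month and day as unknown.
    if len(raw) == 4 and raw.isdigit():
        return raw + "-00-00"
    # Exact dates: parse day/month/year; raises ValueError if the format differs.
    parsed = datetime.strptime(raw, "%d/%m/%Y")
    return parsed.strftime("%Y-%m-%d")


print(normalise_date("1885"))
print(normalise_date("03/11/1902"))
```

Values in any other format raise an error rather than being guessed at, which makes the rows that still need manual cleaning easy to spot.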

Reconciling

Reconciling is a combined automatic and manual check that matches your Open Refine dataset against an external dataset such as Wikidata. Example: matching a column of names of people to the Q numbers of humans (Q5).

Before you reconcile with Open Refine, make sure that there aren't duplicates in the Wikidata set:

  • Go to the dataset in Mix'n'Match.
  • Find the button called "Action" in the top right.
  • Click on the drop-down button to get the list.
  • Choose Catalog Report. This brings up a page reporting where there are errors.
  • Scroll to the bottom to resolve duplicates and errors.
  • Then run reconcile.

Creating a column for Names

When you've only got separate columns for the first, middle and family names of people, you'll need to create a new column concatenating these cell values. To create a new column from the values of other columns:

  • Click on the column drop-down
  • Select Edit column → Join columns...
  • Then select and order the column titles you wish to join together
  • Select a separator between the content (I put a space)
  • For nulls, choose Skip nulls, as this ensures the space doesn't double up between first and last names if there is no middle name.
  • Then choose Write result in new column named...
  • and enter a column title, e.g. Full Name
  • Click on OK

Open Refine will take a bit of time to process.
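The reason for skipping nulls is easiest to see in a small Python sketch of the same join. This is an illustration only, not how Open Refine implements Join columns. The function name and the sample names are made up.

```python
def join_name_parts(parts, separator=" "):
    """Join name parts with a separator, skipping empty or missing values."""
    return separator.join(
        part.strip() for part in parts if part is not None and part.strip()
    )


# With a middle name, and without one (no doubled space in the second result).
print(join_name_parts(["Mary", "Ann", "Smith"]))
print(join_name_parts(["Mary", None, "Smith"]))
```

If the missing middle name were kept as an empty string instead of skipped, the result would be "Mary  Smith" with two spaces, which then fails to match the label in Wikidata.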

Ordering Columns in Open Refine

To organise your columns in the order you want:

  • Go to the first column, labeled “All” and click the drop down arrow button on the column header
  • Select the option “Edit columns”
  • Choose “Re-order/remove columns”
  • Order the columns as you wish.

Copying a Column

Reconciling a column will replace the original value with the value imported from the external database. For checking purposes it can be useful to have two copies of the same column: one with your original data and one after reconciliation. I did this for my people and organisation names so I could do a visual check of my disambiguation.

To copy data into a new column using Add column based on this column:

  • Select the drop down arrow in the column header
  • Select Edit Column
  • Select Add Column Based On This Column
  • Add the new column name
  • Select the radio button "Copy Value from Original Column". This will copy the original value from the column even if the column that you are copying from has already been reconciled.
  • Click on OK

Reconciling Names

To reconcile my dataset column of person names with Wikidata Q5 Human:

  • Select the drop-down menu next to the Full Name column header
  • Click on Reconcile
  • Click on Start reconciling
  • Select the external service (https://wikidata.reconci.link (en))
  • A new window will pop up. This will pre-pick a list of possible Q numbers for you
  • Select Q5 Human
  • Also add relevant details from other columns
  • Pick Internal Record Number by ticking the checkbox and add a Wikidata property to check against. I have a column called Internal Record Number from my database that relates to P3544 Te Papa agent ID.

You will note that one box has been pre-selected: Auto-match candidates with high confidence. This automatically matches an entry with a Wikidata item. It also gives you the ability to filter (facet) your reconciliation by Wikidata system match confidence. So, if you wanted to see only those items where a match is considered to have less than 50% confidence, you could filter on this and double-check those results.

  • Click on Start Reconciling

It may take some time to reconcile the column depending on how many rows you have. Open Refine will give you a progress statement at the top of the page.

  • Once the reconciling is complete you should review the data. On some occasions the reconciling won't pick up the Q number despite having the IRN. This may be because the item lacks the expected "instance of" value: Q5 (human) for people, or Q672070 (photographic studio) for photographic studios. You can still click on "search for match" to see what comes up and link the item.
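Behind the scenes, the choices made in the reconcile window (the name, the type Q5, and the Internal Record Number checked against P3544) are sent to the reconciliation service as a query. The Python sketch below builds such a query without sending it. The field names follow my understanding of the Reconciliation Service API ("query", "type", "properties" with "pid" and "v"), so treat them as an assumption. The helper name is my own.

```python
import json


def build_reconciliation_query(name, irn):
    """Build one query: a name, restricted to humans, with the IRN as extra evidence."""
    return {
        "query": name,                              # text from the Full Name column
        "type": "Q5",                               # restrict candidates to humans
        "properties": [{"pid": "P3544", "v": irn}]  # Te Papa agent ID to check against
    }


# One query per row, keyed q0, q1, ... as a batch.
queries = {"q0": build_reconciliation_query("Fredrick Hermann Otto Finsch", "11983")}

# The batch travels as a JSON string in a form field called "queries".
form_body = {"queries": json.dumps(queries)}
print(form_body["queries"])
```

Seeing the query this way explains the behaviour noted above: if the Wikidata item has no "instance of" Q5 statement, the type restriction can exclude it even when the IRN matches.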

Facets

Faceting gives you the opportunity to break your dataset up into smaller sets so you can clean the data more easily. After reconciliation you may want to see only the rows where the data was blank and couldn't be reconciled. You do this via facets.

To facet the people name column that you reconciled with Wikidata, you may want to create a facet for "Reconciliation by Wikidata system match confidence". To do this:

  • Click on the drop down arrow in the column header
  • Select Reconcile
  • Select Facets
  • Select Best Candidate's Score
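What the Best Candidate's Score facet is doing can be sketched in Python as a filter over each row's candidate scores. The row structure, the names and the scores below are hypothetical. The threshold of 50 mirrors the "less than 50% confidence" example used when reconciling names.

```python
def best_candidate_score(candidates):
    """Return the highest score among a row's candidates, or None if there are none."""
    return max((candidate["score"] for candidate in candidates), default=None)


def rows_needing_review(rows, threshold=50):
    """List the names whose best candidate scores below the threshold (or has no candidate)."""
    flagged = []
    for row in rows:
        score = best_candidate_score(row["candidates"])
        if score is None or score < threshold:
            flagged.append(row["name"])
    return flagged


# Hypothetical reconciliation results.
rows = [
    {"name": "Person A", "candidates": [{"score": 100}, {"score": 42}]},
    {"name": "Person B", "candidates": [{"score": 35}]},
    {"name": "Person C", "candidates": []},
]
print(rows_needing_review(rows))
```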

To find out about more facets you can use, see this YouTube video: https://www.youtube.com/watch?v=0tQPmfb6IFk&t=658s (snippet 1.54-6.51)

Tips:

  • When faceting / filtering and then data cleaning, don't forget to refresh and then reconcile the items again with the newly entered data.

After Reconciling - faceting

To facet on reconciliation values, click on Reconcile → Facets → By judgement

After Reconciling - pulling information from Wikidata

This will only work with columns that have been reconciled with Wikidata. To add columns pulling information from Wikidata into your project:

  • Click on the drop down arrow in the column header you've reconciled
  • Click on Edit column
  • Click on Add columns from reconciled values

It will come up with a list of possible properties to add. You can also input your preferred property number if it's not in the list. See https://www.youtube.com/watch?v=0tQPmfb6IFk&t=658s

After Reconciling - To get Wikidata IDs (Q Numbers) in another column

  • Click on the drop down arrow in the column header you've reconciled
  • Click on Edit column
  • Click on Add column based on this column
  • A new window will appear
  • Add the New Column Name e.g. Wikidata ID
  • In the Expression field - replace the text value with cell.recon.match.id
  • Click on OK and the new column with Wikidata ID will appear.

After Reconciling - Filtering data by Sort

  • Sort by
  • Sort by Numbers smallest first

After Reconciling - Faceting by value

  • Facet
  • Customized facet
  • Facet by null

Marking up Disambiguated People Rows

  • Use Star for approved and checked rows
  • Use Flag for those to delete from the database

Remove Flagged Rows

  • Filter to get all Flagged rows only
  • Drop down All header
  • Edit Rows
  • Remove matching rows

Changing date rows

  • Copy the date column to a new column so you can compare
  • Go to the column you would like to change the format of and click the arrow button on the column header.
  • Choose “Edit cells” and then select “Common transforms”.
  • Under “Common transforms”, choose “To date”.

How Do I

How do I undo in Open Refine? Click on Undo/Redo and then click on the entry next to the numbered checkbox. That way you can move up and down through the changes by clicking. In the column drop-down menu, Reconcile → Actions → Clear reconciliation data removes all reconciling work. Discard reconciliation judgments removes the links but keeps the prospective matches.

How do I remove a row when the record isn't suitable or doesn't have enough quality data to provide to Wikidata? Filter it out of the results. It will still be there, but changes will only apply to the filtered results.

How do I bulk update when data cleaning in Open Refine? I am currently updating cell by cell when data is absent and reconciling can't occur.

Bulk Upload of New Entries

When you reconcile a column and review the matching options from Wikidata, one of the options is "Create New Item". If you select this, then when you create your schema and run the upload to Wikidata, new items will be created for the rows marked that way.

To bulk upload edits to Wikidata you need to create a schema.

Before you can create a schema you need to have reconciled columns.

Typically, my photographers reconciliation will be:

Photographers Name column

With terms

  • Label en (for English): the Photographers Name column. If you check "override if present", any existing label will be overwritten by your value. Best not to check it.
  • Description en (for English): photographer. Again there is an option to override if present. Best not to check it.

Statements: use the standard statement properties and values for photographers that I determined here: https://www.wikidata.org/wiki/User:Einebillion

Once you've saved your schema it should show as a tab at the top of the Open Refine main window. You can also see an Issues tab and a Preview tab.

Use the Issues tab and the Preview tab to make sure that your data is all correct.

To upload, select the Upload edits to Wikidata option from the Extensions drop-down on the right-hand side. Open Refine will use the schema to create a new Q value for each new photographer or photographic studio.

Tips and Hints

  • When trying to export to Quick Statements make sure you're logged into Quick Statements as it won't generate a download without access.

Expressions

Comparing column values

The Expression text is: if(cells["a"].value == cells["b"].value, "Y", "N")
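For anyone checking the same comparison outside Open Refine, here is a rough Python equivalent of the expression. It treats a row as a dictionary of column names to values. The column names "a" and "b" come from the expression above; the sample values are illustrative.

```python
def compare_columns(row, first="a", second="b"):
    """Return "Y" when the two columns hold the same value, otherwise "N"."""
    return "Y" if row.get(first) == row.get(second) else "N"


# e.g. column a = IRN from the source database, column b = ID pulled from Wikidata
print(compare_columns({"a": "11983", "b": "11983"}))
print(compare_columns({"a": "11983", "b": "11984"}))
```

Note that the comparison is exact: the text "11983" and the number 11983 count as different, so make sure both columns are stored as the same type before comparing.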

Questions to figure out

You have to do more work: see https://www.wikidata.org/wiki/Wikidata:Tools/OpenRefine/Editing/New_items

You can use the drop-down menu Reconcile → Facets to show different types of facets. Check out video 2 to learn the details of this. Best candidate's score: you can use this to filter for unmatched rows that have a good to likely candidate score, by setting the range (see second video).

Then use Add column based on this column, name it Wikidata, and enter cell.recon.match.id. Clicking OK creates a column containing the Q number.

Where am I at the moment? I'm working on the dataset under Role, flagging and removing rows that are not photographers. How do I do this? I sort the Roles column a-z, then search through the different roles, flagging and removing as I go.

Exporting Open Refine Info to Google Sheets and Setting Up for Quick Statements

  • Export data via Export CSV in Open Refine
  • Open Google Sheets and import the CSV file
  • Set up a column with all cells holding the reference property S854 (reference URL)
  • Set up a second column with the URL linking back to the original database entry for the person or organisation.

If you've only got the IRN, and the IRN is the distinguishing value in the URL, use & to join the parts of the URL together. An example is ="https://collections.tepapa.govt.nz/agent/"&A2

where A2 is the cell holding the Te Papa Agent ID.

Don't forget you'll also need quote marks around the URL for Quick Statements to accept it. To add them use CHAR(34), which is the character code for a double quote mark.

The final formula should look like =CHAR(34)&"https://collections.tepapa.govt.nz/agent/"&A2&CHAR(34)

Once you've got the columns for the reference URL property and the reference URL value in the spreadsheet, create columns for the Wikidata properties that reflect the values you want to add. In addition, work out the Q value for each value you want to add.

e.g. if you want to add the value photographic studio (Q672070), it goes with the property instance of (P31).



Quick Statements Tool

To Add Statements to Existing Wikidata Items using Quick Statements

Once you have a dataset in Google Sheets and you are happy with it, you can use Quick Statements to add the data to Wikidata. To get the spreadsheet information into Quick Statements:

  • Go to Quick Statements
  • Log in
  • Click on the New Batch button
  • Copy the Google spreadsheet from the Q number column onwards. The columns should be in the order Item (Q number), Property, Value, Reference Property, Reference Value
  • Paste into Quick Statements
  • Click Import V1 commands
  • Then change the number at the bottom right to 100 or 500 instead of 10
  • Then check and press Run
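The spreadsheet row that gets pasted in is just five tab-separated values. As a cross-check on the Google Sheets formulas, here is a minimal Python sketch that builds one such line, including the quoted reference URL. The function name is my own. Q4115189 is the Wikidata Sandbox item, used here only as a stand-in for a real Q number, and 11983 is the Te Papa Agent ID from the earlier example.

```python
TE_PAPA_AGENT_URL = "https://collections.tepapa.govt.nz/agent/"


def quickstatements_line(qid, prop, value, irn):
    """Build one V1 line: Item, Property, Value, Reference Property, Reference Value."""
    # The reference URL is a string value, so it is wrapped in double quotes,
    # which is what CHAR(34) does in the spreadsheet formula.
    reference_url = '"' + TE_PAPA_AGENT_URL + str(irn) + '"'
    return "\t".join([qid, prop, value, "S854", reference_url])


# Stand-in item (Wikidata Sandbox), instance of (P31) photographic studio (Q672070).
line = quickstatements_line("Q4115189", "P31", "Q672070", 11983)
print(line)
```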

To Create new items in Quick Statements

You can create new items by inserting a line consisting only of the word "CREATE". To add statements to the newly created item, use the word "LAST" instead of the Q number, and the statement will be added to the last created item. An example, where TAB is the tab character between each value:

CREATE

LAST TAB Len TAB "[insert Firstname Lastname]" = sets the label. The label must be within quotation marks.

LAST TAB Den TAB "[insert description text]" = sets the description. The description must be within quotation marks.

LAST TAB P6683 TAB "[insert Alexander Turnbull Library Unique ID]" = adds the identifier. As a text value, it also goes within quotation marks.
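Typing these lines by hand for many people is error-prone, so here is a small Python sketch that generates the CREATE block for one new item. The function name and the sample label, description and ID are hypothetical. Quoting the identifier is my assumption that Quick Statements expects text values, including external IDs, in double quotes.

```python
def create_item_commands(label, description, atl_id):
    """Return the V1 commands that create one item with a label, description and ID."""
    rows = [
        ["CREATE"],
        ["LAST", "Len", '"' + label + '"'],        # English label
        ["LAST", "Den", '"' + description + '"'],  # English description
        ["LAST", "P6683", '"' + atl_id + '"'],     # Alexander Turnbull Library ID (assumed quoted)
    ]
    # Values within a command are separated by tabs, commands by new lines.
    return "\n".join("\t".join(row) for row in rows)


# Hypothetical person and ID, for illustration only.
commands = create_item_commands("Jane Example", "New Zealand photographer", "00000")
print(commands)
```

Run it once per row of new people and paste the combined output into a new batch. Each CREATE starts a fresh item, so the LAST lines always attach to the right one.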