Wikidata:WikiProject LD4 Wikidata Affinity Group/Affinity Group Calls/Meeting Notes/2021-02-23

Call details

Presentation Materials

Notes

  • Bibliographic/book data into Wikidata
    • At National Library of Wales for ~6 years
    • Started sharing data for artworks, artists, etc.
    • In 2019 had an opportunity to share catalog data
    • Trial on a sizeable scale to determine what is possible
    • Has about 100,000 items on Wikidata for items about or connected to the collection
    • 50,000 items are about books in the Welsh bibliography
  • Example of a book's bibliographic record in a library catalog (MARC 21): fields and text strings
    • Subjects and authors can link to authority files, but no guarantee it’s a unique item
    • A lot of publishers (for example) are text strings, which need to be mapped to Wikidata “things”
    • In Wikidata, text strings become items (places, publishers, languages, etc.; example: Cardiff)
    • Three editions in a library catalog are three records, with no guarantee they’re presented the same way
    • In Wikidata there is a central literary work item, and editions or translations are separate items that all connect to it, which gives more connectivity and structure than MARC 21 catalog records
  • Mapping Book Data
    • Chose easiest data, 50,000 most popular authors
    • Exported to CSV
    • Needed to disambiguate authors and publishers
    • Used OpenRefine to match authors to existing author items in Wikidata; there are multiple ways to do this (names, titles, dates, etc. are all possible matching points)
    • VIAF etc. also allowed them to create items for authors
    • Not possible to match for all authors, some items still have “author string”
    • Then created Wikidata item for Works
    • They then need to be connected to Editions/Translations/Etc.
    • Done with a combination of direct uploads and QuickStatements to add additional data
    • Lot of help from Simon Cobb, visiting Wikidata scholar
  • Challenges
    • Finding information on authors and publishers
    • Looking at national bibliography of Wales (lots of modern books, earlier books don’t have unique identifiers, obscure publishers, etc.)
    • Finding good information challenging, focused on 100 most common publishers
    • Potential copyright and license issues with catalog data
    • Some catalogers nervous about re-using data purchased from OCLC or other sources
    • Third party data may have copyright issues (gray area)
    • There may also be licensing issues
    • Scale of project, Welsh bibliography has 1,000,000+ works
    • Data maintenance: how do we automate updates from the catalog to Wikidata?
    • If we want to round-trip that data, how do we do it, and how do we monitor the quality of data added in Wikidata?
  • Benefits
    • Having richer, more accurate, structured data
    • Easily accessible and reusable data
    • People can interact and explore a huge collection
    • Connecting with and building a larger dataset
    • Can crowdsource improvements to the data from the community
    • Working towards Wikicite and sharing data
  • Some quite lovely diagrams and visualizations about publishers and publishing can be created
    • Can begin to explore relationships between authors and items
    • And authors/items/the world/history of publishing in Wales
  • Identifiers
    • Can connect identifiers from different datasets
    • One of the most powerful/useful things you can do, especially when advocating sharing data
    • Useful for round-tripping, pulling identifiers back into own catalog
    • The more institutions that match data to external datasets the more we can share and enrich catalogs
    • Already seeing data being enriched in this way
    • People adding identifiers
  • Wikidata is multi-lingual
    • Can be very powerful for people working in a country with more than one language
    • Encourages reuse of data
  • Potential future projects
    • Recently created rich metadata for ~1000 manuscripts in a separate project
    • Added subject and genre, which can allow visualization of manuscripts organized by subject and genre
    • Shows how you could build very powerful search and discovery tools by linking to entities for particular genres and subjects
    • A lot of books shared on Wikidata will be digitized and OCR’ed
    • Once you have OCR data you can use AI to determine things like subject and genre
    • Can also pull out entities from text (names, places, events, etc.)
    • National Library of the Netherlands has done this for their newspapers
    • Text can then be tagged and connected to items and external identifiers
    • Use of IIIF can allow you to overlay information onto text
    • Structured data can transform how libraries look at data
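
As an illustration of the "Mapping Book Data" steps above, here is a minimal Python sketch of how one exported catalog row might be turned into QuickStatements V1 commands for an edition item. It is not the workflow actually used at the National Library of Wales. The function names, the sample title and author, and the choice of properties (P31, P629, P1476, P577, P123, P50, P2093) and of Q3331189 as the edition class are assumptions that should be checked against current Wikidata practice before any upload. It assumes the work item already exists, since QuickStatements V1 can only refer to the most recently created item as LAST.

```python
from typing import List, Optional

# Assumed class for edition items ("version, edition or translation"); verify on Wikidata.
EDITION_CLASS = "Q3331189"


def quote(text: str) -> str:
    """Wrap a string value in double quotes, as QuickStatements V1 expects."""
    if '"' in text or "\t" in text:
        raise ValueError("value needs cleaning before export: " + repr(text))
    return '"' + text + '"'


def edition_commands(
    title: str,
    lang: str,
    year: int,
    work_qid: str,
    publisher_qid: Optional[str] = None,
    author_qid: Optional[str] = None,
    author_string: Optional[str] = None,
) -> List[str]:
    """Build tab-separated QuickStatements V1 lines for one edition (hypothetical helper)."""
    lines = ["CREATE"]

    def add(prop: str, value: str) -> None:
        lines.append("\t".join(["LAST", prop, value]))

    add("L" + lang, quote(title))                     # label in the edition's language
    add("P31", EDITION_CLASS)                         # instance of
    add("P629", work_qid)                             # edition or translation of
    add("P1476", lang + ":" + quote(title))           # title (monolingual text)
    add("P577", "+%04d-01-01T00:00:00Z/9" % year)     # publication date, year precision
    if publisher_qid:
        add("P123", publisher_qid)                    # publisher, once matched to an item
    if author_qid:
        add("P50", author_qid)                        # author matched to an item
    elif author_string:
        add("P2093", quote(author_string))            # fallback: author name string
    return lines


# Made-up row; Q4115189 is the Wikidata Sandbox item, used here as a stand-in work.
example = edition_commands(
    title="Enghraifft", lang="cy", year=1967,
    work_qid="Q4115189", author_string="A. N. Awdur",
)
print("\n".join(example))
```

The fallback from P50 to P2093 mirrors the point in the notes that some authors could not be matched and still carry an "author string".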

Questions

  • Saving time in MARC to Wikidata workflows
    • People want a programmatic way to do this, but creating mapping for authors without unique identifiers or works can be difficult. Make the initial cataloging as clean as possible (example: adding ISNI identifiers)
  • Is modelling the manuscript extensively (slide) labor intensive?
    • Were able to semi-automatically take out names and match them to people, many already in Wikidata. Fairly labor intensive but there may be ways to automate much of the work to a good degree of accuracy. Would be tricky for a giant collection of books.
  • Any plans to apply process to materials beyond books?
    • Always trying new datasets, discussing musical scores. Would love to do sound recordings or video, but you can’t share the actual recordings (likely to have copyright issues) which takes away some benefits of sharing data.
  • Any advice on theses and subject headings in Wikidata?
    • University of Edinburgh has shared a thesis collection (Ewan McAndrew)
    • A: Looking at converting ETDs into Wikidata, both proposing new subjects to the Library of Congress and creating Wikidata items (items often already exist, but are created when needed). Trying to figure out whether LCSH headings can be mapped, especially free-floating subdivisions (use the main part, or use the entire field?). If Wikidata URIs can go in MARC fields, that would document the exact Wikidata item.
    • Library of Congress has done mapping work; some headings have links, some don’t. Many subject headings don’t yet have items, and these could be created.
    • National Library of Wales has volunteers tagging photographs, would be cleaner and easier with identifiers.
    • When do you switch to Wikibase for things you can’t describe with Wikidata? Is this all scalable? Think about what you’re trying to achieve: is Wikibase or Wikidata the better option?
  • Advice for mapping books not in English?
    • Tried to make sure the language was there, and labels were correct (English versus Welsh)
  • MARC to Wikidata mappings?
    • Will do a lot of heavy lifting when it’s done, people are working on it. Universal mapping would be very useful, take care of basic stuff.
  • Any pushback from Wikidata folks?
    • No pushback; asked in several areas if it would be acceptable to add that much information. No pushback or complaints, but uploads will get bigger and bigger and changes may be needed. One of the reasons this was done was to advocate for structured data generally; Wikidata can be used to take a sample and show what could be possible in the future at a larger scale. That doesn’t mean everything goes in Wikidata, but it’s a fantastic showcase.
  • Are some of the visualizations online?
    • Shared slides in the agenda; a couple may be on Commons
  • Some charts could be interpreted as music
    • Did a hackathon where someone turned data into music
  • Has anyone considered putting preferred terms (over problematic subject headings) in Wikidata?
    • Not something Jason has had to deal with on Wikidata
    • J: would be interested in collaborating on how to open up these preferred labels. It seems the lists for preferred labels are closed or internally managed currently.
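
To make the "MARC to Wikidata mappings" question above more concrete, here is a rough Python sketch of what the "basic stuff" in a mapping table could look like. It is not an agreed or official mapping. The field-to-property pairs, the `FIELD_MAP` and `map_record` names, and the sample record are all assumptions for illustration. The record is modelled as a plain dictionary keyed by (tag, subfield) rather than read with a MARC library. The flag marks values that are still text strings and would need to be reconciled to Wikidata items (for example with OpenRefine) before upload.

```python
from typing import Dict, List, Tuple

FieldKey = Tuple[str, str]  # (MARC tag, subfield code)

# Hypothetical mapping: (tag, subfield) -> (Wikidata property, needs reconciliation?)
FIELD_MAP: Dict[FieldKey, Tuple[str, bool]] = {
    ("245", "a"): ("P1476", False),  # title, stays a text value
    ("100", "a"): ("P2093", False),  # author name string until matched to an item
    ("260", "a"): ("P291", True),    # place of publication, must become an item
    ("260", "b"): ("P123", True),    # publisher, must become an item
    ("260", "c"): ("P577", False),   # publication date
}


def map_record(
    record: Dict[FieldKey, str],
) -> Tuple[List[Tuple[str, str, bool]], List[FieldKey]]:
    """Split a record into mapped (property, value, needs_reconciliation) triples
    and a list of fields the table does not cover."""
    mapped: List[Tuple[str, str, bool]] = []
    unmapped: List[FieldKey] = []
    for key, value in record.items():
        if key in FIELD_MAP:
            prop, needs_item = FIELD_MAP[key]
            # Strip common trailing ISBD punctuation such as " :" or " /".
            mapped.append((prop, value.strip(" /:,;"), needs_item))
        else:
            unmapped.append(key)
    return mapped, unmapped


# Made-up record for illustration only.
record = {
    ("245", "a"): "Enghraifft /",
    ("260", "a"): "Caerdydd :",
    ("260", "b"): "Gwasg Enghreifftiol,",
    ("650", "a"): "Hanes",
}
mapped, unmapped = map_record(record)
for prop, value, needs_item in mapped:
    print(prop, value, "(reconcile)" if needs_item else "")
print("not covered:", unmapped)
```

Keeping uncovered fields in a separate list makes it visible how much of a record a "universal" mapping would leave for manual work, such as the subject headings discussed above.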