Wikidata:WikidataCon 2017/Submissions/An open source tool for fishing Wikidata entities in text and PDF documents

This is an Open submission for WikidataCon 2017 that has not yet been reviewed by the members of the Program Committee.

Submission no. 57
Title of the submission
An open source tool for fishing Wikidata entities in text and PDF documents

Author(s) of the submission

Patrice Lopez

E-mail address

patrice.lopez@science-miner.com

Country of origin

France

Affiliation, if any (organisation, company etc.)

SCIENCE-MINER & Inria Paris


Type of session
  • Talk (the usual conference format. 45min, large audience)

or

  • Demo (10min, similar to a lightning talk but focused on one of the tools people use to edit Wikidata or reuse the data)
Length of session

30min (preferred) or 10min

Ideal number of attendees

10-25

EtherPad for documentation
https://etherpad.wikimedia.org/p/WikidataCon-fishing

Abstract

entity-fishing (repo: https://github.com/kermitt2/nerd, demo: http://entity-fishing.science-miner.com, documentation: http://nerd.readthedocs.io) is an open source tool dedicated to the automatic identification and disambiguation of Wikidata entities in multilingual text and PDF documents. The tool is based on machine-learning techniques that use Wikipedia as a training source. entity-fishing offers high performance and scalability and is fully generic in terms of domains and languages, so it can address a large variety of uses. Our work focuses in particular on processing scholarly documents, taking advantage of the massive amount of scientific knowledge and links present in Wikidata.

A view of entity-fishing demo console.
What will attendees take away from this session?

We think that this tool can be useful to the Wikidata community, in particular for using Wikidata to process text in a wide range of applications such as semantic search, key-concept and key-category extraction from documents, knowledge extraction, document filtering and linking, etc. It can also help to exploit Wikidata as a scientific data hub, linking further to the external scientific databases referenced in Wikidata. In addition, thanks to the integrated GROBID automatic extraction of bibliographical references, it can support projects like WikiCite by associating citations with key concepts and Wikidata items for further human curation.

Slides or further information

As most of the information created by humankind is only available in textual form, tools that automatically identify which Wikidata items are mentioned in a text could be very useful for many applications and for exploiting the richness of Wikidata at large scale. However, entity linking is difficult because of the inherent ambiguity of language, the heterogeneity and noisiness of text data, and combinatorial explosion. In addition, as our primary goal is text mining in scholarly content, the textual content is very often locked into PDF (for example, in the ISTEX project, around 92% of 18 million scientific articles from the mainstream publishers are only available in PDF).

To address these issues, we have developed entity-fishing, a machine-learning tool for identifying Wikidata entities in text and PDF documents. The tool applies state-of-the-art entity recognition and disambiguation (NERD) techniques, with Wikipedia articles as the source for statistics and training. Using the context in which each recognized mention appears, the tool selects an entity and associates it with a confidence score and, optionally, n-best results. Our main machine-learning techniques are Random Forest, gradient tree boosting and CRF models.
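The selection step described above (one entity per mention, with a confidence score and optional n-best alternatives) can be pictured with a minimal sketch. This is not the tool's actual ranker: the `Candidate` class, the `select_entity` function and the scores are hypothetical, and the scores are assumed to come from some trained model. Only the two Wikidata identifiers are taken from this page.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A candidate Wikidata item for one mention (hypothetical structure)."""
    wikidata_id: str
    score: float  # confidence in [0, 1], assumed to be produced by a ranker


def select_entity(candidates, nbest=1, min_score=0.0):
    """Return up to `nbest` candidates, best first, dropping low scores."""
    kept = [c for c in candidates if c.score >= min_score]
    kept.sort(key=lambda c: c.score, reverse=True)
    return kept[:nbest]


# Invented scores for a mention such as "Cepheid variable stars".
candidates = [Candidate("Q6243", 0.42), Candidate("Q188593", 0.91)]
best = select_entity(candidates)            # single best entity
top2 = select_entity(candidates, nbest=2)   # n-best alternatives
```

Asking for `nbest=2` keeps the runner-up available for later human curation instead of discarding it.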

entity-fishing annotations on a scholarly article in PDF.

entity-fishing is open source software distributed under the Apache 2 license, with dependencies only on libraries under Apache 2 or compatible licenses. It can therefore be used for any purpose, including commercial applications, without any constraints. In addition to exploiting and linking to Wikidata, our implementation tries to go beyond existing NERD research prototypes by providing high robustness, performance and a flexible query language that allows various customizations and filtering. The tool can process 1000 words per second, between 1 and 2 PDF pages per second, or a multi-term search query in less than 5 ms. It is also multilingual, exploiting both the language-independent information of Wikidata and the language-specific resources of Wikipedia. It currently supports English, French and German (Spanish and Italian are in preparation).

entity-fishing also performs structure-aware annotation of PDF documents. It is coupled with another of our open source tools, GROBID, which can transform a PDF document into a structured XML representation appropriate for text mining. For instance, sections such as bibliographical references, figures, head notes or tables can be ignored in the annotation process or be subject to different specialized processing.
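As a rough illustration of structure-aware processing, the sketch below skips some structural zones before annotation. It assumes the document has already been converted into `(label, text)` pairs; the section labels and the `annotatable_sections` function are hypothetical and do not reflect the actual GROBID output schema.

```python
# Hypothetical zone labels; the real structured representation differs.
IGNORED_SECTIONS = {"references", "figure", "table", "headnote"}


def annotatable_sections(sections, ignored=IGNORED_SECTIONS):
    """Keep only the (label, text) pairs that should be sent to annotation."""
    return [(label, text) for label, text in sections if label not in ignored]


document = [
    ("title", "Pulsating variable stars in nearby galaxies"),
    ("paragraph", "We observed several Cepheid variable stars."),
    ("references", "[1] A. Author. Some cited work."),
    ("table", "Object  Period  Magnitude"),
]
to_annotate = annotatable_sections(document)
```

The same partition could instead route the ignored zones to a specialized process, for example sending the `references` zone to a citation parser.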

An additional benefit of structural processing of PDF documents is the ability to preserve layout coordinates and to annotate the PDF directly. Instead of traditional text mining that feeds a database with annotations, the tool can expose the annotations and the Wikidata information directly on the PDF, the preferred format for human readers.

If selected for a long communication, in addition to a general presentation/demo of the software, we propose to present 3 original use cases that we find of interest to the Wikidata community and that illustrate the possibilities of the tool:

Automatic identification of species

This use case illustrates how to use the tool as a general-purpose scientific annotator that takes advantage of the rich Wikidata content. By specifying in the query a filter on a Wikidata property (e.g. the mandatory presence of P225, taxon name) and a more aggressive threshold than the default one, we obtain an automatic annotator for species names which additionally disambiguates species mentions to their entities in Wikidata, and which is competitive with specialized tools like LINNAEUS.
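The effect of such a filter can be sketched as a post-processing step over annotation results. This is only an illustration under assumptions: the dictionary layout, the `filter_entities` function, the scores and the threshold value are hypothetical, and the actual tool applies the filter through its query language rather than through code like this.

```python
def filter_entities(entities, required_property="P225", threshold=0.5):
    """Keep entities that carry `required_property` and reach `threshold`."""
    return [
        e for e in entities
        if required_property in e["properties"] and e["score"] >= threshold
    ]


# Invented annotation results; scores and property sets are illustrative.
annotations = [
    {"mention": "lion", "id": "Q140", "score": 0.8, "properties": {"P225", "P31"}},
    {"mention": "Paris", "id": "Q90", "score": 0.9, "properties": {"P17", "P31"}},
    {"mention": "lion", "id": "Q140", "score": 0.3, "properties": {"P225", "P31"}},
]
species = filter_entities(annotations)
```

Only the first entry survives: the second has no taxon name (P225), and the third falls below the chosen threshold.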

Automatic mapping of the UAT thesaurus to Wikidata

The Unified Astronomy Thesaurus (UAT, http://astrothesaurus.org, https://github.com/astrothesaurus/UAT) is used by the SAO/NASA Astrophysics Data System to organize publications in Astronomy and Physics. We used entity-fishing to automatically map the thesaurus entries to Wikidata items when possible. Around half of the entries could be fully disambiguated and mapped to Wikidata with very high precision (e.g. Cepheid variable stars mapped to Q188593). The rest could be mapped only partially, because the full entities were not covered by Wikidata (for example, for Pulsating variable stars, only variable stars was mapped to Q6243).
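The difference between a full and a partial mapping can be shown with a small sketch. It swaps the tool's statistical disambiguation for a plain dictionary lookup over the longest matching sub-phrase, so it only illustrates the outcome, not the method. The `LABEL_TO_ITEM` table and the `map_entry` function are hypothetical; the two label/item pairs are the examples given above.

```python
# Toy label index holding only the two examples mentioned in the text.
LABEL_TO_ITEM = {
    "cepheid variable stars": "Q188593",
    "variable stars": "Q6243",
}


def map_entry(entry, labels=LABEL_TO_ITEM):
    """Return (matched phrase, item id, is_full_match), or None if no match.

    Tries the longest word sequences first, so a full match wins over a
    partial one.
    """
    words = entry.lower().split()
    n = len(words)
    for length in range(n, 0, -1):
        for start in range(n - length + 1):
            phrase = " ".join(words[start:start + length])
            if phrase in labels:
                return phrase, labels[phrase], length == n
    return None


full = map_entry("Cepheid variable stars")
partial = map_entry("Pulsating variable stars")
```

Here `full` is flagged as a complete match, while `partial` only covers the sub-phrase "variable stars", which marks the entry as a candidate for completing Wikidata.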

The slides.

This exercise suggests several interesting continuations:

  • complete Wikidata so that it fully covers the topics used by the UAT,
  • use entity-fishing for automatic document classification and key topic extraction, exploiting the documents already classified in the UAT,
  • document the UAT by linking examples and Wikidata/Wikipedia entries.

Processing natural language queries in Astrophysics

entity-fishing can process search queries (a sequence of one or several search terms) by disambiguating the terms in context, either using several terms together or via the entities from previous search queries or from previous search results (relevance feedback). We have started to experiment with entity-fishing for processing Natural Language (NL) queries in Astronomy. Very large and rich databases exist in astronomy. For instance, SIMBAD covers more than 9 million astronomical objects with many properties and pictures. These databases are difficult to access for non-expert users. entity-fishing exploits Wikidata to spot the common concepts of a spoken NL query, which can then be used to create more sophisticated SQL queries on several target databases.
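The last step, turning spotted concepts into a database query, might look like the following sketch. Everything about the target database is an assumption made for illustration: the `objects` table, its columns, the `CONCEPT_TO_TYPE` mapping and the two rows are hypothetical toy data and do not describe the SIMBAD schema. The concept identifiers are assumed to have been produced beforehand by the disambiguation step.

```python
import sqlite3

# Hypothetical mapping from a spotted Wikidata concept to a column value.
CONCEPT_TO_TYPE = {"Q6243": "variable star"}


def build_query(concept_ids, mapping=CONCEPT_TO_TYPE):
    """Build a parameterized SQL query from the concepts we know how to map."""
    types = [mapping[c] for c in concept_ids if c in mapping]
    if not types:
        return None, []
    placeholders = ", ".join("?" for _ in types)
    sql = (
        "SELECT name FROM objects "
        f"WHERE object_type IN ({placeholders}) ORDER BY name"
    )
    return sql, types


def run_demo():
    """Run the generated query against a tiny in-memory toy database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE objects (name TEXT, object_type TEXT)")
    conn.executemany(
        "INSERT INTO objects VALUES (?, ?)",
        [("Delta Cephei", "variable star"), ("Sirius", "star")],
    )
    sql, params = build_query(["Q6243"])
    rows = conn.execute(sql, params).fetchall()
    conn.close()
    return [row[0] for row in rows]
```

Passing the values as query parameters rather than formatting them into the SQL string keeps the generated query safe even when the mapping is extended from external data.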

Slides

The slides are on the right. Compared to the slides actually presented, they have been slightly augmented with screen captures corresponding to the live demos.

Interested attendees

If you are interested in attending this session, please sign with your username below. This will help reviewers to decide which sessions are of high interest.

  1. Ijon (talk) 04:01, 1 August 2017 (UTC)
  2. --YULdigitalpreservation (talk) 11:46, 1 August 2017 (UTC)
  3. --Sky xe (talk) 14:33, 23 October 2017 (UTC)
  4. Daniel Mietchen (talk) 22:33, 27 October 2017 (UTC) ...