Wikidata talk:WikiFactMine/Annotation for fact mining
"It is planned to set up a MediaWiki site, which would host the full text of open access scientific papers" Would Wikisource be a viable solution? Are annotations allowed on Wikisource? — Finn Årup Nielsen (fnielsen) (talk) 19:10, 17 October 2017 (UTC)
- Good questions.
- There is a proposed annotation guideline on English Wikisource: s:Wikisource:Annotations. It was mostly developed in 2013. It is not clear the community will accept it fully.
- There is a history, of around three years now, of proposals for posting scientific papers to Wikisource. A discussion at the 2015 Vienna Wikisource conference showed that this kind of content (i.e. papers already digitised) is not so interesting to the existing Wikisource communities, who work on digitisation. On the English Wikisource, there has been some resistance to mass imports of files for current scientific papers, from 2014.
- So there is an existing context. While Wikimedia should be closely allied to open source science, we are not currently seeing how it would work on existing sites. I have been on English Wikisource since 2009, and I know the community there does not take decisions quickly. My interpretation, though, is that this "change of use" is not really going to be acceptable.
- In other words, while Wikisource might at some point in the future host this kind of material, currently the initiative to do this work there has stalled. There was some discussion at the Wikisource meetup this year at Wikimania. It clarified the position a little for me. But my conclusion is the same: we should not hope for a direct way to the goal of hosting current open access papers on Wikimedia sites, even though it would be desirable for some purposes (e.g. referencing on Wikipedia). Charles Matthews (talk) 08:26, 18 October 2017 (UTC)
The specific and the general
As far as the process goes, I think it would be worthwhile to work through a few examples of scientific papers by hand and discuss what kind of Wikidata statements can be extracted from them. Having clear examples and goals would make it a lot easier to design a system that fulfills those goals.
While we are at goals, I already voiced my opposition to 3D ontology in the past. This tool is not going to be welcome when it makes statements that are triples. At a minimum the code of a statement would be in the form "Q30123456 A321 Q7654321 S248 Q458935045". Many claims in scientific papers are also going to be more complex than that. ChristianKl (talk) 16:54, 18 October 2017 (UTC)
- Of course, point taken. The annotations will be on the paper, so all the metadata needed to create a reference will already be in the same data structure. The actual addition to Wikidata would, we hope, be by passing everything to QuickStatements, and some quite long suffix to the triple would be added at that point. Charles Matthews (talk) 09:58, 19 October 2017 (UTC)
- As for illustration, the SciLite system might be a good starting point. We would obviously want to use Wikidata, and have various decisions taken by a community. They use the same W3C standard we would use, so the underlying data model would be comparable. We would use code in annotations more freely; or at least our system would be more open, with separation of content and presentation. Current WikiFactMine work draws on that repository, so some of the major uses of properties would be the same. Briefly, a Wikimedia version of that. Charles Matthews (talk) 10:12, 19 October 2017 (UTC)
- So, do you mean "front end", or "data structure"? The way we are approaching this, we have a particular goal in mind, which is to address the referencing issue for biomedical facts in Wikidata. For that application, we need to draw in data, to a structure set up for the particular paper, and thinking about particular Wikidata properties. Then we run an algorithm. The output then can be added to Wikidata.
- On the other hand, we are proposing a system that might have a back end very like that of https://web.hypothes.is/, which is a large existing system (see https://web.hypothes.is/blog/2-million-annotations/ which is from a few days ago). There is a Hypothesis client on GitHub, which is open source, naturally: https://github.com/hypothesis/client . In terms of what we might do, we are probably intending to adapt this client, taking into account the need to handle machine-readable content. I suppose we'd do a prototype front end: but I'm not the ContentMine developer. Charles Matthews (talk) 08:29, 23 October 2017 (UTC)
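To make the "triple plus reference suffix" idea from this thread concrete, here is a minimal sketch in Python. It assumes an annotation record loosely modelled on the W3C Web Annotation Data Model and turns it into one tab-separated line in the QuickStatements version 1 style, where `S248` is the source form of "stated in" (P248). The body layout, the use of a Wikidata item ID as the target source, the function name `to_quickstatements`, and all Q/P identifiers are illustrative assumptions, not the actual WikiFactMine data structure.

```python
# Hypothetical annotation on a paper, loosely following the W3C Web
# Annotation Data Model. All Q/P identifiers are placeholders taken from
# the example in the discussion, not real Wikidata statements.
annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    "body": {
        # Assumed custom body: the candidate statement as a triple.
        "subject": "Q30123456",
        "property": "P321",
        "object": "Q7654321",
    },
    "target": {
        # Assumption: the paper is identified by its own Wikidata item.
        "source": "Q458935045",
        "selector": {
            "type": "TextQuoteSelector",
            "exact": "supporting sentence from the paper",
        },
    },
}


def to_quickstatements(anno):
    """Build one tab-separated statement line with a 'stated in' reference."""
    body = anno["body"]
    paper = anno["target"]["source"]
    # Triple first, then the reference suffix: S248 (stated in) + paper item.
    fields = [body["subject"], body["property"], body["object"], "S248", paper]
    return "\t".join(fields)


if __name__ == "__main__":
    print(to_quickstatements(annotation))
```

Because the annotation sits on the paper, the reference item is available in the same record as the triple, so the suffix can be appended mechanically. More complex claims (qualifiers, several references) would need further fields that this sketch does not cover.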
I have a suggestion (or more of a question). Why are we starting with open-access papers instead of existing good-quality Wikipedia articles that are already well referenced? Starting from a limited dataset (such as articles about persons, places, etc.), extracting triples and loading them into Wikidata may be a good start. John Samuel 17:52, 21 October 2017 (UTC)
- @Jsamwrites: Wikipedia is not a reliable source for the purposes of Wikidata and Wikipedia. On the other hand, peer-reviewed open-access papers are good sources. ChristianKl (talk) 23:33, 21 October 2017 (UTC)
Indeed, the underlying idea, which has been around for at least three years, is to mine the scientific literature directly. ContentMine does this now for new papers, around one month after publication. There are other proposals around for "recycling" Wikipedia content into Wikidata, but there are still issues. In particular, one has to check that what is said in the cited paper actually supports the factual claim.
Fact mining has more than one application, because it can act as a type of advanced search engine also. See Wikidata:Property proposal/cell line used for a use case that is under discussion now. Charles Matthews (talk) 08:36, 23 October 2017 (UTC)