Wikidata talk:WikiProject Source MetaData/Archive 1


Scope

This wikiproject seems redundant with Wikidata:WikiProject Books (or at least the description is exactly the same). I agree that the name "Wikiproject Books" doesn't represent well the scope of the project, which is now more general than when it started. What do you think of merging both Wikiprojects into a more general "Wikiproject Sources"?--Micru (talk) 07:22, 9 August 2014 (UTC)

@Micru: See Wikidata:WikiProject_Periodicals too if you want to create a single project. As the current projects are not too developed, we could create one project with several tabs, as the chemistry project does. Snipre (talk) 09:17, 9 August 2014 (UTC)
The scope might be overlapping, but I think it is in the interest of Wikidata that a WikiProject that can attract members and generate constructive edits should be able to coexist. WikiProjects might even engage in healthy competition that will lead to the survival of the fittest :) --Tobias1984 (talk) 11:02, 12 August 2014 (UTC)
If this branding helps attract more users, then go ahead with it! I prefer healthy collaboration to competition, so for now I'll sign up to both :D --Micru (talk) 20:10, 12 August 2014 (UTC)

History

A little recent history. After speaking to a few people at OKFest in July about the possibility of hosting citation and bibliographic data in Wikidata, a Wikimania fringe event (Open Scholarship Weekend) led to the start of the bib2wikidata tool. See the event's etherpad for details.

There are a number of people interested in this kind of data so a Citathon (citation hackathon) took place at Wikimania last week. This WikiProject was then created. Many people have suggested similar things over the years, but now we have Wikidata so a lot more is possible. - Lawsonstu (talk) 14:15, 12 August 2014 (UTC)

Another really helpful thing for Wikidata would be ISBN and DOI lookup similar to what Wikipedia does. It is really painful to create source items or fill in the source information for a statement. --Tobias1984 (talk) 15:09, 12 August 2014 (UTC)
Agreed! - Lawsonstu (talk) 07:53, 13 August 2014 (UTC)

Added Goal

Added a goal corresponding to community stakeholdership:

to reveal, build, and maintain community stakeholdership for the inclusion and management of source metadata in wikidata.

It seems an intuitive goal to me that this project exists to show that there is an existent and growing community that will manage and make use of the data in question. Any improvements? - Mattsenate (talk) 17:24, 14 August 2014 (UTC)

Creating entries for journals before articles?

Would it be useful to create a Wikidata item for every journal that exists (that has an ISSN), as a preliminary step to having mass amounts of items for individual articles or citations? I think this would be compliant with Wikidata:Notability ('to serve as a general knowledge base for the world at large', and 'it fulfills some structural need, for example: it is needed to make statements made in other items more useful.') - Lawsonstu (talk) 13:31, 21 August 2014 (UTC)

As far as I know, John Vandenberg has created these items already. --Succu (talk) 13:44, 21 August 2014 (UTC) PS: ISSN (P236) has around 29,000 usages. --Succu (talk) 13:47, 21 August 2014 (UTC)
Well, there are still some lacunae. Just one example: Bullettino dell'Istituto Storico Italiano per il Medio Evo ISSN 0392-5242, 1127-6096. --HHill (talk) 16:57, 21 August 2014 (UTC)
On the talk page of Wikidata:WikiProject Periodicals, I have indicated the types of academic journals I have created items for; they tend to be current, internationally accepted research journals. There are many other sets of periodicals that we need to create items for, in advance of them being used in references, as Lawsonstu suggests. I am happy to assist or cheer from the sidelines. John Vandenberg (talk) 11:08, 24 August 2014 (UTC)

Source type names

So, I thought a good place to start would be to decide on source type names.

I've created a document with some possibilities side-by-side: Wikidata:WikiProject_Source_MetaData/Source_types

This is by no means complete; it has the Zotero and CSL types pretty accurately (because I copied them from what is actually used to export Zotero to CSL!).

I've also tried to include the types that are actually used on en wiki in the third column; however, matching them to their respective CSL types might need additional work. (There are a lot of unmatched types.)

Also please add other options for source types in additional columns!

Pros of using CS1 template source titles: translates more easily into en wiki citations

Pros of using a bona fide standard like CSL: it's a standard

Cons of everything listed: These are all machine readable, wikidata properties tend to be human readable (i.e. "encyclopediaArticle", "entry-encyclopedia", and "encyclopedia" are three types that might be called "Encyclopedia article" on wikidata instead.)

Also, wikidata has the property "subclass of" so we could make this hierarchical too (i.e. "AV media" includes both "audio": "song" and "podcast", as well as "video": "vlog" and "film") Mvolz (talk) 18:56, 22 August 2014 (UTC)

@Mvolz: The current proposal is to use any item from the work tree when referring to a work in general, and to use any item from this tree of manifestations when referring to something more specific that is bound to a physical medium. If you think that any of the items of your list is not in those trees, please add them, the more complete our vocabulary is, the more compatible with all existing standards we'll be.--Micru (talk) 21:43, 22 August 2014 (UTC)
@Micru: Thanks for linking those! Can you point me to where this discussion was happening elsewhere?

Those trees are great, but do not meet the use cases I'm thinking of in terms of coming up for source types for the purposes of this particular project. The two use cases I'm thinking of are:

a) Define a certain set of fields for each source type.

b) Be easily importable and exportable as citations in standard formats.

In the former case, clearly we cannot do that for every wikidata item on that tree, so we have to pick only a few. These would be standard fields that those source types generally use, akin to what is listed in the "books" and "periodical" task forces. Works or manifestation which were annotated with types lower in the tree would use fields defined from the nearest parent.

In the latter case, we still have to decide *which* of these many types correspond to the types we are importing or exporting. When we import a source from Zotero, for instance, we are guaranteed one of the Zotero types, and we have to decide which wikidata item to put in as a type. This is the function of the grid I've created, to make these translatable. Ease of translation may also influence which fields we pick (the former case), since each of these CSL/Zotero types have distinct although usually overlapping sets of fields. Mvolz (talk) 10:27, 24 August 2014 (UTC)

@Mvolz: There is no need to decide on a few; you can map all elements 1:1, and when exporting you can collapse the tree to the nearest matching element of any particular standard. For instance, take a look at this autogenerated list of works. It looks for the elements that are of a certain type by a certain author, BUT if you look at any of those elements, few match exactly, probably none. The way it works is by performing a tree search, which happens in the background with no user intervention. And the same principle can be used for properties as soon as it is possible to add statements on property pages.
@Micru: Okay, so what you're saying is, when importing, use the type as written, and just make sure that all types fit into the tree somewhere, and for exporting, traverse the tree to find a matching element. Makes sense! Although the algorithm isn't as simple as "nearest"- maybe the exporting algorithm should only go *up* the tree and not down it, since it is better to be less specific than more. The other issue is that subclass of isn't classical inheritance; one class can have multiple parents. Do we pick the closest parent? That is not always the best one. I notice in the tree you supplied, items with multiple parents sometimes get placed in what I'd consider the wrong section using the *its* algorithm. (This involves me having to pick out new roots a lot- I don't suppose there are different tree tools that display dupes?) In addition, the source type names don't always correspond to what they *are*- for instance, "encyclopedia" in en wiki cite templates usually actually means "encyclopedia article". So we should definitely create equivalence tables to pick the best wikidata type, or create a new one as needed, for each possible import/export type. Each export format has a fairly limited set of types so it shouldn't be too bad.
Properties pose more of an issue. On export and import we still expect certain fields, i.e. properties, which differ depending on source type. The same solution- as many properties as you like, arranged in a property tree- is not possible. For the purposes of many citation formats, "author" denotes *position* in the citation, and not the actual characteristics of the creator. But in actuality this information is more discrete, and various creator types get cast as author depending on the manifestation. For patents "inventor" is cast to author, for a recording of a song "performer" is the author, etc. All in all I think it's still helpful for us to list properties as in wikiproject books and periodicals, but for other types. Mvolz (talk) 21:05, 25 August 2014 (UTC)
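
The role casting described in the comment above can be sketched in a few lines of Python. This is only an illustration: the table AUTHOR_ROLE_BY_TYPE, the function citation_author, and the sample names are hypothetical, and only the two pairings (patent and inventor, song recording and performer) come from the discussion.

```python
# Hypothetical table: which creator role a citation format would print in
# its "author" position for a given source type. Only these two pairings
# are taken from the discussion; anything else falls back to "author".
AUTHOR_ROLE_BY_TYPE = {
    "patent": "inventor",
    "song recording": "performer",
}


def citation_author(source_type, creators):
    """Return the names to print in the author position of a citation.

    `creators` maps a role name (e.g. "inventor") to a list of names.
    """
    role = AUTHOR_ROLE_BY_TYPE.get(source_type, "author")
    return creators.get(role, [])


# Made-up example data.
patent = {"inventor": ["A. Inventor"], "assignee": ["Example Corp"]}
print(citation_author("patent", patent))
```

The discrete roles stay on the item; the cast to "author" happens only at export time, so no information is lost.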
@Mvolz: items with multiple parents sometimes get placed in what I'd consider the wrong section: then we'll have to clean up the tree. Sometimes it is just because an item is representing two concepts, those should be split. You can also intersect trees, so you make sure that you are taking an item from a certain tree.
As for properties, take a look at the existing ones, and if you find something missing you can propose it, or create a Wikiproject and place there the specific properties for that field.--Micru (talk) 06:59, 26 August 2014 (UTC)
There have been conversations everywhere... at Help talk:Sources, at the book task force, on property creation, etc. I think I will open a new RFC so we can address the last remaining concerns in a central location before bringing forward a new revision of Help:Sources.--Micru (talk) 12:09, 24 August 2014 (UTC)
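
The export idea from this thread (walk only *up* the "subclass of" tree until a type known to the target standard is found) can be sketched as a breadth-first search. This is a minimal sketch, not the tree tool mentioned above: the SUBCLASS_OF edges are a hypothetical toy hierarchy built from the "AV media" example, and the parents given for "encyclopedia article" are invented to show the multiple-parent case.

```python
from collections import deque

# Hypothetical "subclass of" edges: child -> list of parents.
SUBCLASS_OF = {
    "vlog": ["video"],
    "film": ["video"],
    "video": ["AV media"],
    "podcast": ["audio"],
    "song": ["audio"],
    "audio": ["AV media"],
    "AV media": ["work"],
    "encyclopedia article": ["article", "entry"],  # two parents
    "article": ["work"],
    "entry": ["work"],
}


def nearest_export_type(item_type, export_types):
    """Search upwards only; return the closest type found in export_types.

    Returns item_type itself if the standard knows it, or None if no
    ancestor matches. Ties between equally distant parents are broken by
    the order in which the parents are listed.
    """
    queue = deque([item_type])
    seen = {item_type}
    while queue:
        current = queue.popleft()
        if current in export_types:
            return current
        for parent in SUBCLASS_OF.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return None


print(nearest_export_type("vlog", {"video", "work"}))
```

Because ties depend on list order, the nearest parent is not always the best one, which is the case an explicit equivalence table per export format would be meant to handle.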

Project to associate drugs with interactions

I cross posted this information to English Wikipedia's WikiProject Medicine at en:Wikipedia_talk:WikiProject_Medicine#Wikidata_project_to_associate_drugs_with_interactions.

A new Wikipedian, Alepfu, is a researcher who would like to make a database connecting information about drug interactions to Wikidata items for drugs. This would mean that for Wikidata items like (RS)-warfarin (Q407431), there would be a Wikidata property, significant drug interaction (P769), which makes a claim that some other drug interacts with Warfarin. This project could stand alone.

Something interesting is that it is proposed that this project also tie sources to the claims, as in the case of High-priority drug–drug interactions for use in electronic health records (Q17505343) being used to back claims about drug interactions of Warfarin. That source has been integrated into Wikidata.

Alepfu's project is still running test edits, but comments would be welcome now on the concept at d:Wikidata:Requests for permissions/Bot/AlepfuBot. I would love for it to be possible for projects such as this to be able to import citations as they make claims. Blue Rasberry (talk) 20:06, 26 August 2014 (UTC)

Could DrugBank help? They have an open database of drug interactions HLHJ (talk) 22:28, 31 January 2015 (UTC)

the books logo

There are some unused variations of the Wikimedia community logo available for some project to adopt as its own. This Wikidata Source project could become big. If members want a logo, this books logo seems appropriate. Blue Rasberry (talk) 16:15, 5 September 2014 (UTC)

@Bluerasberry: That logo is already taken ;) See: meta:Wikisource Community User Group.--Micru (talk) 09:55, 6 September 2014 (UTC)
That's awesome! I will think about a logo for this group. Blue Rasberry (talk) 14:38, 10 September 2014 (UTC)

Should there be a new property "sortkey" for personal names?

Discussion opened at Wikidata:Project_chat#Sortkey of interest to this project:

Should there be a new property "sortkey" (or some better name to be suggested), to record personal names in <Family Name>, <Given Name> <Middle Names>, <title> format, as available from Persondata items on en: and de-wiki, as often used for sorting/indexing, and as often needed in citation schemes?

Follow-ups to Project Chat. Jheald (talk) 22:38, 20 September 2014 (UTC)
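
For illustration, the proposed format can be produced with a small Python function. This is a sketch only: the function name sortkey, its parameters, and the sample names are hypothetical, and it simply joins the parts in the order given in the proposal.

```python
def sortkey(family, given, middle=(), title=""):
    """Build '<Family Name>, <Given Name> <Middle Names>, <title>'.

    Empty parts are skipped, so a person with a single name still gets
    a usable key.
    """
    personal = " ".join([given, *middle]).strip()
    parts = [family, personal]
    if title:
        parts.append(title)
    return ", ".join(part for part in parts if part)


# Made-up names, sorted by their keys as a citation index would.
people = [
    sortkey("Roe", "Richard"),
    sortkey("Doe", "Jane", ["Quincy"], "Dr."),
]
print(sorted(people))
```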

RfC: IEG grant for this project?

@Maximilianklein:

So, we've talked before about actually using source metadata on wikis as citations.

I've drafted a proposal on IdeaLab for this as well as (one) possible implementation: https://meta.wikimedia.org/wiki/Grants:IdeaLab/Tools_for_using_wikidata_items_as_citations

Please comment, and feel free to edit as well. Does anyone feel like doing this for this round's Individual Engagement Grants? Project proposals for this round are due Sept. 30.

Mvolz (talk) 22:27, 22 September 2014 (UTC)

@Mvolz: brilliant :) but isn't a proper data model to represent various types of source metadata in Wikidata a precondition for this to happen? -DarTar (talk) 07:32, 23 September 2014 (UTC)
@DarTar: I think for the most part the properties are actually pretty comprehensive at this point- and we can always request more as we go along :). Is there anything about the data model right now that you find particularly problematic? Mvolz (talk) 14:47, 23 September 2014 (UTC)
I've added "community organiser" roles to the IEG in case anyone would be interested in doing that kind of role. Also looking for people with experience in or interest in writing templates or accessing wikidata programmatically. Any recommendations of people that might be interested? Mvolz (talk) 18:50, 26 September 2014 (UTC)

New property "manifestation of"

manifestation of (P1557). Thanks Micru! Mvolz (talk) 23:44, 11 October 2014 (UTC)

Launch of WikiProject Wikidata for research

Hi, this is to let you know that we've launched WikiProject Wikidata for research in order to stimulate a closer interaction between Wikidata and research, both on a technical and a community level. As a first activity, we are drafting a research proposal on the matter (cf. blog post). It would be great if you would see room for interaction! Thanks, --Daniel Mietchen (talk) 01:46, 9 December 2014 (UTC)

Database size

We may need a bigger database. This project was originally started with the intention of bringing in sources that wikipedians have not yet cited, and not only making them easier to cite, but perhaps automatically adding useful information to other projects (as the Open Access Media Importer Bot has). If we (just as an example) include just all open-source peer-reviewed articles, we will need to add more than one article per minute. Even ignoring really low-quality publications, as suggested in Open Access Reader, there are a lot of excellent sources out there. We could make a fantastically useful database, but run into technological constraints. Does anyone have a doc on exactly what the size constraints of the current database are? HLHJ (talk) 17:49, 1 February 2015 (UTC)

To get a minimal estimate of the number of entries we need, I did a quick survey with the pseudorandom article link. I was going to do ten articles, but the first ten had no sources that were books or academic articles, so I did 25; it's way too few, but this is just a rough estimate. Please consider this a pilot study and feel free to do a proper survey. Here is the R script that enters the data and gives basic stats, with the results included as comments:

#database references and census references
c(1,0,0,0,2,0,0,0,0,0,0,1,0,5,0,0,0,0,2,3,3,4,0,1,0)->db

#news articles (New York Times articles, etc.)
c(0,3,0,0,0,0,0,2,0,0,2,0,0,3,2,0,0,0,0,1,10,0,0,1,0)->n

#scholarly sources like books and academic articles
c(0,0,0,0,0,0,0,0,0,0,0,0,4,0,0,0,0,3,0,12,21,0,1,0,3)->s

total=s+n+db
total
# 1  3  0  0  2  0  0  2  0  0  2  1  4  8  2  0  0  3  2 16 34  4  1  2  3

summary(total)
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#    0.0     0.0     2.0     3.6     3.0    34.0 
summary(db)
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   0.00    0.00    0.00    0.88    1.00    5.00
summary(s)
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   0.00    0.00    0.00    1.76    0.00   21.00 
summary(n)
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   0.00    0.00    0.00    0.96    1.00   10.00

Not really enough data. But roughly, of articles on the English Wikipedia, it seems that there is more than one relevant source per article (ignoring sources that have little obvious structured metadata beyond a URL). The split between news articles, references to entries in online databases, and books and scholarly articles may or may not be about even.

Let's assume about one source per article. That's about five million sources from the English Wikipedia alone, and, wildly assuming other languages are similar, millions more from them. So "on the order of millions, ish, maybe into the tens of millions" is a conservative estimate of the absolute minimum number of sources we would want to enter into the database. How does it compare to the maximum number of entries technically possible in Wikidata? Someone, I think Daniel Kinzler, told me this verbally at Wikimania, but I don't recall the number. HLHJ (talk) 13:29, 16 May 2015 (UTC)
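
As a cross-check of the arithmetic, the same pilot data can be scaled up in Python. This is a rough sketch under the post's own assumptions: ARTICLES_EN is the "about five million" figure from the comment above, not a measured count, and 25 articles remain far too few for a reliable estimate.

```python
from statistics import median

# Pilot data copied from the R script above (25 random articles).
db = [1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 1, 0, 5, 0, 0, 0, 0, 2, 3, 3, 4, 0, 1, 0]
news = [0, 3, 0, 0, 0, 0, 0, 2, 0, 0, 2, 0, 0, 3, 2, 0, 0, 0, 0, 1, 10, 0, 0, 1, 0]
scholarly = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 3, 0, 12, 21, 0, 1, 0, 3]

total = [d + n + s for d, n, s in zip(db, news, scholarly)]

ARTICLES_EN = 5_000_000  # assumption taken from the post

# Conservative figure used in the post: one source per article.
conservative = 1 * ARTICLES_EN
# Scaling the sample mean instead (integer arithmetic avoids float noise).
from_sample_mean = sum(total) * ARTICLES_EN // len(total)

print(sum(total) / len(total), median(total))  # mean and median per article
print(conservative, from_sample_mean)
```

The sample mean is pulled up by two heavily referenced articles (16 and 34 sources), which is why the post's figure of one source per article is the more cautious floor.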

@HLHJ: Most of the sources are not specific to one language: in the case of English you will have more than 5 million documents, but you can't just multiply this number by the number of WPs or languages. In most cases, such as scientific documents, you will find only a limited number of English documents reused in all WPs. So my estimate is below 50 million for the documents used as references in all WP articles. Then there is the question of all documents published: WD won't list all documents; a natural filtering will occur and only pertinent docs will be imported (meaning documents which are used as references). For books we will certainly have most of the work items but not all editions. For articles, only the ones which can be used as a source will be added. Personally I am more afraid of initiatives like the import of street data, where no filtering is applied. Snipre (talk) 15:58, 18 May 2015 (UTC)
@Snipre: (Originally posted in French; this is the author's own English translation.) Thank you for the correction. You're quite right, it was thoughtless of me to omit language overlap; some of the sources cited in the English Wikipedia are written in other languages, too. I spoke to User:Magnus Manske recently and he said he recalled a rather better estimate coming up with something on the order of tens of millions, so size should not be a problem. He also strongly suggested that we consider starting a specialized sister project; he said that it was never intended that Wikidata should be the only Wikimedia implementation of the Wikibase software. I think he has a point. For things like this (or, I imagine, street data like OSM has) a sister project might work well, and it would certainly reduce the risk of making a mess that the Wikidata community would have to clean up. We have a well-defined task and a clear use case, like Wikispecies, and so hopefully we'd be able to form a similarly sociable community. I'm going to suggest this to the project at large. HLHJ (talk) 16:36, 25 May 2015 (UTC)
Just to clarify, if we have separate wikibase installations for things like Commons images and Wikiquotes (and we will/we should), then storing all sources (and likely many more publications, even if not used as a source) would also fit into a new project rather than Wikidata itself, IMO. --Magnus Manske (talk) 19:53, 25 May 2015 (UTC)
Taking the (very) long view: this will overlap strongly with Wikisource, because all our sources will eventually be copyright-expired (and of course many are already; or are PD or open access). Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 21:46, 25 May 2015 (UTC)
It is too early to split the current Wikidata into several connected databases. WD should first reach a stable version, which is not yet the case. We need all datatypes defined at the beginning (the numeric one with units is still missing), arbitrary access, and the query tool to create lists of items. Then we need a first period of use in Wikipedia (infoboxes) in order to test the use of WD. Only after these steps can we think about splitting data between different databases. I really don't know if it is possible to call different databases when opening a WP article. Or do we have to use WD as an intermediary between WP and other databases? This should be discussed with the development team first. Snipre (talk) 09:58, 26 May 2015 (UTC)
I have some interest in this from the point of view of neuroimaging and related areas. The relevant number of articles is only in the 100s to 10,000s, but I would be interested in representing the data in the articles in a Wikibase installation. There could potentially be 50 items per paper. I do not know if Wikidata would be the appropriate place. — Finn Årup Nielsen (fnielsen) (talk) 14:19, 17 September 2015 (UTC)

PMID

It would be a powerful research tool to add an item for each PMID to Wikidata. I think that this would be an important first step to bringing in a wider audience of academics.

The data is nicely structured as XML such as seen here [1] and should be fairly easy for a bot to create items. We just need to decide what items we all wish to have the bot pull in. Doc James (talk · contribs · email) (if I write on your page reply on mine) 09:50, 2 October 2015 (UTC)
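
As a rough illustration of what such a bot would read, the Python sketch below extracts a few fields from one record. The element names (PubmedArticle, MedlineCitation, PMID, ArticleTitle, AuthorList and so on) are assumed to follow the usual PubMed XML export and should be checked against a real record; the SAMPLE record and the function parse_record are hypothetical, and all values are placeholders.

```python
import xml.etree.ElementTree as ET

# Hypothetical record with placeholder values, shaped like PubMed XML.
SAMPLE = """<PubmedArticle>
  <MedlineCitation>
    <PMID>00000000</PMID>
    <Article>
      <Journal><Title>Example Journal</Title></Journal>
      <ArticleTitle>An example article title</ArticleTitle>
      <AuthorList>
        <Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author>
        <Author><LastName>Roe</LastName><ForeName>Richard</ForeName></Author>
      </AuthorList>
    </Article>
  </MedlineCitation>
</PubmedArticle>"""


def parse_record(xml_text):
    """Pull out the fields a bot might turn into statements on an item."""
    root = ET.fromstring(xml_text)
    authors = [
        f"{a.findtext('ForeName', '')} {a.findtext('LastName', '')}".strip()
        for a in root.iter("Author")
    ]
    return {
        "pmid": root.findtext(".//PMID"),
        "title": root.findtext(".//ArticleTitle"),
        "journal": root.findtext(".//Journal/Title"),
        "authors": authors,
    }


print(parse_record(SAMPLE))
```

Which of these fields become statements, and with which properties, is the open question raised above.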

  • seems like a very good idea ... allow us to have a bot that creates a list of all RCTs within medical articles for which there is a systematic review that contains the RCT in question. This will make updating / replacing RCTs on WP easier.[2]--Ozzie10aaaa (talk) 20:03, 2 October 2015 (UTC)
  • As discussed here, this proposal is NOT to have WPs references be pulled from Wikidata. IMHO, WikiData should not replicate data already found in external databases such as w:AdsAbs, w:arXiv, w:CrossRef, and w:PubMed. Citation bot queries these databases to automatically create new citations as needed. These external databases maintain this data better than WikiData could ever hope to. WikiData should concentrate on unique data not contained in these external databases such as translation of titles (e.g., |trans-title=, |trans-chapter=) into the native language for each of the official Wikipedias and corrections of any errors in the external databases. (It should also be noted that PubMed corrects errors if they are pointed out to them.) Boghog (talk) 09:48, 3 October 2015 (UTC)
These external databases don't really know anything about the authors of individual papers. Wikipedia on the other hand can link papers with authors. If I look up a university professor in Reasonator all the papers that the professor wrote and that are linked in Wikidata are listed. The paper-author link would also add value for citation managers like Zotero that integrate Wikidata. Having such a feature in Zotero would also encourage Zotero users to browse author items on Wikidata and add useful information such as academic affiliations. Linked data provides synergy in many ways. ChristianKl (talk) 14:04, 17 July 2016 (UTC)
  1. Add some citation data for some citations
  2. Add that same amount of citation data for the rest of the citations (all 38 million)
  3. Add the rest of the National Institutes of Health citation data for all these references
  4. Openly encourage anyone to add metadata to the citations. Much of the interest will be in categorizing how sources are used when cited by other sources.
  5. Far in the future beyond what can be scheduled, eventually, Wikimedia projects call citations from Wikidata. I think there is agreement that if and when this happens it will be more than 5 years from now and happen in a way that the current software does not anticipate. Still, the citation database here will never go obsolete, because we are talking about only building a dataset.
The potential drawback here is deciding whether we want an additional number system for large special sets, like citations, astronomical objects, or proteins. Citations is the smallest of these but still 38 million is big compared to Wikidata. Blue Rasberry (talk) 16:46, 5 October 2015 (UTC)
Yes agree. I think it should be within this one. We will want to link authors names etc. Doc James (talk · contribs · email) (if I write on your page reply on mine) 13:12, 6 October 2015 (UTC)
Exactly. And that would be useful to researchers. Doc James (talk · contribs · email) (if I write on your page reply on mine) 04:38, 4 November 2015 (UTC)

Example Q16749642

I tried to improve this example entry by adding metadata for the print edition of the article, which has its own (different) title, publication date, and page number. It might also be published in only a part of the NYT, the New York edition. I failed with some of those. So could someone please add the missing information, as in:

A version of this article appears in print on May 5, 2014, on page B8 of the New York edition with the headline: Gary S. Becker, 83, Nobel Winner Who Applied Economics to Everyday Life.

which can be found at the end of the online version of the article. --Matthiasb (talk) 11:13, 7 October 2015 (UTC)

Example of using Wikidata as cite database

Lua time usage: 0.639/10.000 seconds
Lua memory usage: 4.21 MB/50 MB
Number of Wikibase entities loaded: 85
Transclusion expansion time report (%,ms,calls,template)
49.06%  743.072     22 - Шаблон:Source

-- VlSergey (трёп) 12:48, 15 December 2015 (UTC)

Annual Bibliography on German History (1949-2015)

A dataset of nearly 760,000 items has been published under CC0, cf. the announcement (in German). As those are valuable bibliographical data for scholarly publications, is it possible to import them into Wikidata? --HHill (talk) 10:53, 19 February 2016 (UTC)

Wikidata Game Source MetaData

Magnus wrote a module to link scientific paper authors to their Wikidata entries, as part of the Distributed Wikidata Game. The game is here: https://tools.wmflabs.org/wikidata-game/distributed/#game=9 . Its source code (and issue tracker) is here: https://bitbucket.org/magnusmanske/sourcemd/src/master/public_html/sourcemd_distributed_game_api.php

This is not (yet) linked from the description, so to help other confused searchers, I thought I'd mention it here (due to the matching name). JesseW (talk) 08:13, 13 March 2016 (UTC)

WikiCite applications closing soon

A reminder that open applications for WikiCite 2016 – an event that should be of interest to members of this WikiProject – close this Monday, April 11. We have a limited number of travel grants to support qualified participants. If you wish to join us in Berlin to participate in either the data modeling or the engineering effort, please submit an application --DarTar (talk) 16:09, 9 April 2016 (UTC)

Working backwards from metadata to DOI

Currently I am importing NIOSH's database of research and while I have DOIs for many of the articles, I do not have DOIs for all of them, even when I know DOIs exist. Is there a reasonably reliable, (semi-)automatic way to work backwards from journal article metadata (in the form of citations) to see if there are DOIs I can insert into Wikidata items? Once I have this, I can then use the Source Metadata tool to add the metadata fields. James Hare (NIOSH) (talk) 14:25, 27 June 2016 (UTC)

For articles known to PMC, there is the ID converter API. Otherwise, it's difficult, which is one of the reasons I am focusing on articles from PMC for now. --Daniel Mietchen (talk) 21:44, 28 June 2016 (UTC)
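
For records outside PMC, one semi-automatic fallback is to compare citation titles against candidate records that already carry DOIs and accept only near-exact matches. This is not a method proposed in the thread, just a minimal Python sketch of the idea: the helper names, the 0.9 threshold, the titles, and the DOIs are all hypothetical, and where the candidate list comes from is left open.

```python
import re
from difflib import SequenceMatcher


def normalise(title):
    """Lower-case a title and collapse punctuation and spacing."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def best_doi(citation_title, candidates, threshold=0.9):
    """Return the DOI of the best-matching candidate, or None.

    `candidates` is a list of (title, doi) pairs. Anything scoring below
    `threshold` is rejected so that doubtful matches go to human review.
    """
    wanted = normalise(citation_title)
    best_score, best = 0.0, None
    for title, doi in candidates:
        score = SequenceMatcher(None, wanted, normalise(title)).ratio()
        if score > best_score:
            best_score, best = score, doi
    return best if best_score >= threshold else None


# Made-up candidates with placeholder DOIs.
candidates = [
    ("Noise exposure among construction workers", "10.0000/example.1"),
    ("Silica dust in mining", "10.0000/example.2"),
]
print(best_doi("Noise Exposure Among Construction Workers.", candidates))
```

A high threshold trades recall for precision, which seems the safer direction when the result is written into an item as an identifier.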

Applying controlled vocabularies to items about scholarly works

One value I see in uploading source metadata to Wikidata is the ability to describe works in terms of the controlled vocabulary of Wikidata. We indeed already have the main subject (P921) property. I was wondering what other sorts of properties might look like. "Main subject" should be very general, while I think a "keywords" property would be really too vague. I understand that a proposal for "mentions" was rejected in the past. For my work specifically, I am interested in associating works with the following:

  • Types of workplace exposures and hazards
  • Chemicals
  • Affected industries and occupations
  • Diseases and other biohazards
  • Equipment
  • Demographic factors such as gender and ethnicity
  • Types of medical disorders; affected parts of human anatomy
  • Types of emergencies

While having properties for all these things might be hyper-specific, is there a good strategy for coming up with properties that are suitable for associating works with concepts? James Hare (NIOSH) (talk) 15:20, 28 June 2016 (UTC)

I suggest you take an existing item and annotate it at your desired level of detail using just main subject (P921). On that basis, we can then discuss whether there are better ways of doing that, and whether or how they scale. --Daniel Mietchen (talk) 21:46, 28 June 2016 (UTC)
Reading the alternative and non-English labels of main subject (P921), I would not stress "main" too much. As long as you don't start to include all minor side aspects, it's a good choice. By the way, we could add qualifiers for more precision. -- JakobVoss (talk) 12:43, 30 June 2016 (UTC)
Authority properties are also (mis?)used to indicate the subject: see The Party Journalist (Q25555) for an example with DDC, UDC and RVK notations. Each of these notations refers to a subject that may also have an item of its own in Wikidata. For instance DDC 943 is "history of Germany", so using Dewey Decimal Classification (P1036) on a work with value 943 should imply main subject (P921) with value history of Germany (Q122131) (and vice versa). For scientific articles you often have Medical Subject Headings (Q199897)=MeSH descriptor ID (P486) or similar authority files (instances of subclasses of knowledge organization system (Q6423319)) -- JakobVoss (talk) 12:58, 30 June 2016 (UTC)
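
The implication described above (a DDC notation on a work implies a main subject) can be sketched as a lookup in Python. Only the 943 row comes from the comment; the table DDC_TO_SUBJECT, the function name, and the simplified claims dictionary are hypothetical stand-ins for real item data.

```python
# Notation -> subject item. Only this row is taken from the discussion.
DDC_TO_SUBJECT = {
    "943": "Q122131",  # history of Germany
}


def implied_main_subjects(claims):
    """Return main subject (P921) values implied by DDC (P1036) values.

    `claims` is a simplified dict of property ID -> list of values.
    Subjects already present under P921 are not repeated.
    """
    existing = set(claims.get("P921", []))
    implied = []
    for notation in claims.get("P1036", []):
        subject = DDC_TO_SUBJECT.get(notation)
        if subject and subject not in existing and subject not in implied:
            implied.append(subject)
    return implied


print(implied_main_subjects({"P1036": ["943"]}))
```

The reverse direction (main subject implying a notation) would need the inverted table and is omitted here.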

Add reference items via ProveIt

Hi there! ProveIt is a Wikipedia gadget that scans the wikitext of any article you're editing and shows a friendly interface to create, update and delete references. Some weeks ago I got a grant to enhance the gadget and perhaps the coolest feature I want to add involves connecting it with Wikidata so that new references inserted with the gadget are saved as items here, so that the next time a user wants to reference the same work, the gadget autocompletes the fields. This way the data only needs to be entered once, and future users will only reuse, correct and extend existing data.

I announced this project in the community chat (archive) and someone linked to this project. I'll be working on the gadget for the next few months and I will post updates here. I may also request some help or guidance, especially during October and November, when I integrate it with Wikidata. Until then, any early comments are more than welcome; just be sure to tag me so that I get notified. Cheers! --Felipe (talk) 16:48, 20 July 2016 (UTC)

This sounds like an interesting project. — Finn Årup Nielsen (fnielsen) (talk) 18:17, 20 July 2016 (UTC)

Proposal for Librarybase: an online reference library

Hello everyone. I have submitted a grant proposal for Librarybase, an online reference library. My goal is to create a unified lookup database for sources based on source metadata gathered from Wikipedia. I think there is in particular a good opportunity for Librarybase to make sure open access research is adequately represented on Wikidata. It could also help make it easier for editors to discover open access research. Please review and leave your feedback on the grant proposal. Harej (talk) 00:28, 3 August 2016 (UTC)

Sources used as references on Wikidata

Not much to see yet in a query for sources used as references on Wikidata. Hope this will change soon. --Daniel Mietchen (talk) 16:35, 18 August 2016 (UTC)

Adding references to uses of cites work (P2860)

Currently cites work (P2860) is used to describe around 360,000 citation relationships between publications. For example, of the 179,202 instances of scholarly article (Q13442814), 239 of them cite The Hallmarks of Cancer (Q221226), a hallmark paper published in 2000. I've been adding these relationships in bulk using public domain resources PubMed Central (Q229883) and OpenCitations Corpus (Q26382154). (I haven't been creating new items for this purpose; I skip over citations if they cannot be represented on Wikidata using the papers already on there.) I am wondering if I should add references to these claims, in the form stated in (P248) → PubMed Central (Q229883) or stated in (P248) → OpenCitations Corpus (Q26382154). I think it would be good to document the provenance of the data, especially if it does not come directly from the paper. What do people think of this? Harej (talk) 23:17, 21 August 2016 (UTC)

I think it would be ok. I suppose it does make the page somewhat larger. I hope that is not a problem. — Finn Årup Nielsen (fnielsen) (talk) 08:08, 22 August 2016 (UTC)
I'm all for better documenting the provenance of the data stored here, but I do not find stated in (P248) statements pointing generically to a corpus very useful in this regard, since they may well hamper verifiability (depending on how accessible the corpus is). For the case of PMC, that's less of a problem, since PMCID (P932) should be stated somewhere on the item having the PMC-supported citation claims. For Open Citations, we do not have such a property yet, so we'd either have to create one or always use an article-specific reference URL (P854) in addition to the generic stated in (P248): OpenCitations Corpus (Q26382154). --Daniel Mietchen (talk) 09:29, 24 August 2016 (UTC)
Daniel Mietchen, publications in the OCC have a bibliographic resource (BR) identifier. Example. Could we have that as a property? And in the meantime, should I go ahead with adding references for those statements that come from PubMed Central? Harej (talk) 00:06, 25 August 2016 (UTC)
Yes, I think proposing a property for that BR ID is the way to go. I'd be happy with you going ahead with adding references to the PMC-based statements. --Daniel Mietchen (talk) 00:26, 25 August 2016 (UTC)
Property proposed. Harej (talk) 01:07, 25 August 2016 (UTC)
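The reference shape discussed in this thread, a generic stated in (P248) pointing at the corpus plus an optional article-specific reference URL (P854), could look roughly like the sketch below. It follows the Wikibase JSON layout for references as far as I understand it; the helper names are hypothetical, and the URL in the usage example is a placeholder, not a real OpenCitations address.

```python
import json


def item_snak(prop, qid):
    """Build a snak whose value is a Wikidata item."""
    return {
        "snaktype": "value",
        "property": prop,
        "datavalue": {
            "value": {"entity-type": "item", "numeric-id": int(qid[1:]), "id": qid},
            "type": "wikibase-entityid",
        },
    }


def url_snak(prop, url):
    """Build a snak whose value is a URL (serialized as a string value)."""
    return {
        "snaktype": "value",
        "property": prop,
        "datavalue": {"value": url, "type": "string"},
    }


def citation_reference(corpus_qid, article_url=None):
    """Reference for a cites work (P2860) claim: stated in, plus optional URL."""
    snaks = {"P248": [item_snak("P248", corpus_qid)]}
    if article_url:
        snaks["P854"] = [url_snak("P854", article_url)]
    return {"snaks": snaks, "snaks-order": list(snaks)}


# PubMed Central as the generic source.
pmc_ref = citation_reference("Q229883")
# OpenCitations Corpus plus a placeholder article-specific URL.
occ_ref = citation_reference("Q26382154", "https://example.org/placeholder-article")
print(json.dumps(occ_ref, indent=2))
```

If a dedicated identifier property for the corpus is created, the URL snak could be replaced by a snak on that property.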

I didn't see this proposal, or didn't understand it, until I saw a reference to the cite property posted today on Twitter. I think at a basic level the name of this property, as well as how it functions, is completely confusing. This is my continued concern about the usage of terms that mean other things on other Wikimedia projects, i.e., English Wikipedia. So having this term and having it populate the way it does is very problematic. Dario recommended I come over here and discuss my concern rather than on Twitter, which is why I am writing this here. But I understand that this is probably implemented under WP:BOLD, etc. Just wanted to express concern here. -- Erika aka BrillLyle (talk) 18:20, 10 November 2016 (UTC)

I concur with Erika that we're doing a pretty poor job documenting these proposals. I posted the other day on the list a call for help to improve the documentation and tease apart discussions from data models that have somehow been blessed or approved by a substantial part of the community. I'll try and spend some time over the next weekend to spec out a proposed structure for these separate functions. On the specific choice of the English label for cites work (P2860): I personally don't feel it's ambiguous or misnamed, it represents "a citation between two works", a relation that most knowledge bases and ontologies call in a similar way. The problem of how to represent citations of a work from a Wikipedia article is an important but also a different one, and right now a statement using cites work (P2860) to link two items that represent works is doing a correct job. The use of stated in (P248) to represent the provenance of a claim that a citation between two works exists may be confusing. I agree with you that fundamental issues related to the modeling of provenance should be discussed in an appropriate place that's not buried in a talk page--DarTar (talk) 06:44, 15 November 2016 (UTC)

Umbrella WikiCite project on Meta

Heads up that I finished refactoring a few pages on Meta related to this WikiProject as follows:

For generic resources related to the WikiCite proposal (i.e. not individual events) please use the parent page from now on. Unfortunately, the ping project template is not available on Meta so I'm unable to create a centralized page for people interested in the proposal across projects. Also a quick reminder that we have distinct Twitter handles: @WikiCite for the parent initiative, @WikiCite16 for the 2016 event.--DarTar (talk) 19:09, 5 September 2016 (UTC)

General query on adding references to source metadata

@Daniel Mietchen: has a request for bot flag here for "Research Bot" which has been adding a lot of items for scientific articles. I raised the question of references for the claims on the article itself - for example see How Did Zika Virus Emerge in the Pacific Islands and Latin America? (Q27468554). Obviously if the article has a working external ID (DOI, PubMed ID, etc) then the claims can be verified just by following that link. Nevertheless it seems like many of the claims, such as for title (P1476), publication date (P577), published in (P1433), page(s) (P304) etc, really should have their provenance recorded within Wikidata. What if one of the external IDs becomes invalid at some point, or some of the metadata changes in some way - say there is a correction or a retraction of the article and there is actually different metadata provided by two different external IDs for the same article - how do we handle that? And in general, Help:Sources seems pretty clear when it says "The majority of statements on Wikidata should be verifiable insofar as they are supported by referenceable sources of information [...] references are used to point to specific sources that back up the data provided in a statement." ArthurPSmith (talk) 15:17, 20 October 2016 (UTC)

Input needed on the best way to connect ProveIt with Wikidata

Hi! ProveIt is a powerful gadget that adds a visual interface to better manage references when editing Wikipedia articles. So far it has been enabled on the Spanish and French Wikipedias, and the English, Russian, Italian and Portuguese are well underway. As part of a grant to enhance the gadget, I'm trying to connect it with Wikidata, so that references are autocompleted with data taken from here, and so that the gadget can be used to improve Wikidata items. But the integration is challenging, so I started a task at Phabricator to get some input on the best way to do it. Please, take a look and let me know your thoughts. Thanks! --Felipe (talk) 17:46, 23 October 2016 (UTC)

How to model joint publications?

Sometimes, different outlets publish the same content, e.g. to signal some form of consensus to their wider community. How should we model that? To stimulate discussion, I have created a test set of presumably identical articles published in several journals (Defining Authorship for Group Studies (Q27556360)/ Defining Authorship for Group Studies (Q27556362)/ Defining Authorship for Group Studies (Q27556470)) and linked them via said to be the same as (P460). --Daniel Mietchen (talk) 21:44, 25 October 2016 (UTC)

One way would be to model it similarly to the case of multiple editions of a book. So create an item for the general article, no matter where published, and link its publications via edition or translation of (P629)/has edition or translation (P747). -- JakobVoss (talk) 11:41, 27 October 2016 (UTC)
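As a rough illustration of the edition-style modeling suggested above, the sketch below generates the pairs of statements that would link one general work item to the three test items. The function name is hypothetical and `WORK` is a placeholder string, since no item for the general article is named in this discussion.

```python
# QIDs of the three test items from the discussion above.
PUBLICATIONS = ["Q27556360", "Q27556362", "Q27556470"]

# Placeholder for the general article item; no such item is named in the thread.
WORK = "Q_WORK"


def edition_statements(work, publications):
    """Return (subject, property, value) triples linking a work and its publications."""
    triples = []
    for pub in publications:
        triples.append((pub, "P629", work))  # edition or translation of
        triples.append((work, "P747", pub))  # has edition or translation
    return triples


for triple in edition_statements(WORK, PUBLICATIONS):
    print(triple)
```

Compared with pairwise said to be the same as (P460) links, this needs two statements per publication rather than one per pair of publications.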

Modeling papers in conference proceedings

I started working on papers in conference proceedings and I'd like some input on bibliographic data modeling decisions. Here's an example, which I think is problematic the way it's currently modeled:

Questions

Now, a couple of questions:

  • is it appropriate for the volume to be considered an instance of proceedings (Q1143604) or should it be an instance of a (currently missing) proceedings volume?
It is fine to consider it an instance of proceedings (Q1143604) - however if it has been also published as a book (which is relatively common in academic conferences) then I would also assign it as instance of book (Q571). However it is possible that a proceedings of a conference is actually published in two distinct books. Thus, to handle everything properly, you should always refer to two distinct (FRBR...) levels: the one talking about proceedings (independently from the actual physical/electronic manifestation they have), and the other one about the embodiment they may have. Essepuntato (talk) 08:28, 28 October 2016 (UTC)
A proceedings cannot be part of any event, while it should be related to an event such as a conference. Of course, the proceedings can be part of a series (e.g. ISWC 2016 Proceedings part of the LNCS series). Essepuntato (talk) 08:28, 28 October 2016 (UTC)
Something general like "related to", or something like "has related event", which allows you to link the proceedings to the particular conference edition (e.g. ISWC 2016), and even the conference edition to the general conference series (e.g. ISWC). Obviously, to be more precise, we would need two distinct properties for that, such as "relates to conference edition" (from the paper to the conference edition) and "relates to conference series" (from the conference edition to the conference series). But still, I don't know if we would need such structured information at this stage. Essepuntato (talk) 08:28, 28 October 2016 (UTC)
It should be an instance of academic conference (Q2020153) - and eventually related to a specific convention series (Q15900647). Essepuntato (talk) 08:28, 28 October 2016 (UTC)

Some librarian help would be very welcome! --DarTar (talk) 04:19, 28 October 2016 (UTC)

See also:

Further notes and questions

Conference proceedings are normally of one particular dated conference, in an (at least intended) series of conferences, no? This must of course be distinct from proceedings of some institution, such as PNAS: these are better described as a particular type of journal. There are of course other cases. One of particular complexity is [3], where one society (AIP) runs multiple series of conferences, then publishes all of them as one unified serial. In this example, one volume of the serial publication contains the proceedings of one annual meeting of one serial conference. LeadSongDog (talk) 14:39, 28 October 2016 (UTC)
As I read proceedings (Q1143604) and its label in French and its description in English, French and German, the item is describing conference proceedings, not something like PNAS. That is also the interpretation I get when skimming the items. Perhaps the English label should be changed? — Finn Årup Nielsen (fnielsen) (talk) 16:13, 28 October 2016 (UTC)
@Fnielsen: agreed, I support changing the label to clarify the scope.--DarTar (talk) 08:06, 30 October 2016 (UTC)

Data model proposal

@Fnielsen, Bluerasberry, Essepuntato, TomT0m, LeadSongDog: thanks a lot for your input. I created a new class for conference proceedings series (Q27785883), which now allows us to represent all the relations relevant to conference proceedings as follows:

A. Paper in conference proceedings volume

From Freebase to Wikidata: The Great Migration (Q24074986)

instance of (P31) → scholarly article (Q13442814)
published in (P1433) → Proceedings of the 25th International Conference on World Wide Web (Q27187137)
DOI (P356) → 10.1145/2872427.2874809
ACM Digital Library citation ID (P3332) → 2874809

Notes

B. Conference proceedings volume in conference proceedings series

Proceedings of the 25th International Conference on World Wide Web (Q27187137)

instance of (P31) → proceedings (Q1143604)
part of the series (P179) → Proceedings of the International Conference on World Wide Web (Q27680385) (qualifier: series ordinal (P1545) → 25)
ACM Digital Library citation ID (P3332) → 2872427

Notes

  • the property representing the association between a conference proceedings volume and a single conference (e.g. 25th International Conference on World Wide Web (Q27786154)) is currently missing.
  • the item for this volume has an ISBN, which is a possible FRBR violation, per Essepuntato, but the priority should be creating items for the volume as the venue for a paper (work), not the associated book (edition).

C. Conference proceedings series

Proceedings of the International Conference on World Wide Web (Q27680385)

instance of (P31) → conference proceedings series (Q27785883)
has part(s) (P527) → Proceedings of the 25th International Conference on World Wide Web (Q27187137)

Notes

  • the property representing the association between a conference proceedings series and a conference series (e.g. The Web Conference (Q3570023)) is currently missing.

D. Conference (event)

25th International Conference on World Wide Web (Q27786154)

instance of (P31) → academic conference (Q2020153)
part of the series (P179) → The Web Conference (Q3570023) (qualifier: series ordinal (P1545) → 25)

Notes

  • the property representing the association between a single conference and the corresponding conference proceedings volume is currently missing.

E. Conference series (event series)

The Web Conference (Q3570023)

instance of (P31) → convention series (Q15900647)
has part(s) (P527) → 25th International Conference on World Wide Web (Q27786154)
ACM Digital Library event ID (P3333) → RE334

Notes

  • the property representing the association between a conference series and a conference proceedings series is currently missing.
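To make the five item types in sections A to E easier to compare, here is a minimal sketch that stores the item-valued statements listed above in a plain dict and follows property chains between them. The `ITEMS` layout and the `follow` helper are illustrative assumptions, not an existing tool; identifiers and qualifiers are left out.

```python
# Simplified statements from sections A-E: item -> property -> list of item values.
ITEMS = {
    "Q24074986": {"P31": ["Q13442814"], "P1433": ["Q27187137"]},  # A. paper
    "Q27187137": {"P31": ["Q1143604"], "P179": ["Q27680385"]},    # B. proceedings volume
    "Q27680385": {"P31": ["Q27785883"], "P527": ["Q27187137"]},   # C. proceedings series
    "Q27786154": {"P31": ["Q2020153"], "P179": ["Q3570023"]},     # D. conference
    "Q3570023": {"P31": ["Q15900647"], "P527": ["Q27786154"]},    # E. conference series
}


def follow(item, *properties):
    """Follow a chain of properties from `item`, taking the first value each time.

    Returns None as soon as a step has no value.
    """
    current = item
    for prop in properties:
        values = ITEMS.get(current, {}).get(prop, [])
        if not values:
            return None
        current = values[0]
    return current


# paper -> published in (P1433) -> part of the series (P179)
print(follow("Q24074986", "P1433", "P179"))
```

With only these statements, no chain leads from the paper or the proceedings volume to the conference item (Q27786154), which is the gap the notes on missing properties describe.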

Comments?

Please let me know if this approach looks sensible, if you wish to make any changes, or if you have suggestions regarding the missing properties linking a proceedings (Q1143604) to an academic conference (Q2020153) or a conference proceedings series (Q27785883) to a convention series (Q15900647).--DarTar (talk) 06:45, 8 November 2016 (UTC)

The proposal should also cover conference proceedings of events that are not part of a series. -- JakobVoss (talk) 12:59, 8 November 2016 (UTC)
Do you have any example handy? Most scholarly conferences I am familiar with are part of a series but I imagine the above proposal would apply to single conferences as is.--DarTar (talk) 02:19, 9 November 2016 (UTC)
An example of a one-off conference that has been cited in Wikipedia, and is apt to be cited in Wikidata: http://www.casinapioiv.va/content/accademia/en/publications/extraseries/calendar.html Jc3s5h (talk) 14:16, 9 November 2016 (UTC)
I would represent it as either a book (Q571) or proceedings (Q1143604), I don't see this as a problematic case. @Jc3s5h, JakobVoss: can you expand on what you see as a challenge with this type of publication?--DarTar (talk) 03:16, 10 November 2016 (UTC)
Presumably one would like to be able to use properties to link between a one-off conference and the associated proceedings. I don't know how you would do that. (talk) 12:32, 10 November 2016 (UTC)
I just noticed the note "the property representing the association between a single conference and the corresponding conference proceedings volume is currently missing." I think there should be two properties, one to link the conference to the proceedings volume, and one to link the proceedings volume to the conference. They should be designed so that they work regardless of whether the conference is part of a series or not, and whether the proceedings are part of a series or not, since the relationship between the conference organizers and the proceedings publishers seems to vary quite a bit, especially in the case of one-off conferences. Jc3s5h (talk) 12:38, 10 November 2016 (UTC)
Agreed. I'm pretty agnostic on how to represent proceedings (Q1143604) ↔ academic conference (Q2020153) and convention series (Q15900647) ↔ conference proceedings series (Q27785883) relations. I feel we should not reinvent the wheel and ideally reuse existing Wikidata properties that capture the relation between a work and an event. Any suggestion from people familiar with related Wikidata properties would be fantastic.--DarTar (talk) 18:15, 10 November 2016 (UTC)
How does the proposal relate to presentations given at an event? We have several instances of lecture (Q603773), what if one has a corresponding proceedings paper? -- JakobVoss (talk) 12:59, 8 November 2016 (UTC)
I know of lecture series published as a monograph as well as keynotes/plenary talks at conferences published as abstracts in conference proceedings. Is that what you're getting at? Any example?--DarTar (talk) 02:19, 9 November 2016 (UTC)
I added a proceedings and proceedings series section in Template:Bibliographical_properties--DarTar (talk) 05:34, 10 November 2016 (UTC)
  • Not sure. I do not find it easy to differentiate all these sorts of events and would prefer that they be contracted into a single item that would be an obvious choice for anyone to use. This seems complicated. Blue Rasberry (talk) 10:40, 11 November 2016 (UTC)
  • Comment The BIBO approach (seen here) gives the paper a "presentedAt" property to identify the linked event, while giving the event a "presents" property to identify the linked presentations. An event gets "subEvent" and/or "superEvent" properties, which might be used to place a conference in the context of a conference series. LeadSongDog (talk) 18:40, 15 November 2016 (UTC)
Interesting, looks pretty appropriate. See below for a very similar proposal (but slightly more general).--DarTar (talk) 06:15, 16 November 2016 (UTC)
If the proposal is to adopt the BIBFRAME name for things and way of organizing them then that is much easier to support. Wikidata does well to copy the established terms. Blue Rasberry (talk) 17:57, 16 November 2016 (UTC)

Notes from Wikicite