Wikidata:SPARQL federation input

One of the cool features of SPARQL is federation: it allows you to query several SPARQL endpoints together and get a combined query result. To better integrate the data available in Wikidata with other linked data sources, we plan to enable SPARQL federation on the Wikidata Query Service for a selected number of other SPARQL endpoints. For security and performance reasons, we cannot allow arbitrary endpoints without filtering, so we need a whitelist of approved endpoints. This page is for nominating and discussing which endpoints should be supported. Currently supported endpoints are listed in the User Manual.
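As a rough illustration of what a federated query looks like, the sketch below builds a query that joins a Wikidata pattern with a pattern evaluated on a remote endpoint through a SERVICE clause. The helper name `build_federated_query`, the endpoint URL `https://example.org/sparql`, and the remote pattern (which assumes the remote data links to Wikidata via owl:sameAs) are hypothetical and chosen only for illustration.

```python
# Prefixes used by the example patterns below.
PREFIXES = (
    "PREFIX wd: <http://www.wikidata.org/entity/>\n"
    "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
    "PREFIX owl: <http://www.w3.org/2002/07/owl#>\n"
)


def build_federated_query(endpoint_url, local_pattern, remote_pattern, limit=10):
    """Return a SPARQL query joining a local pattern with a remote SERVICE pattern.

    Hypothetical helper: it only assembles the query text and sends nothing.
    """
    return (
        PREFIXES
        + "SELECT * WHERE {\n"
        + f"  {local_pattern}\n"
        + f"  SERVICE <{endpoint_url}> {{\n"
        + f"    {remote_pattern}\n"
        + "  }\n"
        + f"}} LIMIT {limit}"
    )


query = build_federated_query(
    "https://example.org/sparql",   # hypothetical endpoint, not a real nomination
    "?item wdt:P31 wd:Q5 .",        # evaluated on Wikidata: items that are humans
    "?remote owl:sameAs ?item .",   # evaluated remotely: assumed link back to Wikidata
)
print(query)
```

The shared variable `?item` is what joins the two result sets; the part inside SERVICE is sent to the remote endpoint, which is why only approved endpoints can be used there.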

The suggested SPARQL endpoints must satisfy the following conditions:

  • Complies with the "query operation" part of the SPARQL 1.1 Protocol, at least to the extent necessary to make a federated SERVICE clause work (most SPARQL endpoints do). To test, the following query should work:
SELECT  ?a ?b ?c WHERE {
    SELECT ?a ?b ?c WHERE { ?a ?b ?c } LIMIT 1
} VALUES ?x { "a" }
  • Contains data that can be linked to Wikidata, i.e., it either contains Wikidata IDs or can be queried by values contained in one of the Wikidata properties.
  • Has data freely available under a license compatible with CC0 (preferred) or another free database license allowing unrestricted reuse. Attribution licenses like CC-BY are OK too. Currently, we do not accept endpoints with reuse restriction clauses like NC/ND.
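The first condition can be checked by sending the test query to the candidate endpoint as a SPARQL 1.1 Protocol "query via GET" request. The sketch below is one possible way to do that with the Python standard library; the function names `build_test_url` and `run_test` and the URL `https://example.org/sparql` are assumptions for illustration, and `run_test` is only defined here, not called, because it needs network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# The test query from the conditions above.
TEST_QUERY = """SELECT ?a ?b ?c WHERE {
    SELECT ?a ?b ?c WHERE { ?a ?b ?c } LIMIT 1
} VALUES ?x { "a" }"""


def build_test_url(endpoint_url):
    """Build a GET URL carrying the test query in the `query` parameter."""
    separator = "&" if "?" in endpoint_url else "?"
    return endpoint_url + separator + urlencode({"query": TEST_QUERY})


def run_test(endpoint_url, timeout=30):
    """Send the test query; return True if a JSON result set comes back.

    Hypothetical helper: requires network access and an endpoint that can
    return results in the SPARQL JSON results format.
    """
    request = Request(
        build_test_url(endpoint_url),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urlopen(request, timeout=timeout) as response:
        data = json.load(response)
    return isinstance(data.get("results", {}).get("bindings"), list)


url = build_test_url("https://example.org/sparql")  # hypothetical endpoint
print(url)
```

A successful response only shows that the endpoint accepts the query shape used for federation; the linking and licence conditions still need to be checked by hand.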

Please post the URL of the endpoint, a short description of it, and, if available, the URL of its documentation. Check first that the endpoint is not already in use or rejected. Thank you for helping to improve Wikidata.

Suggestions

Licence suitable

Endpoints that are immediately suitable for inclusion.

Attribution licences (like CC-BY)

Attribution licenses appear to be OK too; we will acknowledge them on the licensing page for the service. If for some reason such an acknowledgement is not enough, please do not add the endpoint here.
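The licence conditions on this page (CC0 preferred, attribution licences accepted with acknowledgement, NC/ND clauses not accepted) can be sketched as a small triage helper for sorting nominations. The function name `classify_licence` and the returned labels are hypothetical, and anything the page does not explicitly address (for example share-alike or non-Creative-Commons licences) is deliberately reported as unclear rather than guessed.

```python
def classify_licence(licence_id):
    """Triage a Creative Commons style identifier against this page's conditions.

    Hypothetical helper for illustration; it is not part of the query service.
    """
    normalized = licence_id.strip().upper().replace("_", "-").replace(" ", "-")
    parts = normalized.split("-")
    # Reuse restriction clauses (non-commercial, no-derivatives) are not accepted.
    if "NC" in parts or "ND" in parts:
        return "not suitable"
    # Ignore version numbers such as 1.0 or 4.0 when comparing clauses.
    clauses = [p for p in parts if not p.replace(".", "").isdigit()]
    if clauses == ["CC0"]:
        return "suitable (preferred)"
    if clauses == ["CC", "BY"]:
        return "suitable with acknowledgement"
    return "unclear"


for licence in ["CC0", "CC-BY-4.0", "CC BY-ND 4.0", "CC BY-NC-SA 4.0", "CC-BY-SA-4.0"]:
    print(licence, "->", classify_licence(licence))
```

The ND and NC-SA identifiers in the loop correspond to the licences of the Orphanet and DisGeNET nominations discussed further down this page.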

Licence unclear

The license status of these endpoints is unclear; please help us figure it out.

LOD Cloud Cache

Endpoint
http://lod.openlinksw.com/sparql
Documentation
Licence
Background
https://lists.w3.org/Archives/Public/public-lod/2013May/0154.html
https://sourceforge.net/p/virtuoso/mailman/message/32005015/

FactForge

SPARQL Endpoint
http://factforge.net/sparql
Federation endpoint
http://factforge.net/repositories/ff-news
Documentation
http://factforge.net/about
Background
FactForge represents a large-scale public demonstrator of many of GraphDB's advanced features: reasoning, geo-spatial indexing, RDFRank, full-text search connectors and owl:sameAs optimization. It loads several LOD datasets in a single GraphDB repository. On top of that, cleanup and other corrections are applied to some of these datasets and ontologies.

3cixty

Endpoint
https://kb.3cixty.com/sparql
Documentation
http://www.eurecom.fr/~troncy/Publications/Rizzo_Troncy-iswc15swc.pdf
Licence
Background
https://www.3cixty.com

3cixty provides comprehensive knowledge bases covering entire territories and cities. It contains millions of triples describing all points of interest, local businesses and events happening in the city. The knowledge base is updated every night. The SPARQL endpoint has had 99% availability over the past 2 years.

--Rtroncy (talk) 20:02, 23 February 2017 (UTC)

@Rtroncy: any idea about the licensing terms? --Smalyshev (WMF) (talk) 19:58, 11 April 2017 (UTC)
@Smalyshev: Sorry for the late reply, strangely, I didn't get any notifications! I control the endpoint. What license would be suitable for you? Rtroncy (talk) 07:35, 6 June 2017 (UTC)
@Rtroncy: CC0 ideally, but we agreed that CC-BY would be fine too, if you're OK with an acknowledgement like the one here: https://query.wikidata.org/copyright.html --Smalyshev (WMF) (talk) 22:03, 6 June 2017 (UTC)

Not suitable

Endpoint suggestions rejected for license or other reasons. May be reconsidered if license or circumstances change.

data.admin.ch

SPARQL Endpoint
http://data.admin.ch/query/
Documentation
http://data.admin.ch/ and https://github.com/zazuko/fso-lod/blob/master/doc/eCH0071/sparql.md
Licence
https://opendata.swiss/en/dataset/historicized-municipalities-register
Background
Contains data from Swiss government agencies as Linked Data, in particular from the Swiss Federal Statistical Office (FSO). Where possible, URIs on this endpoint link to Wikidata URIs.
@TheKtk: Unfortunately, the License link above returns 404. Could you update it? (done)

@TheKtk: The license requires prior permission for commercial use. Since Wikidata has no way of separating commercial and non-commercial users, this creates uncertainty for us which I'd rather avoid. Smalyshev (WMF) (talk) 21:15, 28 July 2018 (UTC)

geo.admin.ch Linked Data Service

SPARQL Endpoint
https://ld.geo.admin.ch/query
Documentation
https://ld.geo.admin.ch/
Licence
https://opendata.swiss/en/dataset/swissboundaries3d-gemeindegrenzen
Background
Linked Data representation of the swissBOUNDARIES 3D dataset by geo.admin.ch, the Swiss federal geoportal. URIs link to Wikidata URIs where appropriate, which is useful for visualizing data related to Swiss entities. One can get up-to-date shapes of these entities as Well-Known Text (WKT).
@TheKtk: Unfortunately, the License link above returns 404. Could you update it? (done)

Orphanet

SPARQL Endpoint
https://www.orpha.net/sparql
Documentation
http://www.orphadata.org/cgi-bin/index.php#ontologies
Item about database/website/endpoint
Orphanet (Q1515833)
Licence
https://creativecommons.org/licenses/by-nd/4.0/legalcode
Background
The Orphanet Rare Disease Ontology (ORDO) was jointly developed by Orphanet and the European Bioinformatics Institute (EMBL-EBI) to provide a structured vocabulary for rare diseases, capturing relationships between diseases, genes and other relevant features, forming a useful resource for the computational analysis of rare diseases. ORDO is derived from the Orphanet database (www.orpha.net), a multilingual database dedicated to rare diseases populated from literature and validated by international experts. It integrates a nosology (classification of rare diseases), relationships (gene-disease relations, epidemiological data) and connections with other terminologies (MeSH, UMLS, MedDRA), databases (OMIM, UniProtKB, HGNC, Ensembl, Reactome, IUPHAR, Gene atlas) or classifications (ICD10). The ontology is maintained by Orphanet and further populated with new data. The Orphanet Rare Disease Ontology is updated twice a year and follows the OBO guidelines on deprecation of terms. It constitutes the official ontology of rare diseases produced and maintained by Orphanet (INSERM, US14).
Wikidata contains links to many other similar resources related to diseases and genes, but also knowledge from non-biological resources. Being able to federate from the Wikidata Query Service would enable a better integration of the knowledge on rare diseases captured in Orphanet with knowledge in Wikidata and in the other SPARQL endpoints enabled in the WDQS, making Wikidata a true hub for linked data in the life sciences.
We would, for example, be able to enrich Orphanet with geospatial data or vital statistics on regions, e.g. population size. --Andrawaag (talk) 09:52, 14 November 2018 (UTC)

@Andrawaag: I'm concerned about ND license - is query output (which can transform data) considered "Adapted Material"? Smalyshev (WMF) (talk) 20:00, 14 November 2018 (UTC)

@Smalyshev (WMF): As Orphanet manager, we won't consider query output as "Adapted Material". The main idea of having ND is to avoid modified distribution of the Orphanet Rare Diseases Ontology in a medical context (such as integration in hospital software etc.), which could be done only through an "official" distribution.

Marc.hanauer (talk) 11:04, 15 November 2018 (UTC)

DisGeNET

SPARQL Endpoint
http://rdf.disgenet.org
Documentation
http://www.disgenet.org/web/DisGeNET/menu/rdf
Item about database/website/endpoint
DisGeNET (Q10465472)
Licence
https://creativecommons.org/licenses/by-nc-sa/4.0/
Background
DisGeNET is a discovery platform containing one of the largest publicly available collections of genes and variants associated to human diseases (Piñero et al., 2016; Piñero et al., 2015). DisGeNET integrates data from expert curated repositories, GWAS catalogues, animal models and the scientific literature. DisGeNET data are homogeneously annotated with controlled vocabularies and community-driven ontologies. Additionally, several original metrics are provided to assist the prioritization of genotype–phenotype relationships.
Allowing federated queries from Wikidata provides the means to check the consistency of knowledge in Wikidata, or to use Wikidata items as starting points for research questions that require more fine-grained information not yet captured in Wikidata. The DisGeNET database integrates human gene-disease associations (GDAs) from various expert-curated databases and text-mining derived associations including Mendelian, complex and environmental diseases. The integration is performed by means of gene and disease vocabulary mapping and by using the DisGeNET association type ontology.
Being able to query this from Wikidata and combine it with knowledge from Wikidata provides the means to run some very interesting queries that extend what is now possible by querying DisGeNET alone.

--Andrawaag (talk) 14:04, 14 November 2018 (UTC)

External lists

Other discussion

Please discuss on the talk page.

Incoming nominations

The nominations are initially placed here and then sorted and moved into the specific topics above.

European Bioinformatics Institute

SPARQL Endpoint
https://www.ebi.ac.uk/rdf/services/sparql
Documentation
https://www.ebi.ac.uk/rdf/documentation/
Item about database/website/endpoint
European Bioinformatics Institute (Q1341845)
Licence
https://www.ebi.ac.uk/about/terms-of-use
Background
EMBL-EBI's RDF platform provides access to EMBL-EBI databases as linked open data. It includes BioModels, Biosamples, ChEMBL, Gene Expression atlas, Bio-Ontologies, Reactome and Ensembl.

NBDC RDF Portal - DDBJ

SPARQL Endpoint
https://integbio.jp/rdf/ddbj/sparql
Documentation
https://integbio.jp/rdf/?view=detail&id=ddbj
Item about database/website/endpoint
National Bioscience Database Center (Q45131356) maintains the SPARQL endpoint, but National Institute of Genetics (Q576466) provides the data.
Licence
http://www.insdc.org/policy.html
Background
The SPARQL endpoint contains the semantic representation of DDBJ annotated sequence records. This covers genes, sequence data and other biological and life science data, and specifically also contains sample data. Since other life science data is already represented in Wikidata, being able to run federated queries allows the life science community to combine their data with knowledge in Wikidata and the NBDC RDF Portal.

--Andrawaag (talk) 18:09, 14 November 2018 (UTC)

NBDC RDF Portal - DBKERO

SPARQL Endpoint
http://integbio.jp/rdf/kero/sparql
Documentation
http://integbio.jp/rdf/?view=detail&id=kero
Item about database/website/endpoint
National Bioscience Database Center (Q45131356) maintains the SPARQL endpoint.
Licence
http://integbio.jp/rdf/?view=detail&id=kero
Background
The SPARQL endpoint contains the semantic representation of DBKERO, which is a collection of multi-omics data sets including SNV, RNA-seq, ChIP-seq, BS-seq and TSS-seq. Since other life science data is already represented in Wikidata, being able to run federated queries allows the life science community to combine their data with knowledge in Wikidata and the NBDC RDF Portal.

--Skwsm (talk) 18:27, 14 November 2018 (UTC)

NBDC RDF portal

SPARQL Endpoint
https://integbio.jp/rdf/sparql
Documentation
http://integbio.jp/rdf/?view=manual
Item about database/website/endpoint
National Bioscience Database Center (Q45131356) and Database Center for Life Science (Q11346475) maintain the SPARQL endpoint.
Licence
http://integbio.jp/rdf/?view=manual
Background
The National Bioscience Database Center (NBDC) in Japan has developed the NBDC RDF portal, a repository service for RDFized life science databases developed by Japanese research groups. As of Nov. 2018, the SPARQL endpoint contains 19 RDF datasets. Since other life science data is already represented in Wikidata, being able to run federated queries allows the life science community to combine their data with knowledge in Wikidata and the NBDC RDF Portal.

--Skwsm (talk) 09:53, 15 November 2018 (UTC)