Topic on User talk:Smalyshev (WMF)

Rotpunkt (talkcontribs)

Hi, could you check the discussion at Property talk:P856 (the section is "Please normalize?")? P856 is a very important property, and Jura1 has made two edits that seem wrong to me. Unless there are implementation details that need to be explained, we can't impose this rule (P856 URLs ending with a slash), even on a limited subset of items (Wikimedia site items), just to make a query work.

Smalyshev (WMF) (talkcontribs)

Thank you for bringing it to my attention. Jura's proposal sounds fine to me: a canonical form for Wiki* site URLs makes sense, especially in the context of the database (i.e. aimed at automated processing, including by tools that don't have enough brains to know that http://en.wikipedia.org and http://en.wikipedia.org/ are the same URL). Why do you think we can't impose this rule? Of course, it is somewhat arbitrary, as many formats and conventions are, including most human and programming languages. The point is not that one way is better than the other; the point is having one way of saying it instead of several, because that makes understanding and automatic processing much easier, and Wikidata is, in part, aimed at automatic processing of data.
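As a rough sketch of why one canonical form helps such tools: a naive string comparison treats the two spellings as different, while a small normalization step makes them equal. The helper name canonical_site_url is hypothetical and only illustrates the idea; it is not part of Wikidata or of any tool mentioned here.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_site_url(url: str) -> str:
    """Lowercase the scheme and host, and use "/" when the path is empty (hypothetical helper)."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, parts.fragment)
    )


raw_a = "http://en.wikipedia.org"
raw_b = "http://en.wikipedia.org/"

# A tool that compares raw strings sees two different values.
print(raw_a == raw_b)

# After normalization both spellings collapse to the same string.
print(canonical_site_url(raw_a) == canonical_site_url(raw_b))
```

Storing the value in canonical form in the first place means consumers do not each need a step like this.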

Rotpunkt (talkcontribs)

Hi, if there are no other solutions, it's ok. However, I think this rule is a mistake, because we have a property, P856, that can hold any value (with or without a trailing slash)... BUT in a limited subset of items, a trailing slash is mandatory to make a query work => from a programming point of view it looks like a workaround.

If the problem is only obtaining the Wikimedia project item of a sitelink, why can't we get it directly?

For simplicity, in the following example I use Q17518688, which has only one sitelink (language sv, Swedish Wikipedia => item Q169514).

We know that from Q17518688 using "schema:inLanguage", we can get the language directly from the sitelink:

  • SELECT ?language WHERE { ?sitelink schema:about wd:Q17518688 . ?sitelink schema:inLanguage ?language } => returns "sv".

So, why can't we get the Wikimedia project item Q169514 from the sitelink, in the same way we get the language?

With a hypothetical predicate "schema:sitelinkitem", we should be able to get the item related to sv.wikipedia.org in the same way:

  • SELECT ?item WHERE { ?sitelink schema:about wd:Q17518688 . ?sitelink schema:sitelinkitem ?item } => In this case the query should return Q169514.

I am not an expert in SPARQL and RDF, but it sounds reasonable to me.

Smalyshev (WMF) (talkcontribs)

> So, why can't we get the wikimedia project item Q169514 from the sitelink,

We kind of can. See:

SELECT ?wikiitem WHERE {
  ?sitelink schema:about wd:Q17518688 .
  ?sitelink schema:isPartOf ?wikilink .
  ?wikiitem wdt:P856 ?wikilink
}

However, for this to work, P856 must have the same values as schema:isPartOf does, which means the values of P856 for wiki sites must be in one specific format. Adding schema:sitelinkitem (we can't really use schema: since it's not our namespace, but that's beside the point) would be much harder, as it would require finding out which Wikidata item corresponds to every URL while generating the RDF data, and that's not a trivial matter. schema:isPartOf, by contrast, can be generated from data available directly in the item data.
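To illustrate why the format matters, here is a toy emulation of that join outside SPARQL. The join succeeds only when the two values are identical strings, so a missing trailing slash yields no result. The article URL, the exact URL shapes and the function name find_wiki_items are assumptions made up for this sketch, not actual dump data.

```python
# Toy stand-in for "?sitelink schema:isPartOf ?wikilink".
# The article URL is a made-up placeholder; the site URL shape is an assumption.
SITELINK_IS_PART_OF = {
    "https://sv.wikipedia.org/wiki/Example": "https://sv.wikipedia.org/",
}


def find_wiki_items(sitelink, official_websites):
    """Return items whose P856 value is exactly the sitelink's isPartOf value."""
    wiki_url = SITELINK_IS_PART_OF[sitelink]
    return [item for item, url in official_websites.items() if url == wiki_url]


# Toy stand-ins for "?wikiitem wdt:P856 ?wikilink", with and without the slash.
with_slash = {"Q169514": "https://sv.wikipedia.org/"}
without_slash = {"Q169514": "https://sv.wikipedia.org"}

sitelink = "https://sv.wikipedia.org/wiki/Example"
print(find_wiki_items(sitelink, with_slash))     # the item is found
print(find_wiki_items(sitelink, without_slash))  # empty: the strings differ
```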

Rotpunkt (talkcontribs)

Hi Smalyshev, I understand that you use wdt:P856 as a link between the sitelink and the Wikipedia project item. For example, we could also use wdt:P424 instead of P856:

SELECT ?wikiitem
WHERE
{
  ?sitelink schema:about wd:Q17518688 .
  ?sitelink schema:inLanguage ?language .
  ?wikiitem wdt:P31 wd:Q10876391; wdt:P424 ?language
}

But my question is: why do we need a property to create a relationship between a sitelink (which is obviously related to a Wikimedia project item) and its Wikimedia project item? Isn't there a record that associates the Swedish sitelink for Q17518688 with the Swedish Wikipedia, Q169514, without the need for any properties at all? You said that this is not a trivial matter. Why is it hard? Thanks for your time and your answers.

Smalyshev (WMF) (talkcontribs)

It is not trivial because going from a URL to a Wikidata item requires either pre-generating the list (extra work) or querying on each URL (very slow, unsuitable for 18M items, as the dump should finish in reasonable time). I'm not saying it's impossible, I'm just saying it requires a non-trivial amount of work, and given that this is already possible with a rather simple query, such work would not be a high priority.
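A minimal sketch of the "pre-generate the list" option, to show where the extra work sits: the URL-to-item table has to be built once up front, after which each sitelink costs only a dictionary lookup. This is not how the dump code works; the names build_site_index and annotate_sitelinks, the article URL and the URL shapes are hypothetical.

```python
def build_site_index(site_items):
    """Pre-generate the site URL -> item lookup once, before the dump starts."""
    return {url: item for item, url in site_items}


def annotate_sitelinks(sitelinks, index):
    """Pair each article URL with its wiki project item (None if the site is unknown)."""
    return [(article, index.get(site)) for article, site in sitelinks]


# Assumed input: (item, site URL) pairs for wiki project items.
site_items = [("Q169514", "https://sv.wikipedia.org/")]
index = build_site_index(site_items)

# Assumed input: (article URL, site URL) pairs; the article URL is a placeholder.
sitelinks = [
    ("https://sv.wikipedia.org/wiki/Example", "https://sv.wikipedia.org/"),
    ("https://example.org/wiki/Other", "https://example.org/"),
]
result = annotate_sitelinks(sitelinks, index)
print(result)
```

The alternative, one query per URL, avoids building the table but repeats the search for every sitelink, which is what makes it too slow at dump scale.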

Smalyshev (WMF) (talkcontribs)

I think schema:isPartOf now implements this.