Wikidata:Property proposal/New York Times Semantic Concept

New York Times Semantic Concept proposals

New York Times Semantic Concept: Person

   Done: no label (P2690) (Talk and documentation)
Description: The New York Times associates Semantic Concepts with articles; these concepts are exposed both through metadata embedded in the HTML of articles and through its Semantic API. This property identifies people; others identify places, organizations and descriptors.
Data type: String
Template parameter: none
Domain: people
Example: Barack Obama (Q76) → "Obama, Barack"
Source: http://developer.nytimes.com/docs/semantic_api
Formatter URL: http://api.nytimes.com/svc/semantic/v2/concept/name/nytd_per/$1?api-key=your-API-key
Robot and gadget jobs: The Semantic API contains incomplete links to Wikipedia, Freebase and DBpedia; these could be imported on an ongoing basis. Additional unmaintained data is available from data.nytimes.com, which could be imported as a one-time load.
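
As a rough illustration of how the formatter URL above would be expanded, the sketch below substitutes a property value for $1 after percent-encoding it. The helper name expand_formatter is hypothetical, "your-API-key" remains a placeholder, and the proposal does not confirm how the API expects names to be encoded, so treat the encoding step as an assumption.

```python
from urllib.parse import quote

# Formatter URL as given in the proposal; "your-API-key" is a placeholder,
# not a working key.
FORMATTER = (
    "http://api.nytimes.com/svc/semantic/v2/concept/name/"
    "nytd_per/$1?api-key=your-API-key"
)


def expand_formatter(formatter: str, value: str) -> str:
    """Replace $1 with the percent-encoded value (hypothetical helper).

    Assumption: the API accepts standard percent-encoding for the comma
    and space found in labels such as "Obama, Barack".
    """
    return formatter.replace("$1", quote(value, safe=""))


url = expand_formatter(FORMATTER, "Obama, Barack")
print(url)
```

The same helper would apply to the nytd_geo, nytd_org and nytd_des formatter URLs given in the other proposals on this page.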
Motivation

Being able to associate New York Times articles with semantic concepts opens up an avenue for many third-party applications to interact with their data. The Times has already associated a subset of their controlled vocabulary with Wikipedia items, so part of the work is already done.

This submission follows discussion with Thryduulf and Runner1928 on a previous submission. Thinkcontext (talk) 18:49, 24 February 2016 (UTC)

Discussion

New York Times Semantic Concept: Location

   Done: no label (P2692) (Talk and documentation)
Description: The New York Times associates Semantic Concepts with articles; these concepts are exposed both through metadata embedded in the HTML of articles and through its Semantic API. This property identifies locations; others identify people, organizations and descriptors.
Data type: String
Template parameter: none
Domain: locations
Example: Acapulco (Q81398) → "Acapulco (Mexico)"
Source: http://developer.nytimes.com/docs/semantic_api
Formatter URL: http://api.nytimes.com/svc/semantic/v2/concept/name/nytd_geo/$1?api-key=your-API-key
Robot and gadget jobs: The Semantic API contains incomplete links to Wikipedia, Freebase and DBpedia; these could be imported on an ongoing basis. Additional unmaintained data is available from data.nytimes.com, which could be imported as a one-time load.
Motivation

Being able to associate New York Times articles with semantic concepts opens up an avenue for many third-party applications to interact with their data. The Times has already associated a subset of their controlled vocabulary with Wikipedia items, so part of the work is already done.

This submission follows discussion with Thryduulf and Runner1928 on a previous submission. Thinkcontext (talk) 18:49, 24 February 2016 (UTC)

Discussion

New York Times Semantic Concept: Organization

   Done: no label (P2691) (Talk and documentation)
Description: The New York Times associates Semantic Concepts with articles; these concepts are exposed both through metadata embedded in the HTML of articles and through its Semantic API. This property identifies organizations; others identify places, people and descriptors.
Data type: String
Template parameter: none
Domain: organizations
Example: 3M (Q159433) → "3M Company"
Source: http://developer.nytimes.com/docs/semantic_api
Formatter URL: http://api.nytimes.com/svc/semantic/v2/concept/name/nytd_org/$1?api-key=your-API-key
Robot and gadget jobs: The Semantic API contains incomplete links to Wikipedia, Freebase and DBpedia; these could be imported on an ongoing basis. Additional unmaintained data is available from data.nytimes.com, which could be imported as a one-time load.
Motivation

Being able to associate New York Times articles with semantic concepts opens up an avenue for many third-party applications to interact with their data. The Times has already associated a subset of their controlled vocabulary with Wikipedia items, so part of the work is already done.

This submission follows discussion with Thryduulf and Runner1928 on a previous submission. Thinkcontext (talk) 18:49, 24 February 2016 (UTC)

Discussion

New York Times Semantic Concept: Descriptor

   Done: no label (P2693) (Talk and documentation)
Description: The New York Times associates Semantic Concepts with articles; these concepts are exposed both through metadata embedded in the HTML of articles and through its Semantic API. This property identifies descriptors; others identify places, organizations and people.
Data type: String
Template parameter: none
Domain: descriptors
Example: absenteeism (Q332278) → "Absenteeism"
Source: http://developer.nytimes.com/docs/semantic_api
Formatter URL: http://api.nytimes.com/svc/semantic/v2/concept/name/nytd_des/$1?api-key=your-API-key
Robot and gadget jobs: The Semantic API contains incomplete links to Wikipedia, Freebase and DBpedia; these could be imported on an ongoing basis. Additional unmaintained data is available from data.nytimes.com, which could be imported as a one-time load.
Motivation

Being able to associate New York Times articles with semantic concepts opens up an avenue for many third-party applications to interact with their data. The Times has already associated a subset of their controlled vocabulary with Wikipedia items, so part of the work is already done.

This submission follows discussion with Thryduulf and Runner1928 on a previous submission. Thinkcontext (talk) 18:49, 24 February 2016 (UTC)

Discussion
  •  Support all three New York Times semantic concept proposals. Thryduulf (talk: local | en.wp | en.wikt) 19:22, 24 February 2016 (UTC)
  •  Strong support for all four NYT proposals. http://data.nytimes.com/ has all data available in one file per semantic concept type. Those files offer multiple ways to auto-match records, including GeoNames, DBpedia, Freebase, and more. I'd love to hear more from @Thinkcontext: about ways to automatically populate this information. Runner1928 (talk) 16:16, 25 February 2016 (UTC)
    • @Runner1928: topics.nytimes.com has what I believe is a complete listing of the topics which correspond to the concepts. It should be a simple matter to scrape the topic text and do a lookup against the API to get the Wikipedia, Freebase or DBpedia reference. data.nytimes.com contains RDF dumps which have all the information, so it would be best to do a mass import of those first and then set up a bot to monitor the topic listing.
  •  Comment These "formatter URL" values will NOT work with Wikidata - they require a developer API key - the "your-API-key" piece. When you follow the links as given, they return an error page ("Inactive developer"). Unless we can find a URL for these concepts that does not depend on personal API keys, I am uncomfortable with adding these identifiers to Wikidata. ArthurPSmith (talk) 20:27, 29 February 2016 (UTC)
  • @ArthurPSmith: I see, thanks for the comment. As you can tell, I've been having difficulty figuring out how and where this information would best fit into Wikidata. I think there's considerable value in having it in Wikidata; do you have any suggestion as to where it would best fit? As a generic property? Thinkcontext (talk) 14:56, 1 March 2016 (UTC)
    • @Thinkcontext: I agree it would be nice to include this data somehow in Wikidata. But I don't think you have the right data format here. Looking at their "about" page, the IDs should be numbers, not the words you list, and the formatter should be simply http://data.nytimes.com/$1 - as that about page says, "For instance our subject heading for "Park Slope, Brooklyn" is associated with the URI http://data.nytimes.com/60694995023816375851." So I think the ONLY thing Wikidata should have is those numbers (represented as external identifiers) linked to the associated Wikidata items via something like a "NY Times Semantic Concept ID". We definitely don't want to depend on a developer API - they provide open linked data, we should use that. ArthurPSmith (talk) 15:04, 1 March 2016 (UTC)
    • Hmm, I just noticed this issue was brought up previously (earlier proposal still on this page). However, I think your argument is wrong. If the NY Times is not providing this data without restricting it (via the developer API), then it should not be in Wikidata. We should take what they give us as open, even if it hasn't been kept up to date. There are English text labels associated with each concept at data.nytimes.com (the skos:prefLabel value) and there is also a search API link provided, so I don't think your concerns about being unable to look things up from the number are valid. And I think the number is the only reasonable thing we can use here. ArthurPSmith (talk) 15:19, 1 March 2016 (UTC)
      • @ArthurPSmith: The text labels are not only exposed by the API. In fact, the reason I feel the text labels have great value for linked data applications is that they are exposed within the HTML of articles as metadata. An example from today's paper: "What to Watch For on Super Tuesday" has <meta name="per" content="Clinton, Hillary Rodham" /> (as well as a variety of other tags for other people and subjects), which NYT lets us know corresponds to http://dbpedia.org/resource/Hillary_Rodham_Clinton. It's extremely rare for a major news organization to connect its own internal processes to the linked data cloud in this way; we would potentially be giving up a great deal of value by not recording this information. As for the numerical ids, I personally don't see the value of them and feel that having obsolete ids is confusing clutter. Thinkcontext (talk) 16:01, 1 March 2016 (UTC)
        • @Thinkcontext: Can you give a reference for why you think the numerical ids are obsolete? There is no sign of it on the NY Times web pages, including the one you linked to - the Semantic API page explicitly says: "As part of its work on linked data New York Times R&D created http://data.nytimes.com, a site where we publish the vocabulary used to index concepts. The site assigns a unique URI to each concept, and in the form of http://data.nytimes.com/concept_uri. This "concept_uri" is a mandatory parameter for accessing the Semantic API with a linked data reference." and that "concept_uri" is the numerical id we are talking about here. The REASON it is important to have a numerical id instead of a name is - suppose you have an article tagged with "Clinton, Hillary Rodham" - and then she gets divorced and changes her name back to "Rodham, Hillary". Suddenly all your old semantic concept ids are invalid. But the numerical id refers to the same person and is still good. It doesn't look to me like the NY Times is providing a strong guarantee regarding the permanence of the concept labels, and the concept_uri values are being promoted. Unless you can provide some more evidence on why you think the data.nytimes.com stuff is really abandoned, it seems to me the numerical ids are the ones we need here. Or perhaps you can give a more complete use case describing an instance where the labels would be helpful to have in Wikidata and a numerical id would not? I'm not seeing the problem right now. ArthurPSmith (talk) 16:42, 1 March 2016 (UTC)
          • @ArthurPSmith: I feel that the numerical id is obsolete because data.nytimes.com has not been updated for around 5 years. This snapshot from archive.org in 2011 has the exact same number of entities listed as today, so no new items have entered that data set. I have examined the SKOS data dumps and verified that they contain the same number. The individual record for Hillary Clinton on that site lists a last used time in 2010. I'd definitely prefer to have a permanent id, but if they have abandoned it I don't think it would be good for Wikidata to use it, especially because they are over 20,000 items short now. Your point about changing names is a good one; I think these would be better as generic properties rather than authority control ids, so I guess I'll resubmit. Thinkcontext (talk) 19:37, 1 March 2016 (UTC)
            • Well, the snapshot you point to says exactly the same thing as the current page about the total too, so apparently the 20,000 missing hasn't changed since 2011: "The New York Times uses approximately 30,000 tags to power our Times Topics Pages. It is our intention to publish all of these tags as linked open data." Is there any third-party discussion on what the NY Times is doing on this, or a more recent quote from them? Also, how would you source the full dataset if it's not available from data.nytimes.com? ArthurPSmith (talk) 21:10, 1 March 2016 (UTC)
              • @ArthurPSmith: topics.nytimes.com contains most of the additional labels; scraping them is a simple matter. I'm not aware of any more recent statement from them about their open data plans. Thinkcontext (talk) 22:25, 2 March 2016 (UTC)
                • @Thinkcontext: Well, I'm not sure this is going anywhere between just the two of us; this will be my last comment here on this. I just don't feel I can support this proposal at this time, though in principle I would like to see NY Times semantic concepts linked. Looking at topics.nytimes.com, the list seems at least as out of date as whatever is on data.nytimes.com. Not a single one of the 2016 presidential contenders is listed on the "People" topics page, for example, not even Trump or Clinton. Maybe the best way forward would be to email the NY Times and try to convince them to use Wikidata IDs as semantic concepts, if they are having trouble maintaining their own database. ArthurPSmith (talk) 14:20, 3 March 2016 (UTC)
  • I'm not sure why this has been posted as a new proposal; but please see my comments in the previous discussion, which this proposal does not address. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 16:57, 3 March 2016 (UTC)
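
The article metadata mentioned in the thread above can be read with a few lines of code. The following is a minimal sketch using Python's standard-library html.parser, fed with the single tag quoted in the discussion. The class name ConceptMetaParser is hypothetical, and the meta names other than "per" (geo, org, des) are assumptions made by analogy with the nytd_* type names in the formatter URLs; the discussion only confirms "per".

```python
from html.parser import HTMLParser


class ConceptMetaParser(HTMLParser):
    """Collect (type, label) pairs from <meta> tags (hypothetical helper)."""

    # Only "per" appears in the discussion; the other names are assumed.
    CONCEPT_NAMES = {"per", "geo", "org", "des"}

    def __init__(self):
        super().__init__()
        self.concepts = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also reach this method.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name in self.CONCEPT_NAMES and content:
            self.concepts.append((name, content))


# Sample markup built around the tag quoted in the discussion.
sample_html = (
    "<html><head>"
    '<meta name="per" content="Clinton, Hillary Rodham" />'
    '<meta name="description" content="Unrelated metadata" />'
    "</head><body></body></html>"
)

parser = ConceptMetaParser()
parser.feed(sample_html)
print(parser.concepts)
```

Each extracted label is the kind of string these properties would store, so a tool could match it against Wikidata statements without calling the keyed API.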

@Thinkcontext, Thryduulf, Runner1928: ✓ Done. This was a hard one. Frankly speaking, the linked data initiative of the New York Times is a mess. On the one hand you have an open, non-restricted API, using surrogate keys, with data under a CC license. This seems to have been abandoned five years ago. On the other hand, you have a restricted, registration-required, unfree API, using natural keys, for which we have no stability guarantee, but which seems to be kept fairly up-to-date. (I registered and got an API key. I checked two items. The entry for Zaha Hadid (Q47780), who died today and has an obituary on www.nytimes.com, had not been updated. On the other hand, the entry for Bernie Sanders was last updated in February.) Neither API references the other. Nevertheless, I think there is inherent value in linking our concepts to those of the New York Times as a major news outlet. The outdated data is not an option, so we have to use the current API, despite its problems. Let's hope their data becomes more open in the future. --Srittau (talk) 20:08, 31 March 2016 (UTC)