User talk:ProteinBoxBot

From Wikidata


Hello from English Wikipedia's WikiProject Medicine

I saw your post on English Wikipedia about a template for presenting health information in Wikidata on English Wikipedia, and then other languages. It is an exciting idea!

On English Wikipedia bold projects like this are eventually discussed a bit at en:Wikipedia talk:WikiProject Medicine. Talking about this at the template where you are talking now is an appropriate place to begin. I see that here on Wikidata you are requesting bot permission, which is also good.

I am writing to say hello and to say I support your project. Also I wanted to invite you to English Wikiproject Medicine to chat anytime. Blue Rasberry (talk) 17:00, 27 July 2015 (UTC)

@Bluerasberry: We already have the bot flag, but we are currently seeking tasks approval for adding genes, diseases and soon registered drugs. We could use some support on the respective request pages. Wikidata:Requests_for_permissions/Bot/ProteinBoxBot_2 Wikidata:Requests_for_permissions/Bot/ProteinBoxBot_3 Andrawaag (talk) 17:44, 27 July 2015 (UTC)

Bad formatting of some numbers

Your bot creates items and adds some identifiers with bad formatting: see Wikidata:Database_reports/Constraint_violations/P661#.22Format.22_violations. The number is without decimal. Can you please correct this? Snipre (talk) 18:08, 12 August 2015 (UTC)

@Snipre: This happened accidentally and I fixed it immediately after the problem occurred. Sebotic (talk) 20:17, 12 August 2015 (UTC)
No problem, I have seen that problem in the past, so I understand the case. But if you can correct it using your bot, this will reduce my workload. Thanks Snipre (talk) 08:41, 13 August 2015 (UTC)

Don't delete sourced data

Hello, sorry to complain again, but can you avoid replacing existing sourced statements with other statements? If I look at that edit and take the PubChem ID identifier as an example, you replace a correct value with the same value and provide a less detailed source. You should modify your code to include an analysis step before doing any replacement. The period of large data imports is finished and there are people working on curation, so please respect their work. Before any change, can you include a check that compares your data with the existing data and, if there is no difference, leaves the value unchanged? Snipre (talk) 09:20, 13 August 2015 (UTC)

@Snipre: Sorry for overwriting those citations, I noticed that this occurred only on very few elements. I will add a check for such detailed source statements. The reason I replaced all source statements of matching values is simply because the vast majority are "imported from <language> wikipedia". Sebotic (talk) 09:32, 13 August 2015 (UTC)
I understand your approach, but it is no longer appropriate when you consider that other people or bots are working in parallel with yours. You have to analyze the value and the reference before modifying anything:
  1. value is different and no source -> replace
  2. value is different and WP as source -> replace
  3. value is different and sourced with "stated in" -> add a new statement
  4. value is the same and no source -> add the source
  5. value is the same and WP as source -> replace the source
  6. value is the same and sourced with "stated in", source is the same as yours -> do nothing
  7. value is the same and sourced with "stated in", source is different -> add a new statement
By the way, can you please modify your source structure to match the requirements defined in Help:sources? How would you cite the source when you have only "stated in: DrugBank"? That information alone is not enough to create a link to the database in a WP article.
Snipre (talk) 09:52, 13 August 2015 (UTC)
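
Editorial sketch: the seven cases listed above can be read as a small decision function. The following Python is only an illustration under stated assumptions, not the bot's actual code. The names `Action`, `decide` and the `source_kind` labels (`None`, `"wikipedia"`, `"stated_in"`) are hypothetical and were chosen for this example.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    REPLACE_STATEMENT = "replace"
    ADD_STATEMENT = "add a new statement"
    ADD_SOURCE = "add the source"
    REPLACE_SOURCE = "replace the source"
    DO_NOTHING = "do nothing"


def decide(existing_value: str, new_value: str,
           source_kind: Optional[str],
           existing_source: Optional[str], new_source: str) -> Action:
    """Map the seven cases from the discussion onto one action.

    source_kind is a hypothetical label for the existing reference:
    None (unsourced), "wikipedia" (imported from a Wikipedia), or
    "stated_in" (a real "stated in" reference).
    """
    if existing_value != new_value:
        if source_kind in (None, "wikipedia"):
            return Action.REPLACE_STATEMENT   # cases 1 and 2
        return Action.ADD_STATEMENT           # case 3
    if source_kind is None:
        return Action.ADD_SOURCE              # case 4
    if source_kind == "wikipedia":
        return Action.REPLACE_SOURCE          # case 5
    if existing_source == new_source:
        return Action.DO_NOTHING              # case 6
    return Action.ADD_STATEMENT               # case 7
```

Keeping the decision separate from the code that writes to Wikidata would make each case easy to test without touching live items.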


Hi, your bot is creating two entries for the same source: the retrieved (P813) is being placed as a new source. --Almondega (talk) 11:45, 29 September 2015 (UTC)

Hi, could you give an example of an entry where this happens? Andrawaag (talk) 12:30, 29 September 2015 (UTC)
Here: Ibritumomab (Q635415) -Almondega (talk) 17:28, 29 September 2015 (UTC)
@Almondega: This behavior was actually on purpose in order to stay consistent within all bots we created. But I agree that it would be better to have it all in one reference, so I will change the behaviour. Thx! Sebotic (talk) 17:21, 29 September 2015 (UTC)

Bot malfunction: images

The bot appears to add nonexistent images with external URLs (unsupported): Special:Diff/254499529. Please cease operations until this is fixed and correct any errors. --- Jura 06:33, 3 October 2015 (UTC)

Thanks for reporting. I was already aware of this issue and had fixed it in the code of the bot. I fixed the one reported, and there appear to be no other incorrect images. Andrawaag (talk) 07:44, 3 October 2015 (UTC)

genomic start/end

Is it right that genes have multiple genomic start (P644) and genomic end (P645) values, e.g. Q18062557? If so, the single value constraints on these two properties should be removed. --Pasleim (talk) 08:37, 5 October 2015 (UTC)

@Pasleim: It is indeed correct that genes can have multiple location values, as you can see here on the example given. Another reason we need multiple values for start/end positions is that when other genomic builds are added, most gene items will have multiple positions reported. This is mainly why we recently added the build number as a qualifier to the location statements. Andrawaag (talk)

Bot not cleaning the images parameter correctly

Be careful when importing images with the bot. It seems you haven't taken into account that images can be specified as thumbs, so the bot is not importing the clean name, as can be seen here, and it is also adding HTML comments that are in the image field, as can be seen here. I am already correcting the ones detected by the mandatory constraint violations, but there may be more. -- Agabi10 (talk) 00:17, 16 October 2015 (UTC)

@Agabi10: Sorry, I am about to fix that. Sebotic (talk) 17:17, 20 October 2015 (UTC)
Yeah, don't worry about that, I'm some kind of programmer, so I understand that software can do things we don't expect. As long as the unexpected behavior is solvable, there should not be big problems. The important thing is detecting the problems and solving them. That was the reason for my message, not blaming or anything like that. -- Agabi10 (talk) 21:57, 20 October 2015 (UTC)
@Sebotic: Hi again, I'm here again mostly because the problem when importing images is still there. Not only does the bot keep adding the image parameter without the correct cleaning, it also removed the corrections I made last time to all the incorrect imports (they are all back and they replaced all my corrections). Would you mind checking that bug again or, at least, stop importing the images? Thank you. -- Agabi10 (talk) 22:53, 28 October 2015 (UTC)
@Agabi10: oh, well, you are right. I reran the old bot code instead of the one where I fixed the issue. Sorry for that. I will also replace the thumbnail images added in the GeneAtlas image property with the full version. Bot running. Thanks for your input! Sebotic (talk) 23:41, 28 October 2015 (UTC)
@Sebotic: Thanks for running the bot to correct the incorrect images. By the way, how complicated is it to make the bot check whether a claim already exists for an entity before updating it? In this diff I realized that it is updating the retrieved reference values but the properties' values are the same. Or is that the way it should behave? -- Agabi10 (talk) 23:57, 28 October 2015 (UTC)
@Agabi10: Currently, this is intended behaviour, but we are thinking about moving away from it. BTW: the GeneAtlas Images will be removed from protein items as soon as we are sure that all protein items are in Wikidata. Sebotic (talk) 00:08, 29 October 2015 (UTC)

@Sebotic: There is no problem on maintaining the GeneAtlas images, anyway, the incorrectly added images I mean are the ones in image (P18) and you can check how it is added for example in EIF2AK3 (Q18034214). -- Agabi10 (talk) 00:15, 29 October 2015 (UTC)
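
Editorial sketch: the cleaning step discussed in this thread (thumb options and HTML comments in an infobox image field) could look roughly like the Python below. This is an assumption about how such a field might be normalised, not the bot's real code. The function name `clean_image_name` is hypothetical, and real infobox values may need more cases than shown here.

```python
import re

# HTML comments such as <!-- only free images --> inside the field
_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# Leading "File:" or "Image:" namespace prefix, case-insensitive
_PREFIX = re.compile(r"^(?:file|image)\s*:\s*", re.IGNORECASE)


def clean_image_name(raw: str) -> str:
    """Return a bare file name from a raw infobox image field."""
    text = _COMMENT.sub("", raw).strip()
    # Unwrap [[File:Name.png|thumb|...]] style links
    if text.startswith("[[") and text.endswith("]]"):
        text = text[2:-2]
    # Everything after the first pipe is display options (thumb, size, caption)
    text = text.split("|", 1)[0].strip()
    text = _PREFIX.sub("", text)
    # Underscores and spaces are equivalent in file names
    return text.replace("_", " ").strip()
```

An empty result would signal that the field held no usable file name, in which case no image (P18) statement should be written.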

NCI Thesaurus ID (P1748)=TCGA

Can you have a look at [1]. TCGA and OMFAQ are invalid codes on NCI-Thesaurus. --Pasleim (talk) 08:54, 18 October 2015 (UTC)

@Pasleim (talk) Thank you for the heads up. Fixing them right now. Emitraka (talk) 15:19, 18 October 2015 (UTC)

setting unspecific description

You are setting a rather unspecific English description when a better one was already available in the item. Can you please look into this? One example is the recent edits in Marfan Syndrome (Q208562). --LydiaPintscher (talk) 08:14, 20 October 2015 (UTC)

Indeed, thanks for noticing. We will adapt the bot to keep better descriptions. Andrawaag (talk) 10:25, 20 October 2015 (UTC)

Proposing a common reference model for Wikidata items added by the ProteinBoxBot

Currently, different reference models exist for content added to Wikidata. I have been using the following set of properties in the references I add:

  • Stated in (P248)
  • Imported from (P143)
  • retrieved (P813).

"Stated in" is used here to point to the specific version of the database used (e.g. Disease ontology release 2015-04-16 (Q19832982)), "imported from" to the database name in general (e.g. Disease ontology), and "retrieved" to a timestamp of when the data was retrieved into Wikidata.

This reference model has grown gradually and is not on par with the reference model proposed in the help section, where the following is suggested:

  • database property → database property ID (the unique identifier for the data as per the database)
  • title (P1476) → the title of the dataset in the database
  • language of work (or name) (P407) → the language of the database
  • publication date (P577) → the publication date for the data. If no publication date is provided use retrieved (P813), the date when the data was taken from the database

I will adapt the reference model to the one suggested in the help section. However, I believe it is too limited for our purposes, and I would like to extend the list of reference properties as follows:

  • database property → database property ID (the unique identifier for the data as per the database)
  • title (P1476) → the title of the dataset in the database
  • language of work (or name) (P407) → the language of the database
  • publication date (P577) → the publication date for the data.
  • retrieved (P813) → the date the data got imported into Wikidata
  • imported from (P143) → The api platform ID (eg.,, etc)
  • stated in (P248) → a specific release of that database; if releases are scattered or too frequent, use:
  • data version (to be proposed)

Any objections, adjustments, suggestions? --Andrawaag (talk) 21:54, 21 October 2015 (UTC)
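
Editorial sketch: the proposed model can be pictured as a single reference block that carries all of the listed properties together. The Python below is a hypothetical illustration only. `build_reference` is not part of any Wikidata library, the plain dict is a simplification of the real reference format, and P699 (Disease Ontology ID) with the value `DOID:0000000` is used purely as a placeholder for the "database property".

```python
from datetime import date
from typing import Dict, Optional


def build_reference(db_property: str, db_id: str, retrieved: date,
                    stated_in: Optional[str] = None,
                    imported_from: Optional[str] = None,
                    title: Optional[str] = None,
                    language: Optional[str] = None,
                    publication_date: Optional[date] = None) -> Dict[str, str]:
    """Collect all reference properties into ONE reference block."""
    reference = {db_property: db_id, "P813": retrieved.isoformat()}
    optional = {
        "P248": stated_in,        # specific release of the database
        "P143": imported_from,    # platform the data was imported from
        "P1476": title,           # title of the dataset
        "P407": language,         # language of the database
        "P577": publication_date.isoformat() if publication_date else None,
    }
    # Only keep properties that were actually supplied
    reference.update({p: v for p, v in optional.items() if v is not None})
    return reference


example = build_reference("P699", "DOID:0000000", date(2015, 10, 21),
                          stated_in="Q19832982")
```

Keeping retrieved (P813) inside the same block as the other properties also matches the earlier request on this page to have one reference per source rather than two.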

Odd edit to disease

Could you please explain what caused this edit? The bot removed the subclass of (P279) no label (Q15281399) and replaced the description with an inaccurate one. It's not just clearing out all subclasses, is it? Because that would probably be a big problem. --Yair rand (talk) 01:13, 27 October 2015 (UTC)

This was due to a bug in our bot code. When a term from the Disease ontology is added to Wikidata, it is set to be a subclass of disease. However, the term "disease" itself is a term in the Disease ontology. This wasn't covered in our bot. Now it is. I have reverted to the previous edit, and adapted the bot to cover this case. Thanks for reporting --Andrawaag (talk) 12:41, 27 October 2015 (UTC)
Why did it remove the existing "subclass of" statement, though? Does it normally do this? --Yair rand (talk) 17:50, 27 October 2015 (UTC)
Same thing happened again. More important than the odd addition is the removal of previous subclasses, though. --Yair rand (talk) 15:17, 18 November 2015 (UTC)
And now a third time... --Yair rand (talk) 18:33, 10 December 2015 (UTC)
@Yair rand: Thank you for letting us know. I am leaving a quick note to say that we are looking into it again. Andrawaag is traveling at the moment so I'm not sure when he would get a chance to respond. And we also need to confirm that he is receiving emails for changes to this page. More soon... Best, Andrew Su (talk) 19:52, 10 December 2015 (UTC)
@Yair rand: First, apologies. I was under the impression I had fixed the issue, and clearly I hadn't. The original safeguards are still in place, but I was running a tutorial last week, and I am afraid that caused some sloppiness on my part. I agree this should not happen, and I will change the bot code so that this case is specifically monitored. Currently, subclass overwriting is prevented by an attribute of a generic edit function in our bot. There are cases where you want to remove previous statements, since we are adding statements from original sources. But again, this should never happen with "subclass of" and "instance of", and updating "subclass of" properties deserves its own function call. --Andrawaag (talk) 20:26, 12 December 2015 (UTC)
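
Editorial sketch: a dedicated function for "subclass of" updates, as suggested above, might only ever add parents and never make an item a subclass of itself. The Python below is a hypothetical illustration, not the bot's implementation. `merge_subclass_of` is an invented name, and Q12136 is assumed here to be the item for "disease".

```python
from typing import Iterable, List


def merge_subclass_of(item_qid: str,
                      existing_parents: Iterable[str],
                      proposed_parents: Iterable[str]) -> List[str]:
    """Return the subclass-of (P279) values to keep on an item.

    Existing parents are never dropped, and an item is never made
    a subclass of itself.
    """
    merged = list(existing_parents)
    for parent in proposed_parents:
        if parent == item_qid:
            continue  # e.g. the "disease" term must not become subclass of disease
        if parent not in merged:
            merged.append(parent)
    return merged
```

With this shape, the case reported in this thread (the "disease" item losing Q15281399 and gaining itself as a parent) would leave the item unchanged.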

Microbial Genes and Proteins

We have proposed a plan for loading microbial genes and proteins to Wikidata using a similar, but Microbe-specific data model. A discussion concerning this has been initiated on the [Project Molecular Biology Talk Page]

RefSeq Protein ID (P637)

Can you have a look at the unique value violations of RefSeq Protein ID (P637) at [2]. Thanks. --Pasleim (talk) 09:56, 17 November 2015 (UTC)

Pasleim, I am on it. A first impression suggests that RefSeq identifiers can be shared by multiple items, e.g. (P0CW24 and P0CZ20) and (P0CV99 and P0CW01) --Andrawaag (talk) 11:27, 23 November 2015 (UTC)

Scientific articles listed as biological processes on protein items

Hi, I noticed that there are currently 117 protein items that have a statement with biological process (P682) where the value is an item about a scientific paper. I am very much looking forward to ProteinBoxBot being more closely integrated with Wikidata:WikiProject Source MetaData, but it seems we're on the wrong track here. --Daniel Mietchen (talk) 21:08, 23 November 2015 (UTC)

@Daniel Mietchen: Thanks for noticing this. There appeared to be a bug in the protein bot. This is fixed and the issue should be fully resolved soon.--Andrawaag (talk) 21:23, 24 November 2015 (UTC)

Reversing redirect?

Are this, and especially this, supposed to happen? In essence, what I'm talking about is the fact that initially, Q14877161 was a redirect to Q309513, but the bot repopulated Q14877161 and essentially blanked Q309513 (without even redirecting, though it probably shouldn't have edited the redirect, to begin with).  Hazard SJ  03:48, 3 December 2015 (UTC)

@Hazard-SJ: Apologies for the late reply. The two items you reference should actually be separate items -- one for the gene and one for the protein. This distinction is an important one biologically, and also for our lua script to populate our gene infobox on en wikipedia. Hope that makes sense! Best, Andrew Su (talk) 20:53, 10 December 2015 (UTC)


Hello, protein Q12039093 seems to be a component (property P681) of membrane ruffles. However, it links to item Q2204140, which represents the piece of clothing, not the molecular structure. What can be done about that? --Vojtěch Dostál (talk) 13:57, 11 December 2015 (UTC)

@Vojtěch Dostál: You are exactly right of course, and I've fixed this error. The edit that added that statement was from October -- before we added more rigorous type checking to our bot. So I believe new edits like this would not be made, but this example does highlight that we need to clean up any remaining examples of this. I've created a ticket to make sure that gets done soon. Thanks for pointing out this error! Best, Andrew Su (talk) 16:59, 11 December 2015 (UTC)
@Andrew Su: Thank you for your quick and kind answer! I am currently looking carefully into the protein-related data on Wikidata and would love to create a protein infobox for Czech Wikipedia that uses these data exclusively, without input from Czech Wikipedia. As Czech Wikipedia currently has no protein infobox whatsoever, it will be quite easy because we won't have to build on a preexisting structure. BTW: Do you know of any Wikipedia version that already uses WD for protein articles? Better to use something that already works than to reinvent things. Any help appreciated. Thank you! --Vojtěch Dostál (talk) 09:32, 12 December 2015 (UTC)
@Vojtěch Dostál: We have an infobox working off of wikidata that combines gene and protein information. This may solve your problem or at least should provide a good start. See: --I9606 (talk) 00:44, 14 December 2015 (UTC)
@I9606: Very cool! I cannot understand lua enough to know how you link correct gene to correct protein item. In a draft infobox I created, I use the P688 property to determine the protein which goes with a particular gene. However, many are not linked together properly, the data are probably dirty. Do you think your infobox module approaches this problem in a better way? --Vojtěch Dostál (talk) 10:14, 14 December 2015 (UTC)
@Vojtěch Dostál:We also use the encodes property to form that connection. The lua infobox I linked to there is our second version and I do think is the right way to go. Lua actually isn't that bad and it is really worth learning a bit of it if you want to do wikidata-driven infoboxes. Our first version only used the basic wikidata module but we ran into problems with error checking and general code organization. Would be awesome if you could use our stuff.. That would also allow you to benefit from our ongoing work in populating and maintaining data in wikidata according to a structure that fits this infobox. Any 'dirty' data you see about genes and proteins wikidata is on our bot's todo-list to clean up. --I9606 (talk) 16:48, 14 December 2015 (UTC)
I'll certainly try to adapt your system. --Vojtěch Dostál (talk) 10:44, 15 December 2015 (UTC)
Great, there are more examples of it working here: Let us know how it goes! --I9606 (talk) 17:19, 15 December 2015 (UTC)

Same problem here: Q5980114 --Vojtěch Dostál (talk) 09:59, 12 December 2015 (UTC)

@Vojtěch Dostál: Can you clarify the issue here? Not noticing the problem... Best, Andrew Su (talk) 01:08, 15 December 2015 (UTC)
@Andrew Su:@Vojtěch Dostál: Sorry, I fixed that one too early. The issue was that the Wikidata item had a GO term on it, but the interwiki links pointed to articles about a NATO radiofrequency band. Item revisions show that the interwikilinks were added first and the GO term got added a little bit after (all happened back in 2013), apparently because the person responsible for adding GO terms did string matching to the label, not a good idea in general. Sebotic (talk) 08:05, 15 December 2015 (UTC)
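
Editorial sketch: the problem described above (matching a GO term to an item by its label) can be avoided by matching on the external identifier instead. The Python below is a hypothetical illustration over an in-memory dict, not a call to any real Wikidata API. The item IDs and the GO ID `GO:0000000` are placeholders. Only the property number P686 (Gene Ontology ID) comes from this page.

```python
from typing import Dict, List, Optional


def find_item_by_go_id(go_id: str,
                       items: Dict[str, Dict[str, List[str]]]) -> Optional[str]:
    """Return the single item carrying this Gene Ontology ID (P686).

    Labels are deliberately ignored: two unrelated items can share a label,
    but only the right one should carry the identifier.
    """
    matches = [qid for qid, data in items.items()
               if go_id in data.get("P686", [])]
    # Refuse to guess when nothing matches or the match is ambiguous
    return matches[0] if len(matches) == 1 else None


items = {
    "Q_clothing": {"label": ["ruffle"], "P686": []},
    "Q_cell_part": {"label": ["ruffle"], "P686": ["GO:0000000"]},
}
```

Returning nothing on an ambiguous match leaves the decision to a human rather than attaching the statement to whichever item happens to share the label.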

Also, should I be bold and merge items Q14330621 and Q29548, or, similarly, Q21095088 and Q903634? --Vojtěch Dostál (talk) 15:36, 12 December 2015 (UTC)

@Vojtěch Dostál: On these potential merges, I would actually suggest we hold off. There are many examples of items that are highly-related but distinct, and incorrect merges might create problems downstream. We're building better infrastructure now to detect cases where two biomedically-related items should be merged based on ontology mappings. So I'm hopeful we'll have a more robust system in place soon. Best, Andrew Su (talk) 01:08, 15 December 2015 (UTC)

Thank you everyone. Although gene/protein data are in a bit of a mess right now, I have a good feeling that we are heading in a good direction. Some of the properties are already well-referenced, now we need to link the various items to the proper Wikipedia entries to make them more usable. --Vojtěch Dostál (talk) 10:44, 15 December 2015 (UTC)


Please don't add any more P:P1805 claims [3]. The property is marked as deprecated and will soon be deleted. Use instead World Health Organisation International Nonproprietary Name (P2275) --Pasleim (talk) 12:16, 27 December 2015 (UTC)

@Pasleim: Sorry for that, this happened just for testing purposes, I did not intend to add new values. I was about to write a bot which would transfer all of no label (P1805) to the new multilingual World Health Organisation International Nonproprietary Name (P2275), so P1805 could be deleted without losing data. But apparently, this conversion has already happened now. Thx! Sebotic (talk) 23:55, 28 December 2015 (UTC)

Stop undoing me

Why does it keep undoing this? Something's dodgy with this bot. -- numbermaniac (talk) 01:32, 12 January 2016 (UTC)

@numbermaniac: Sorry for that, I tried to move the interwiki link to a new item by undoing earlier changes, because the Wikipedia page deals with the enzyme class and not with the human specific type of this enzyme. Will create a new item manually. Sebotic (talk) 02:22, 12 January 2016 (UTC)
Oh, ok. Sorry, was not aware of that. -- numbermaniac (talk) 02:25, 12 January 2016 (UTC)


I saw this edit today and wanted to verify the Q15330504#P361 part. Yet despite clicking around quite a bit on Wikidata and searching through GO, I could not find that information. It should certainly be easier to do that — why not simply use reference URL (P854)? The problem is not specific to GO-related info but much broader and should thus be tackled more generally than just for this one example. --Daniel Mietchen (talk) 02:26, 29 January 2016 (UTC)

@Daniel Mietchen: Well, if you click on the Gene Ontology ID value (P686), this will bring you to AmiGO and there you can, by clicking on the "Graph View" tab, see the relations 'subclass of' and 'has part'. I can certainly include a direct link to the data item in the Ontology Lookup service, but the URL is quite long and we will end up with more info in the references than in the actual claim. My approach so far has been to point to the resource where the user can find the original data but keep the references as concise as possible while attempting to stick to the Wikidata referencing guidelines. I modified your example so you can see what it would look like. Adding a link to AmiGO, imho, does not add more information than if the user clicked on the Gene Ontology ID. But I agree that a ref url might be useful for machine readable single value references. Sebotic (talk) 02:48, 29 January 2016 (UTC)
Thanks — I could now verify that particular claim, and I agree that having "more info in the references than in the actual claim" would be odd, but perhaps this needs some more thought. --Daniel Mietchen (talk) 03:07, 29 January 2016 (UTC)
Looking at that diff, it seems that whoever is running this bot should read Help:Description. Descriptions are not supposed to be overly long nor written in full sentences. - Nikki (talk) 16:53, 16 February 2016 (UTC)

Please stop changing the capitalisation of water (Q283)

See Special:Diff/293725481 and Special:Diff/292137795. The correct label for water (Q283) is "water", please stop changing it to "Water". - Nikki (talk) 16:51, 16 February 2016 (UTC)

@Nikki: I adapted the bot so it will start all English chemical compound names lowercase. Best, Sebotic (talk) 06:32, 17 February 2016 (UTC)

Een Menselijk Gen

Please replace "Een Menselijk Gen" with "menselijk gen". I don't know who came up with the translation, but this is worse. Sjoerd de Bruin (talk) 17:09, 1 March 2016 (UTC)

That would be me. Will change to "menselijk gen". I was more in doubt between "humaan gen" and "menselijk gen", given "humaan" is also synonym to "humane". Either way, as said, I am fine with "menselijk gen", so will change --Andrawaag (talk) 17:18, 1 March 2016 (UTC)
Sorry for my harsh language, thanks. Sjoerd de Bruin (talk) 17:24, 1 March 2016 (UTC)
When will the items be updated? Sjoerd de Bruin (talk) 18:27, 4 July 2016 (UTC)
There are indeed some remaining. The current bot is already adapted to fix all those issues. We are currently working on a more automatic way of reporting these issues. We have paused the bots now to implement this workflow. I do expect this to be updated shortly though. --Andrawaag (talk) 19:14, 4 July 2016 (UTC)

Favism vs. glucosephosphate dehydrogenase deficiency

I don't know the source for the error, but you (the bot) created a new item yesterday no label (Q23897268) linked to MeSH ID D005955 (which certainly seems to be the right MeSH ID for that disease) but there was an already existing entry Glucose-6-phosphate dehydrogenase deficiency (Q848343) which included both that MeSH ID and another D005236 for Favism, a closely related illness but which is categorized separately by MeSH. Some of the wikipedia pages linked on Glucose-6-phosphate dehydrogenase deficiency (Q848343) seem to point to one condition, some to the other, or as with the English page, consider both to be the same thing. Along with creating the new item yesterday, ProteinBoxBot removed the D005955 id (and presumably other related ones) from Glucose-6-phosphate dehydrogenase deficiency (Q848343) with this edit. If it's felt that we need to split the two items then the label and some of the wiki links on Glucose-6-phosphate dehydrogenase deficiency (Q848343) need to be changed also; currently the two items seem to refer to the same thing by label, but by link to different things, and it doesn't make sense. ArthurPSmith (talk) 14:59, 22 April 2016 (UTC)

@ArthurPSmith: Thanks for reporting this issue. I did a first inspection of the bot, and it worked as it should have, except for the removal of the identifiers. Our bot only removes an external identifier if that reference was removed from the original source (in this case the Disease Ontology). The bot is implemented to create or update a wikidata item based on a set of existing (disease) properties. I need a bit more time to see whether this issue needs to be resolved by fine-tuning this approach, or whether it is an issue that needs to be discussed with the curation team of the Disease Ontology. I will report back shortly. --Andrawaag (talk) 05:43, 23 April 2016 (UTC)
@ArthurPSmith: Even though Favism is widely used as a synonym for G6PD deficiency, that is not exactly the case. Not all individuals with G6PD deficiency exhibit symptoms when they ingest fava beans; it depends on the variant (see ), which means that Favism is a subclass of G6PD deficiency. The German Wikipedia has two separate articles for the two conditions, one for G6PD deficiency and one for Favism. The Wikidata item from the German article for Favism favism (Q1398913) looks a bit problematic right now. no label (Q23897268) and Glucose-6-phosphate dehydrogenase deficiency (Q848343) should be merged again, but MeSH ID D005236 and the alias of Favism need to be moved to the Wikidata item linking to the Favism article in the German Wikipedia favism (Q1398913). What are your thoughts? - Emitraka (talk) 10:57, 23 April 2016 (UTC)
That makes sense - I'm definitely not an expert on these diseases; I ran across it because in the process of these changes we somehow had a subclass-of-self relation created, one of the class relationship loop types that I've been working on recently. So if you can fix it as you suggest above then that makes sense to me. Go ahead. ArthurPSmith (talk) 17:48, 25 April 2016 (UTC)
@ArthurPSmith: All entries edited/merged. References will be added on the next bot run. Thank you again for bringing it to our notice. Cheers Emitraka (talk) 19:45, 25 April 2016 (UTC)

subclass relationships

Best practice regarding subclass relationships is that only the most-specific class relationship is added to an item. (Exception: differing sources place an item in a different place.)

Please fix this bot such that it does not add both e.g. "disease" and "viral disease", but only "viral disease".

This applies across the whole set of disease-related items. --Izno (talk) 17:01, 26 May 2016 (UTC)

@Andrawaag, I9606: Since I know you are active, could you please respond to this comment I left nearly half a year ago? --Izno (talk) 18:02, 18 October 2016 (UTC)
@Izno: It is indeed a valid point you make. Initially, we only stated it being a subclass of disease, but since the child classes are now included as well, it makes sense to store only the most-specific class relation. I will adapt the scripts accordingly. --Andrawaag (talk) 15:38, 19 October 2016 (UTC)
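
Editorial sketch: keeping only the most-specific class, as requested above, amounts to dropping every candidate that is an ancestor of another candidate. The Python below is a minimal illustration under the assumption that the class hierarchy is available as a child-to-parents mapping. The function names and the example hierarchy are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def ancestors(cls: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """All classes reachable from cls by following subclass-of upwards."""
    seen: Set[str] = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def most_specific(classes: Iterable[str],
                  parents: Dict[str, Set[str]]) -> List[str]:
    """Drop every candidate class that is an ancestor of another candidate."""
    candidates = list(dict.fromkeys(classes))  # de-duplicate, keep order
    redundant: Set[str] = set()
    for cls in candidates:
        # Exclude cls itself so a subclass-of-self loop does not remove it
        redundant |= ancestors(cls, parents) - {cls}
    return [cls for cls in candidates if cls not in redundant]


hierarchy = {
    "viral disease": {"infectious disease"},
    "infectious disease": {"disease"},
}
```

For the example in this thread, `most_specific(["disease", "viral disease"], hierarchy)` keeps only "viral disease". Classes from differing sources that are not ancestors of each other are both kept, which matches the exception mentioned above.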


Notified participants of WikiProject Ontology. It's actually quite handy for query purposes to have a way to define the nature of an item (e.g. that it is a disease) without having to traverse a class hierarchy. Should we consider using "instance of" relations for this purpose, e.g. breast cancer instance of disease? Or perhaps we should add a new property 'semantic type' that allows for this kind of broad grouping of items?  – The preceding unsigned comment was added by I9606 (talk • contribs).

Support As this is being done already with chemical compounds, I support using 'instance of'. A classification is very helpful: the larger the number of items of a certain class gets, the more essential it is to have a classification, especially if no unique identifier covers all of these items of the same type (as would NCBI entrez gene ID, Uniprot ID). Sebotic (talk) 18:00, 19 October 2016 (UTC)
I am unsure you understood the question; these are not instances (of cancer), they are clearly subclasses. The fact that you are using instance in the gene domain (and elsewhere c.f. books and etc.) does not make it the correct solution. That said, I know for example that @TomT0m: advocates a wide use of OWL punning (I support a more limited use) which could use the instance relation. --Izno (talk) 18:04, 19 October 2016 (UTC)
@Izno: Looking at the definition of disease, I'm not totally sure a specific disease is always a subclass of disease, as the term is polysemous. For example, take the phrase "Disease is often construed as a medical condition associated with specific symptoms and signs." I understand this as: a disease is a specific kind of condition. Say I have a cold. I have a condition. Then "cold" is a (subclass of) condition. A cold has some symptoms and a cause. Then if a disease is characterized by some symptoms, the "cold" item will specify these symptoms ...
Say a symptom "fever" is defined as "high temperature"; then if I have a high temperature, I qualify for this symptom. If I qualify for all the "cold" disease symptoms, I qualify as having a cold. But what qualifies as being a disease is not me having the symptoms, it's the symptoms themselves ... I would probably be a "type" (as in taxonomic types) if I were the first person to catch this disease.
< disease > has quality (P1552) < symptom >
< Cold > has symptom < fever >
I would advocate for
< Cold > instance of (P31) < Disease >
and NOT
< disease > subclass of (P279) < condition >
< condition > subclass of (P279) < disease >
- this would make disease essentially a condition metaclass. author  TomT0m / talk page 18:38, 19 October 2016 (UTC)
There are many kinds of colds--this is something the ontology from which the bot is importing recognizes at least. Or, better, for example, there are many kinds of dengue fevers (named), usually differentiated by their specific symptoms or duration or carrier and etc. These are subclass relationships with the broad "fever as a disease". --Izno (talk) 19:25, 19 October 2016 (UTC)

Avoiding junk edits

Would it be possible to avoid edits like this one? All it did was update the accessdate. Unless the source database has changed, the bot should avoid these edits. --Izno (talk) 17:05, 26 May 2016 (UTC)

+1 Sjoerd de Bruin (talk) 17:25, 26 May 2016 (UTC)
@Andrawaag, I9606: Since I know you are active, could you please respond to this comment I left nearly half a year ago? --Izno (talk) 18:02, 18 October 2016 (UTC)
@Izno: We have recently implemented features in our bot to reduce this type of zero-update edit, where only the access date has changed. So yes, these can be avoided. However, a reference with a retrieved property set to a date some time ago suggests that the statement hasn't been synced with the source for quite some time. Indicating the up-to-date state of that statement requires this type of "0"-edit from time to time. But I agree, this should not happen on every update cycle. We are currently considering a quarterly full update cycle (with 0 edits) and fast-run update cycles where statements are only updated if other values changed as well. --Andrawaag (talk) 18:24, 18 October 2016 (UTC)
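The update policy described in this comment (write on a real change, otherwise refresh the timestamp only once per full cycle) might be sketched roughly as follows. This is not the bot's actual code; the function name, its parameters, and the 90-day reading of "quarterly" are assumptions made for illustration.

```python
from datetime import date, timedelta

# Assumption: "quarterly" is approximated as 90 days.
FULL_REFRESH_INTERVAL = timedelta(days=90)


def should_write(existing_value, new_value, last_retrieved, today,
                 refresh_interval=FULL_REFRESH_INTERVAL):
    """Decide whether a statement should be written (hypothetical helper).

    A write happens when the value changed, or when the value is unchanged
    but the stored 'retrieved' date is at least one refresh interval old.
    """
    if existing_value != new_value:
        return True  # real change: always write
    # Unchanged value: only refresh the timestamp once per interval.
    return today - last_retrieved >= refresh_interval
```

With this rule, a fast run on an unchanged value a few weeks after the last sync writes nothing, while the same value is re-stamped once the interval has elapsed.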
I don't see the value regardless. Someone is free to review the site if they see that the accessdate is old (which is why these are added). I would strongly stress that these should be avoided at all times, unless the value actually changes. And if it changes, what should probably happen is a new claim created, with the old claim qualified with start/end times. (And the new one should get a start time, I suppose.) --Izno (talk) 17:32, 19 October 2016 (UTC)
I assume that the concern about having edits with low significance is that it obscures the edit histories. (Please correct me if I'm wrong there.) From my perspective, the edit histories are already extremely difficult for humans to parse regardless. The problem isn't with 'junk' edits, it's with the user interface for viewing them. Having a 'last updated' signal as part of the data is very useful and seems like a quite reasonable use of the API. Perhaps one way to alleviate the symptom is to stamp the edit with a clearer description of what happened, e.g. in this case 'updated timestamps for properties XYZ'. I think the bot could definitely do a better job in that respect. --I9606 (talk) 17:48, 19 October 2016 (UTC)
I agree that the UI should more clearly indicate that only a source/qualifier changed on a certain claim rather than the entire claim, but I don't know if anything can be done about that since I suspect either the diff-er or the bot making the edit or both are processing at the statement array level rather than at the sub-statement level. My main concern is when it pops up on watchlist and in recent changes--while I can trivially remove bot edits, this is not the only bot and certainly not the only bot that should have an eye kept on it. There is the edit history concern--it produces a lot of cruft and so it is difficult to tell which edits actually made a change and which ones did not. A better edit summary would help, but I, and I suspect others, would clearly prefer never to see the edit. See also the principle at en:WP:COSMETICBOT--while we don't operate by en.WP's rules, there is clearly-enough a strong preference to ban edits which don't result in a difference to the user (and while changing the access date probably doesn't qualify under such in the Wikipedia frame, on the Wikidata frame it probably could). --Izno (talk) 17:59, 19 October 2016 (UTC)
@Izno: If you specifically refer to the edits to GO terms happening currently, these are actually some 500 real writes out of 25,000 potential edits. So my mechanisms to reduce these 'empty' writes have reduced the number of these writes already by 98%. The ones remaining are due to description length conflicts between Wikidata and Gene ontology and can hardly be avoided at this point. Sebotic (talk) 18:56, 18 October 2016 (UTC)
No, the edit I linked to is from 8 months ago. That said, what do you mean by description length conflicts? Is your bot overwriting descriptions? It probably shouldn't be. --Izno (talk) 17:32, 19 October 2016 (UTC)
@Izno: I overwrite the descriptions of Gene Ontology items, because Gene Ontology is the authoritative instance here. Sebotic (talk) 17:50, 19 October 2016 (UTC)
Yes, that's a problem IMO. I would expect the description to be set once and then not/never again, not least because there are intersections between that domain and less-specialist domains. --Izno (talk) 17:59, 19 October 2016 (UTC)
@Izno: Well, Gene Ontology is not some static resource where you import things once and then you're fine. It's constantly evolving, there are weekly releases/updates, and this also means that descriptions change over time. So if they are imported once and then left as they were, this just makes the whole resource useless. E.g. if we had left GO data as they were since the first imports in 2013, it would be completely outdated by now. So I think using GO descriptions is a good enough compromise, as there is also no way to programmatically determine if a user contribution is better than the original GO description, especially if you think about vandalism. What would be your ideas about that? Sebotic (talk) 18:24, 19 October 2016 (UTC)

Descriptions are for disambiguation. I expect that having simply the name of the entity as the label, and subsequently either its class or instance is sufficient for a first-pass and only change. This means descriptions on the level of "gene", "protein", and etc. are sufficient.

Regarding vandalism, I think that's a non-case--it's not the purpose of this bot, and quite frankly, that's just a bad rationale. --Izno (talk) 18:31, 19 October 2016 (UTC)

@Izno: That's exactly what I do, I disambiguate a term with descriptions of a maximum length of 250 characters, as stated on the description help page. And for that, as a Wikidata user, my opinion is that the best way to do it is to use the Gene Ontology term descriptions for Gene Ontology terms. That said, I do that only for Gene Ontology terms; for proteins, I use a different approach, involving the species and Uniprot ID. Classification with subclass of and instance of is a completely different topic. Sebotic (talk) 18:51, 19 October 2016 (UTC)
You should never need 250 characters to disambiguate an entity, which is why I brought up the instance/subclass--which are usually sufficient for such. Have you found them not to be? --Izno (talk) 19:22, 19 October 2016 (UTC)
@Izno: Well, I see your point that you'd like to have short descriptions, and technically, in order to prevent Wikidata from complaining about same label/description, it is sufficient to add e.g. 'Gene Ontology term' as a description. But semantically, imo, for Wikidata users, it can be hard to see what the difference between 2 Gene Ontology terms is, if they don't have a proper description. That said, I would like to have everything that's currently in a GO term description as machine-readable statements on the item itself. But to achieve that, we have a long way to go, and need many properties and items which talk about the things contained in the descriptions. Replacing them with a generic description at this point does not make things better, imo. Sebotic (talk) 21:23, 19 October 2016 (UTC)


As per Property:P932#P1630, the PMCID should be given without the leading "PMC". I have fixed that for a few of the items you set up. Please change the bot accordingly. Thanks! --Daniel Mietchen (talk) 18:16, 9 June 2016 (UTC)
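A minimal sketch of the normalization requested here, assuming the bot receives PMCIDs as strings that may or may not carry the "PMC" prefix. The helper name `normalize_pmcid` is hypothetical and not part of the bot.

```python
def normalize_pmcid(raw):
    """Strip an optional leading 'PMC' so only the numeric part remains."""
    value = raw.strip()
    if value.upper().startswith("PMC"):
        value = value[3:]
    if not value.isdigit():
        # Anything left that is not purely numeric is rejected rather than guessed at.
        raise ValueError(f"not a PMCID: {raw!r}")
    return value
```

Applied before writing the statement, this would store "1234567" for both "PMC1234567" and "1234567".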

Adding underscores to labels

The bot is making some strange mistakes. See diff. --Yair rand (talk) 06:52, 20 June 2016 (UTC)

@Yair rand: I have noticed that, and it does it only with this particular item because for some strange reason, the data source has it as a label, as you can see here: [4] I will think about how to solve this, but it's an annoying exception. Best, Sebotic (talk) 17:03, 22 June 2016 (UTC)

Logged off

In case it's you who is editing as a logged-off bot, please log in. Thanks. --Pasleim (talk) 09:19, 24 June 2016 (UTC)

If it was a time write, it was me and the bot has been logged in since. --Andrawaag (talk) 14:23, 24 June 2016 (UTC)

How to deal with dead link references in wikidata?

The wikidata for Osteoporosis (Q165328) has the following URL for a number of "references": archive URL:

That link results in "404: Not Found". Not a big thing for the English WP, as the reference doesn't show up on the article page. However, that reference shows up on the article page in the Arabic WP; see:

I tried putting dead link at the end of the url, but wikidata sees it as a corruption of the url and won't accept the change. Should I just try the "remove" option? (I'm not even sure that would work).

The reason I'm asking you is that you (more precisely, your bot) apparently added that nonfunctioning URL to the OMIM ID on April 21 of this year. (

P.S. that nonfunctioning URL shows up as "reference" for ICD-10, DOID, ICD-9, and OMIM ID Identifiers.

P.P.S. I misspoke when I said that your bot had "added" the reference on 21 April. It had merely "updated" it, but had not actually changed it at all. It looks like the ProteinBoxBot had started it all on 24 January 2016, when it added which led to an incomprehensible (to me, anyway) data dump of sorts. See

Since that addition, your bot has "updated" several times, each one leading to the same data dump until the update on 11 March which leads to the 404 error. Your bot has then "updated" that a few more times (but with no change to the address). I'd say the bot has got some problems.

Thank you for your help. --Akhooha (talk) 18:11, 4 July 2016 (UTC)

Thanks for raising this issue. For diseases, the bot considers the Disease Ontology to be key. The Disease Ontology is updated regularly, and the updates are maintained on GitHub. What the bot does is nothing more than download the latest DO update from GitHub, use the URL of the specific GitHub version as the reference and then update all diseases in Wikidata. So the file you have seen is actually the full machine-readable file used by the bot to update diseases in Wikidata. It is strange that the GitHub reference is a dead link. I will investigate to see what is going on here. Any thoughts on how you would prefer the "data dump of sorts" to render? --Andrawaag (talk) 20:31, 4 July 2016 (UTC)
Thanks for your quick reply. As to how I would prefer the "data dump of sorts" to render, I have no idea. It would seem that it was not meant to be viewed by the casual WP reader anyway, as it is not something normally encountered on article pages.
Which brings me to what is perhaps more important. The wikidata "reference" is somehow insinuating itself on Arabic WP articles. As I mentioned before, it shows up on the Arabic version of the article for "Osteoporosis". If you cannot read Arabic, to give you an idea, here are screen shots showing how it shows up as footnotes in the wikidata template, and in the "references" section:
(wikidata template intrusion)
(reference section intrusion).
This apparently is not an isolated case, as the same story happens with the Arabic version of the article on "Arthritis":
(wikidata template intrusion)
(reference section intrusion)
I don't know whether this "leak" from wikidata references is an Arabic WP issue, or whether it is a wikidata issue, but I think it should be looked into, as the "references" are pretty much useless to the casual reader. By the way, the "archive url" ( (as stated before) results in a 404 error, and the primary url ( results in the following message: "The server at is taking too long to respond.". Very weird. Thanks for your reply. Hoping you can get to the bottom of this.
P.P.S. To muddy the waters a bit further, it would seem that ProteinBoxBot's addition of the error 404 URL doesn't have anything to do with the "leak" from wikidata to the Arabic WP. I investigated a bit and found that the same intrusion on the page template goes back to when it was first placed on the Arabic "Osteoporosis" page on 19 June 2011. And it's even a bit stranger still --- the "archive url" (the current one that results in 404) shows up on the 19 June 2011 page exactly the same as it is showing up on today's page, in spite of the fact that it has undergone numerous "updates" since 19 June 2011.--Akhooha (talk) 01:24, 5 July 2016 (UTC)

Akhooha (talk) 22:43, 4 July 2016 (UTC)

@Akhooha when you say "incomprehensible" what are you referring to? The link goes (should go) to the raw owl file of the Disease Ontology, see breast cancer for example. The references for NCI Thesaurus IDs go to the DO file. I have no issues accessing the primary url (, maybe there was a server problem when you tried it. -Emitraka (talk) 13:45, 5 July 2016 (UTC)
@Emitraka : What I mean by "incomprehensible" is that I couldn't really make sense of it. However, since it's apparently "raw data", this is not surprising, and the presentation of it is not an issue I'm pursuing, as it seems that's the way it's supposed to look and the average casual WP reader does not see this data. By the way, today I tried ( and the data came through --- although the URL had changed in the address bar to "". Strange.... In any case, I hope someone is able to figure out why these wikidata references (including the "archive url" that leads to a 404 error) seem to be leaking into the articles in the Arabic WP --- this will confuse and frustrate readers of the medical articles in the Arabic WP (it seems to be the case in all the Arabic medical articles I've looked at that use that wikidata template). There is definitely a problem with the way the Arabic WP and wikidata are interfacing with one another. Akhooha (talk) 15:32, 5 July 2016 (UTC)
@Akhooha thank you for clarifying that. -Emitraka (talk) 15:40, 5 July 2016 (UTC)


As far as the problem with the display in the templates in the Arabic WP, that problem has been solved. I contacted the Arabic WP, and Mr._Ibrahem altered the template so that links to wikidata "reference" material do not show up on the article page.

I notice that Wikidata is still showing an "archive URL" of which results in "404: Not Found" (see ICD-10, DOID, ICD-9, and OMIM ID Identifiers for Osteoporosis (Q165328) ). Since this is not showing up on any pages, it's probably not considered a problem, but nevertheless I think it's something that you might want to look into. Thanks everyone for your replies. Akhooha (talk) 21:09, 8 July 2016 (UTC)

Item for drug - disease interaction

Recently there was concern on Project Chat over drugs where the Cochrane Collaboration considers them ineffective for a particular disease but where the link is filled with "drug used for treatment" and "medical condition treated".

ProteinBoxBot integrates data from ChEMBL. ChEMBL has data about "Max phase for indication". In the list it lists drugs that passed a stage 2 clinical trial but that haven't passed stage 4. That data is currently not imported into Wikidata. Information about FDA approvals is also not included.

If we take as a example risperidone (Q412443) medical condition treated (P2175) no label (Q18557827) I think it would be worthwhile to add main subject (P921) "Risperidone treatment for cocaine abuse".

The new item can then contain information about FDA approvals for the conditions and the clinical trial stage that's passed. Having a separate item would also make it easier to link other items like clinical trials to the interaction. ChristianKl (talk) 15:15, 28 August 2016 (UTC)

@ChristianKl: I think this is generally a good idea to capture that additional information, but I'm not immediately sure how to model it properly. In your example, what item would the main subject (P921) be attached to? It seems you might want to reverse things a bit. If we created an item for the interaction itself first, then we could have properties that link the interaction to its constituents (e.g. the drug and the disease) and then to the characteristics of the interaction itself. But that expands the size and complexity of the graph pretty quickly. Perhaps there is a way that we could add a qualifier pattern that could be used to capture this information, e.g. (drug a treats disease b) with qualifiers like (in context 1,2,3) and (approval phase x, y, z)? I'm sure @Sebotic, Andrawaag: will have something to say here. --I9606 (talk) 16:44, 29 August 2016 (UTC)
I would use it as a qualifier. I just looked at the usage of shares border with (P47) within Germany (Q183). It seems like statement is subject of (P805) is used there when I remembered main subject (P921) being used. I would create a new item "Risperidone treatment for cocaine abuse" and then create the 5D statement: risperidone (Q412443) medical condition treated (P2175) no label (Q18557827) statement is subject of (P805) Risperidone treatment for cocaine abuse (Q26715035).
Afterwards the approval phase doesn't have to be within risperidone (Q412443) or no label (Q18557827) but can be within Risperidone treatment for cocaine abuse (Q26715035) as a simple 3D statement. ChristianKl (talk) 17:27, 29 August 2016 (UTC)
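To make the "5D" versus "3D" distinction concrete, the proposal can be written down as plain data. This is only an illustrative sketch: the `Statement` and `Qualifier` classes are hypothetical and not Wikibase's data model, and `P_PHASE` and `phase 2` are placeholders for a property and value that did not exist at the time of this discussion.

```python
from dataclasses import dataclass, field


@dataclass
class Qualifier:
    prop: str
    value: str


@dataclass
class Statement:
    subject: str
    prop: str
    value: str
    qualifiers: list = field(default_factory=list)


# "5D" statement: subject, property, value, plus a qualifier property and value.
treats = Statement(
    subject="Q412443",   # risperidone
    prop="P2175",        # medical condition treated
    value="Q18557827",   # the condition item from the example above
    qualifiers=[Qualifier("P805", "Q26715035")],  # statement is subject of -> interaction item
)

# "3D" statement on the interaction item itself.
# P_PHASE is a placeholder, not a real property ID.
phase = Statement(subject="Q26715035", prop="P_PHASE", value="phase 2")
```

The qualifier on the first statement is what links the drug-disease claim to the separate interaction item; everything about the interaction then lives on that item.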
@ChristianKl: Making an item for the interaction itself would make sense if a related Wikipedia entry can be drafted on the merits of the interaction alone. I doubt that a bit. I would expect that those types of interactions will typically be described in the Wikipedia page describing the disease and/or the drug. But then again I am not a pharmacist. Maybe we should try to model it by example, i.e. hand-picking 2-5 exemplary drug-disease interactions, which are then fully modelled into Wikidata by hand? Doing so, we might discuss every quirky detail of that model on the respective talk pages. Having such a hand-drafted model also makes it easier to update/create the bots needed to add this interaction information. --Andrawaag (talk) 18:54, 29 August 2016 (UTC)
I don't see a reason why a Wikipedia article should be required for having an item. I think there are valid reasons for having drug-disease items.
Clinical trials study drug-disease interactions. OpenTrials wants to release data on all clinical trials under an open license. It will be possible to integrate that data into Wikidata sooner or later and it's useful to be able to link those trials to the specific interaction.
The FDA and EMA specifically approve a drug for treating a specific disease. Qualifiers like start time (P580) and end time (P582) seem to be useful at this point.
There are rating scores like the one on for drug-disease interactions. I don't know about the exact copyright status of those numbers but they could be theoretically useful to have. With CureTogether and PatientsLikeMe the situation is similar.
It's possible that various EHRs have data about how often a drug is prescribed for a certain condition.
Various insurance providers decide whether or not they cover the usage of certain drugs for certain conditions.
Cochrane has articles on drug-disease interactions like
Most of the things I spoke about will need new properties so it's not possible to model them fully on examples at the moment. I would advocate to start by having items with a minimum amount of information and then slowly expand. ChristianKl (talk) 20:08, 29 August 2016 (UTC)
@ChristianKl: Having a Wikipedia article on that topic I wouldn't state as a requirement, more a rule of thumb. In the end Wikidata is the linked database of Wikipedia and its sister projects. My concern is more with the number of Wikidata items that it might generate. A different approach could be the one currently being investigated by @Egon Willighagen: , who is enriching Wikidata with metabolites from WikiPathways. Instead of pushing pathways from WikiPathways to Wikidata, pointers (chemical identifiers) are added to Wikidata, which can subsequently be used to link Wikidata concepts with pathway concepts.
Why not consider the proposals for new properties part of the modeling phase? Here is for example the modeling sprint we did for genes --Andrawaag (talk) 10:38, 30 August 2016 (UTC)
Okay, that makes sense. I started with ChristianKl (talk) 13:39, 30 August 2016 (UTC)
@ChristianKl, Andrawaag, I9606: The development stage qualifier is not there, because I did not yet add it. But that's not a big deal, I will take care of that. For the case of drugs which turn out not to be efficient for a certain indication, I would assume that they will be phased out and this information will make it into Wikidata from the original source via bots or via users. Regarding the drug-disease interaction items: Even if there is a perspective that some of the data will be available under an open license at some point, the properties required to represent this info do not exist yet (e.g. if you go to a NEJM drug study publication, you will find lists of dozens of side effects, efficacies applying to different patient populations, kinetics, etc. Modeling that on one item will be a major effort and requires well curated data). If we created these interaction properties, we would create ~6000 new interaction terms which have a label and some categorization (e.g. subclass of, instance of), but that's it. So currently, these are of very limited use. I would argue that we leave it as it is currently. If datasets become available that justify creating separate interaction items, that's fairly easy to achieve. Using 'subject of' to establish the link seems like a way to go. Generally, I agree that drug-disease interactions are so complex that they might require one or more items describing them. But I think what should be avoided is introducing items which will never be properly populated with data because either no good sources are available or WD users never start assembling the data to make WD the best source. Certainly, this might create a bit of a chicken and egg problem, but the proposed model above (5D) would allow flexible addition of drug-disease items, as data becomes available (and WD properties) or users decide to create/populate these drug-disease interaction items.
We should also look into how ontologies model these things; three major ontologies are available: Drug Interaction and Evidence Ontology, Drug-Drug Interactions Ontology and Drug Ontology. Btw: Drug-disease interactions are only one of several interactions with similar problems. Sebotic (talk) 09:11, 30 August 2016 (UTC)
ChEMBL does provide "Max phase for indication" that can be included with a "point in time" qualifier. I will try to find out the exact definition of that phrase and propose a property for it. I think that would be enough to make the creation of entries valuable. It has the advantage of not duplicating that information, which would happen if it's stated as a qualifier. Do you think that information would be enough for the start?
ContentMine currently has the WikiFactMine proposal going on and ContentMine also does the datamining for the OpenTrials project, so I would expect that content will become available. The Primary Sources tool doesn't create new items so it would be valuable to have items in place at that point in time. ChristianKl (talk) 11:07, 30 August 2016 (UTC)
@ChristianKl: I think we can even use the use (P366) property, together with 4 items as stages, e.g. stage 4 compound in drug development. But I am also happy to support any property proposal you would like to do here. Sebotic (talk) 17:29, 31 August 2016 (UTC)
I think it makes more sense to have a specific property. I started with the first under : . If you agree with that property, adding a comment supporting it might speed up the process.
When it comes to the number from ChEMBL I would like to first try to understand better how they produce the number. ChristianKl (talk) 22:43, 31 August 2016 (UTC)

The drug / substance distinction

In trying to model drug-disease interaction I noticed the problem that drug approvals are usually for drugs and not just for substances, and Wikidata currently only has information about substances. Both the EMA and the FDA seem to provide information in the drug approval process about the drug name and the relevant substance (they even provide an ID for the substance). I think it would be great if that information could be imported. There is also a good chance that the information could be used in Wikipedia. It would allow the creation of a template to automatically list all the commercial drug names behind a substance. ChristianKl (talk) 09:46, 1 September 2016 (UTC)

@ChristianKl: By substance, you mean chemical compound (chemische Verbindung, auf deutsch)? Because there is a distinction between chemical compound and chemical substance. Compound is the pure form of a certain chemical compound and substance is a mixture of 2 or more compounds which may or may not have a set of known chemical properties. I am aware of the fact that we should also represent the chemical substances approved by FDA or EMA as separate items (the has part property will work here) for different formulations of a drug. Certainly, the approval process is for an active chemical compound + salt form + dosage + patient group + disease + potentially a ton of other criteria. This is on my agenda, it's just a little bit difficult to get a reliable data source for these things. NDF-RT would work, but it is frequently out of date. They give good formulation data for the USA, but frequently lack the indication for newer drugs. Sebotic (talk) 19:08, 1 September 2016 (UTC)
I mean the difference between Provigil and Modafinil. Provigil is a drug that has "Cephalon, Inc" as a producer. Modalert also contains modafinil but it gets produced by Sun Rise International Labs Ltd. Modapro also contains the substance modafinil and is produced by Cipla Limited.
The FDA/EMA gives the generic drug its own approval. Fortunately they list the substance (e.g. Modafinil) of the drug along with the drug name (e.g. Provigil) in the approval. ChristianKl (talk) 20:06, 1 September 2016 (UTC)

ExAC data?

I think the ExAC data could be interesting, both in terms of simple identifier mapping (they link out to a good number, as well as to enwiki, and we might just create some ExAC ID property) but probably in a more profound sense as well, especially in conjunction with information available from places like ClinVar. --Daniel Mietchen (talk) 06:08, 8 September 2016 (UTC)

@Daniel Mietchen: I agree that this is a very interesting and important dataset, but on the one hand it is vast (7,404,90 variants), on the other hand, we have not found a good way to represent genetic variants in Wikidata, except creating a new item for each and every one of them, which is not a very viable strategy, imho. I have a few ideas how this could be achieved, though. Sebotic (talk) 17:19, 9 September 2016 (UTC)
Seems like a good time to lay out these ideas here. Happy to follow up when we meet in person. --Daniel Mietchen (talk) 00:36, 10 September 2016 (UTC)

"ratten gen"

Please correct the Dutch description "ratten gen" to "rattengen", thanks. Sjoerd de Bruin (talk) 08:33, 13 September 2016 (UTC)

That change is in the pipeline, to be changed shortly. The general Dutch description for all species will be "Gen van de soort .........", where ...... will be the scientific name (homo sapiens, rattus norvegicus, etc.). The change has now been made on human genes; mouse and rat gene items will follow shortly. --Andrawaag (talk) 10:17, 13 September 2016 (UTC)

molecular function (P680) molecular function (Q14860489)?

Hello, I think property molecular function (P680) with the value molecular function (Q14860489) is wrong (eg. BSCL2, seipin lipid droplet biogenesis associated (Q21100511), Nestin (Q3874932), etc...).

SELECT ?item ?itemLabel WHERE {
	?item wdt:P680 wd:Q14860489 .
	SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}

--Okkn (talk) 10:11, 18 September 2016 (UTC)

@Okkn, I9606: Thanks for your message! I checked it, and the problem seems to be with QuickGO, not with our bot code. See here for the 2 protein examples you gave, Seipin and Nestin; it also applies to the others. I can try to modify my bot code so it will not add these upper level terms and do full bot runs. I will also contact QuickGO, to clarify if this is intentional or a mistake. I agree that these should not be there. Sebotic (talk) 09:21, 19 September 2016 (UTC)
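One way the proposed fix (not adding the upper-level terms) could look is a filter applied to annotations before they are written. This is a sketch only: the annotation shape (a dict with a `go_id` key) and the helper name are assumptions, not the bot's real data structures. The three GO IDs are the ontology's root terms.

```python
# The three Gene Ontology root terms.
GO_ROOT_TERMS = {
    "GO:0003674",  # molecular_function
    "GO:0008150",  # biological_process
    "GO:0005575",  # cellular_component
}


def drop_root_terms(annotations):
    """Return only the annotations whose GO ID is not one of the root terms."""
    return [a for a in annotations if a["go_id"] not in GO_ROOT_TERMS]
```

A protein annotated only with a root term would then end up with no statement for that property, instead of a statement that says nothing specific.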

Bad data importation

The bot is not importing the correct data for some EINECS number: see mesotrione (Q409390) as example. The parsing is not good. Snipre (talk) 22:37, 10 October 2016 (UTC)

@Snipre: This is directly taken from the FDA UNII raw data, and of the 7500 EINECS in WD, this occurs 4 times. This is also like it's stated in PubChem. Furthermore, in the 82k UNII entries, it occurs only 6 times. I will take care to filter them out, but that means that there will just be no EINECS number then. Sebotic (talk) 23:04, 10 October 2016 (UTC)
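The filtering mentioned here could be a simple format check before import. The exact shape of the malformed values is not shown in this thread, so the sketch below only assumes the usual EINECS layout of three digits, three digits and one check digit; the helper name is hypothetical, and the check digit itself is not verified.

```python
import re

# Assumed layout of a well-formed EINECS number: NNN-NNN-N.
EINECS_FORMAT = re.compile(r"\d{3}-\d{3}-\d")


def clean_einecs(raw):
    """Return the trimmed value if it matches the expected layout, else None."""
    value = raw.strip()
    return value if EINECS_FORMAT.fullmatch(value) else None
```

Values that fail the check come back as None, which corresponds to the outcome described above: the item simply gets no EINECS number.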

Wikidata:Database reports/Constraint violations/P591

It seems that many of the constraint violations were done by this bot. --Leyo 15:08, 17 October 2016 (UTC)

The constraint violations happened, because the regex pattern for detecting correct EC numbers is wrong. EC categorizations are up to 4 levels deep, if only the uppermost level should be indicated, the others need to be specified with a hyphen, e.g. 3.-.-.- The regex does not account for that. But there is indeed an issue with these EC numbers, for all which do not have the 4 level string. e.g. 3.-.,, the formatter url will fail. The problem here is primarily the original resources, as this is how they do it. I will do 2 things, I will change the regex pattern in the property, and I will also change my bot in a way so it always has 4 level deep EC numbers. Cheers, Sebotic (talk) 17:38, 17 October 2016 (UTC)
OK, thanks. --Leyo 18:37, 17 October 2016 (UTC)
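The two changes announced above (a pattern that accepts hyphens for unspecified levels, and always emitting four levels) can be sketched as follows. This is not the regex actually placed on the property nor the bot's code; `EC_FULL` and `pad_ec_number` are hypothetical names, and the pattern assumes hyphens may only appear as a trailing run.

```python
import re

# Four dot-separated levels: digits first, then hyphens for unspecified levels.
EC_FULL = re.compile(r"^\d+\.(\d+\.(\d+\.(\d+|-)|-\.-)|-\.-\.-)$")


def pad_ec_number(ec):
    """Pad a partial EC number with hyphens so it always has four levels."""
    levels = ec.strip().split(".")
    if len(levels) > 4:
        raise ValueError(f"too many levels: {ec!r}")
    levels += ["-"] * (4 - len(levels))
    padded = ".".join(levels)
    if not EC_FULL.match(padded):
        raise ValueError(f"not an EC number: {ec!r}")
    return padded
```

For example, "3" becomes "3.-.-.-" and "3.4.21" becomes "3.4.21.-", while a fully specified number passes through unchanged.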

PDB coverage

I just noticed that of the currently 123870 structures indexed in PDB, about half (i.e. 64778 as of now) are used in Wikidata, largely in statements added by this bot. Are you targeting the remaining ones as well? --Daniel Mietchen (talk) 18:20, 28 October 2016 (UTC)

@Daniel Mietchen: I add PDB structures to Uniprot IDs based on the PDBe SIFTS, so the number of PDB IDs grows based on Uniprot entries in Wikidata. Furthermore, I add PDB IDs to chemical compounds if they are an essential part (e.g. ligand) of a certain PDB structure (based on the chemical compound dictionary). But certainly, it would be interesting to have the full collection of PDBs in Wikidata; imho, that is only useful if the protein and a wider context about the species is also being imported (e.g. all genes and Uniprot-annotated proteins of a certain species). Sebotic (talk) 18:43, 28 October 2016 (UTC)
Thanks. Perhaps we could use the number of PDB structures that a species has as one criterion when deciding about whether to create items for genes/ proteins of a given species. --Daniel Mietchen (talk) 21:26, 28 October 2016 (UTC)
Wikidata items now exist for the PubMed IDs used in PDB. --Daniel Mietchen (talk) 11:32, 29 November 2016 (UTC)
@Daniel Mietchen: Thx, that's great, so I can include the direct reference to the original publication! Sebotic (talk) 18:57, 29 November 2016 (UTC)

Edits that only touch timestamp

Can this sort of thing have value, or is it just noise in the edit history? LeadSongDog (talk) 22:51, 21 November 2016 (UTC)

In some cases, particularly for scientific applications, it is valuable to know when a database (like Wikidata) that aggregates information from other sources was last updated. Occasionally checking for an update and marking the date - even if no values have changed - is thus useful. (It would certainly be great if we had an edit history viewer that could show edits by property.. the current view makes much more sense for a page of text than it does for a database...) --I9606 (talk) 23:55, 21 November 2016 (UTC)
Surely the date of last change is more relevant than the date last scanned for aggregation, but I'll acknowledge they both have some significance. It is akin to tracking both a publication-date and an archive-date for cite web instances. If the live url goes dead, it is helpful to know when it last was live. LeadSongDog (talk) 19:54, 30 November 2016 (UTC)

Long list of empty items

On User:Multichill/Kladblok permalink I made a list of items that don't have any sitelinks and don't have any statements. Quite a few of them were created by this bot; take for example no label (Q18066798). Can you please clean up your mess by either nominating unused items for deletion or populating them with data? Thank you, your friendly Wikidata janitor Multichill (talk) 19:51, 24 November 2016 (UTC)

@Multichill: I have nominated most, if not all, for deletion. I will continue monitoring both your list and the nominated items for deletion until they are all gone. May I ask how you generated this list? I would very much like to query for patterns - such as empty items - linked to our account. --Andrawaag (talk) 22:53, 24 November 2016 (UTC)
Thank you for nominating them Andra. query and result. In the revision table you can filter on the username id or the username. You can run this query on Toollabs or use the web based service at . Multichill (talk) 22:59, 24 November 2016 (UTC)
  • Why does the bot create and then empty these items?
    --- Jura 13:39, 30 November 2016 (UTC)
These additions were made in early 2015, basically because the bot then still relied on labels for concept resolution, and also due to an unforeseen interaction with another bot, which created erroneous statements. These statements were then removed, but in some cases this correction step led to empty items, which were hard to catch. Since then the bot has evolved, and currently concept resolution is done by checking for existing properties. I also like the toollabs suggestion above, because this makes catching these issues easier. --Andrawaag (talk) 15:31, 30 November 2016 (UTC)
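To make the difference from label matching concrete, here is a minimal sketch of identifier-based concept resolution. It is only an illustration under assumptions: the in-memory item store, the "external-id" key and the item IDs are hypothetical, and the bot's real code is not shown in this thread.

```python
def resolve_by_identifier(items, prop, value):
    """Return the ID of the single item whose `prop` claims contain `value`, else None."""
    matches = [item_id for item_id, claims in items.items()
               if value in claims.get(prop, [])]
    if len(matches) > 1:
        # Ambiguous match: refuse to guess rather than edit the wrong item.
        raise ValueError(f"{prop}={value} is claimed by several items: {matches}")
    return matches[0] if matches else None


# Hypothetical toy data: two items, each carrying one external identifier.
ITEMS = {
    "Q_example_1": {"external-id": ["1017"]},
    "Q_example_2": {"external-id": ["5649"]},
}
```

A record is then attached to the matched item, and a new item is created only when the lookup returns None, independent of how the labels happen to be spelled.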

Undo bot run

Would you please undo the botrun that added this stuff? That run added bad data. Do let me know. Thanks. Jytdog (talk) 21:46, 12 December 2016 (UTC)

this run of 776 diffs added listings of chemicals into the "drugs used for treatment" field of disease articles, using data from CHEMBL (bad) and NDF-RT (which ~might~ be OK).
this immediately prior run of 1169 diffs added entries in the "medical condition treated" field of the chemicals, using the same sources.
Those were from April of this year. Jytdog (talk) 22:10, 12 December 2016 (UTC)

The post above arises from this discussion at en-wiki WP:MED, by the way. I am looking forward to your reply. Jytdog (talk) 23:35, 12 December 2016 (UTC)

Or at least move the content to something different, like "chemical studied in" and "diseases studied in". Doc James (talk · contribs · email) (if I write on your page reply on mine) 00:08, 13 December 2016 (UTC)

@Jytdog, Doc James: apologies for the slow and brief reply -- busy IRL today. I still contest the label of "bad data", but concede that we don't have the data modeling quite right here yet. So we will remove those statements shortly and come up with a better data modeling plan. In the meantime, if either of you has a suggestion on definitive data sources that could be used to populate the "treats" properties that reflect current clinical best practices, we're all ears. (Right now, our team seems to be converging on dailymed...) More soon... Best, Andrew Su (talk) 04:32, 13 December 2016 (UTC)

Wonderful. I hope you remove them from both kinds of entries - the drugs as well as the diseases. Thanks very much! Jytdog (talk) 04:38, 13 December 2016 (UTC)
User:Andrew Su we are looking at converting infoboxes to more human useful info such as you see here[5]. After I am finished the update of the 380 or so leads of essential medicine articles (done 210 over the last 2 years which will likely take me one more year to finish) I plan to turn to this for the top 500 medical conditions.
You will then be able to pull the info with references from these new infoboxes. Approved indications are a start I guess as well. Doc James (talk · contribs · email) (if I write on your page reply on mine) 07:08, 13 December 2016 (UTC)

Format of OMIM-ID

I think you shouldn't add the prefix "PS" to OMIM ID (P492) like in Q268667 and a few more items. The prefix leads to inconsistencies because the majority of the values don't have it and it breaks the links to the OMIM webpage. --Pasleim (talk) 15:03, 1 January 2017 (UTC)

The prefix is not added by the bot, but comes from the source consulted (i.e. disease ontology). I will investigate and work towards a solution. Thanks for noting the issue. --Andrawaag (talk) 09:51, 2 January 2017 (UTC)

Fertilization and fertilisation

Hi, can we merge fertilization (Q14890574) (that you created) and no label (Q21192952) (with all Wikipedia articles)?

Thank you. Tubezlob (🙋) 20:51, 10 January 2017 (UTC)

There are also no label (Q14897831) (created by the bot) and isotype switching (Q2386881). Tubezlob (🙋) 12:29, 11 January 2017 (UTC)
Both items have been merged. We are looking into why the fertilization/fertilisation pair was not picked up by the bot. The other one is a classic case of synonyms missing from Wikidata. Thank you for pointing them out. Emitraka (talk) 18:05, 11 January 2017 (UTC)
I merged no label (Q22294724) and karyogamy (Q1454719) too. Should I post a note on this talk page when I merge? Tubezlob (🙋) 20:39, 11 January 2017 (UTC)


May Tinzaparin (Q20817252) get merged into Tinzaparin sodium (Q411198)? --Leyo 23:09, 12 January 2017 (UTC)

@Leyo: In this case, they should be merged, because the heparins are a mess regarding chemical characterization and all resources only talk about Tinzaparin sodium. But usually, salt form and pure compound should not be merged. Sebotic (talk) 23:33, 12 January 2017 (UTC)
Ok, I revise my earlier statement: they should not be merged, because at least FDA UNII has Tinzaparin and Tinzaparin sodium as separate substances, which is relevant as they influence drug formulations. This is the case for all 8 low molecular weight heparins. As they are chemically not well defined, they all carry the same CAS number and lack PubChem CIDs, etc. Sebotic (talk) 00:02, 13 January 2017 (UTC)
Well, then I guess it's not a good idea to have Tinzaparin as a label and Tinzaparin sodium as an alias (added by ProteinBoxBot). Could you please have a look? --Leyo 19:16, 13 January 2017 (UTC)
That was added one year ago; the data source back then was DrugBank. Bot code and data sources have changed in the meantime, so this should not happen anymore. The general issue is that these drug databases are not doing a good job of representing chemical data precisely, as they frequently do not discern between active compounds, salt forms and inactive compounds. E.g. the DrugBank entry's primary label is Tinzaparin but it has Tinzaparin sodium as a synonym. I have corrected that now. Sebotic (talk) 19:28, 13 January 2017 (UTC)

ProteinBoxBot seems to be reverting descriptions?

I changed the description of Q6993502 from "Human disease" (which is vague) to "Medical condition in newborn babies caused by drugs taken by the mother before birth". ProteinBoxBot appears to have reverted that. This is not exactly wrong (although it's more precisely described as "a medical condition" instead of "a disease"), but it's imprecise to the point of being nearly useless to any reader. Is there a way to stop ProteinBoxBot from "fixing" stuff that isn't broken? WhatamIdoing (talk) 19:22, 20 January 2017 (UTC)

This should not have happened. There are safeguards in place that should prevent this. I will look into it. --Andrawaag (talk) 09:02, 21 January 2017 (UTC)
Thanks. I do appreciate ProteinBoxBot filling in this slot when it's empty. WhatamIdoing (talk) 23:19, 21 January 2017 (UTC)
Sorry about that! This was due to a bug in our bot code that is now fixed. It won't change any disease descriptions unless they are blank. Thanks for notifying us. Gstupp (talk) 04:59, 22 January 2017 (UTC)
WhatamIdoing I also wanted to let you know that we went back into the logs and checked all disease items that we've modified descriptions for, and reverted all the others that had been overwritten (~400 items). We also checked them for vandalism before restoring them. See this run of edits here. Gstupp (talk) 20:13, 30 January 2017 (UTC)
Thanks for the update. You all are awesome. WhatamIdoing (talk) 03:56, 31 January 2017 (UTC)
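For readers wondering what a "fill only when blank" safeguard amounts to, here is a minimal sketch. The function name and signature are hypothetical; this is an illustration of the rule stated in this thread, not the bot's actual code.

```python
def description_to_write(existing, proposed):
    """Return the description the bot may write, or None if it must not touch the item.

    Rule illustrated: a description is only filled in when the current one is
    missing or blank; anything a human (or another tool) wrote is left alone.
    """
    if existing is not None and existing.strip():
        return None  # a non-empty description exists: never overwrite it
    return proposed
```

Under this rule, the edit to Q6993502 described above would have been skipped, because the item already had a hand-written description.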

Wrong edit


@Andrawaag, Sebotic: could you take a look at Special:Diff/439992602? It is (at least) partially wrong (a self-reference was added); not sure whether everything needs to be reverted or not. Cdlt, VIGNERON (talk) 14:29, 31 January 2017 (UTC)

Hi VIGNERON, I was testing out a new bot. It is fixed now. Thanks for letting us know. Gstupp (talk) 16:02, 31 January 2017 (UTC)

Wrong date

The bot seems to have parsed a date in a wrong way: [6]. "1986 Jul 10" has become "10. juli 2017". — Finn Årup Nielsen (fnielsen) (talk) 22:26, 2 March 2017 (UTC)

I have fixed it [7], but you might want to check the bot. — Finn Årup Nielsen (fnielsen) (talk) 22:27, 2 March 2017 (UTC)
Thanks Finn Årup Nielsen (fnielsen). Looks like the citoid api has parsed the wrong date [8]. I'm working on switching over to the europepmc API [9], which seems more accurate (and has ORCID IDs, when available). Gstupp (talk) 23:37, 2 March 2017 (UTC)
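As an illustration of strict parsing for date strings in the "1986 Jul 10" shape quoted above, here is a small sketch. It assumes the input always has the form "YYYY Mon DD" with English month abbreviations; the function name is hypothetical and this is neither the citoid nor the Europe PMC code.

```python
from datetime import date

# English three-letter month abbreviations, mapped explicitly so the result
# does not depend on the locale of the machine running the bot.
MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}


def parse_publication_date(text: str) -> date:
    """Parse 'YYYY Mon DD'; raise instead of guessing when the shape is different."""
    year, month, day = text.split()  # ValueError if there are not exactly 3 parts
    return date(int(year), MONTHS[month], int(day))  # KeyError on unknown month
```

The point of the design is that an unexpected format raises an error that can be logged and reviewed, rather than silently producing a date with the wrong year.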

Items to be deleted

In RFD there are a lot of items created by you that have been proposed for deletion. If you do not agree, you can participate in the debate. --ValterVB (talk) 21:08, 3 March 2017 (UTC)

@ValterVB: Thanks for notifying, I fixed the items and commented on the RFD. Sebotic (talk) 22:06, 3 March 2017 (UTC)


Can we merge no label (Q22329155) and ocelloid (Q20725146)? Thanks, Tubezlob (🙋) 16:18, 9 April 2017 (UTC)

Done, thanks Gstupp (talk) 19:46, 9 April 2017 (UTC)

standard atomic weight

Per CIAAW 2016 (and the 2015 Yb refinement), we have 84 standard atomic weight values. All clear and sourced. PubChem is not needed (and not trustworthy) as a secondary source. The aim here: let Wikidata have and produce the right "standard atomic weight". Other values are derived and suspect. -DePiep (talk) 22:08, 9 April 2017 (UTC)

Yes, these could (and most likely should) be used to calculate the monoisotopic weight. But it does not give you the actual atomic weight of a compound which exists in nature (or is being synthesized by a human). The question is what weights we want to have in WD. Do you know of a WD discussion which handles this topic? Sebotic (talk) 17:00, 10 April 2017 (UTC)

Instance of disease

User:Andreasmperu believes that one should use subclass of (P279), not instance of (P31) for diseases. I tend to agree. So you should adjust your bot. --Infovarius (talk) 11:30, 19 April 2017 (UTC)

I disagree. While there are reasonable ontological arguments that diseases are more logically classes than instances, there are also very clear practical arguments for assigning wikidata entities with high level types that allow users and software to understand basically what they are without the need to execute chains of inference. Note that a disease entity can be both an "instance of disease" and a subclassOf of some disease class. This is a lossless enhancement to the relevant items. This is a technique known as 'punning' that is now common practice in many semantic web circles, even making it into W3C documentation about the OWL ontology language (e.g. In fact, I think expanded use of this approach would be a great benefit to the utility of Wikidata on many fronts, not just diseases. --I9606 (talk) 04:11, 20 April 2017 (UTC)

--Micru (talk) 21:46, 24 August 2014 (UTC) Tobias1984 (talk) TomT0m (talk) Genewiki123 (talk) Emw (talk) 03:09, 9 September 2014 (UTC) —Ruud 16:15, 9 December 2014 (UTC) Emitraka (talk) 14:32, 14 October 2015 (UTC) Bovlb (talk) 19:10, 21 October 2015 (UTC) Peter F. Patel-Schneider (talk) 22:21, 23 October 2015 (UTC) ArthurPSmith (talk) 15:51, 5 November 2015 (UTC) --Daniel Mietchen (talk) 20:53, 3 January 2016 (UTC) --Harmonia Amanda (talk) 22:00, 27 February 2016 (UTC) --Lechatpito (talk) --Andrawaag (talk) 14:42, 13 April 2016 (UTC) --ChristianKl (talk) 16:22, 6 July 2016 (UTC) --Cmungall Cmungall (talk) 13:49, 8 July 2016 (UTC) Cord Wiljes (talk) 16:53, 28 September 2016 (UTC) DavRosen (talk) 23:07, 15 February 2017 (UTC) Vladimir Alexiev (talk) 07:01, 24 February 2017 (UTC) Pintoch (talk) 22:42, 5 March 2017 (UTC) Fuzheado (talk) 14:43, 15 May 2017 (UTC) YULdigitalpreservation (talk) 14:37, 14 June 2017 (UTC) PKM (talk) 00:24, 17 June 2017 (UTC) Fractaler (talk) 14:42, 17 June 2017 (UTC) Andreasmperu Diana de la Iglesia Jsamwrites (talk) Finn Årup Nielsen (fnielsen) (talk) 12:39, 24 August 2017 (UTC) Alessandro Piscopo (talk) 17:02, 4 September 2017 (UTC) Ptolusque (.-- .. -.- ..) 01:47, 14 September 2017 (UTC) Gamaliel (talk) --Horcrux92 (talk) 11:19, 12 November 2017 (UTC)

Notified participants of WikiProject Ontology

Intuitively I would also go for instance of (P31), but without having thought much about the issue and with no biology background. − Pintoch (talk) 06:43, 20 April 2017 (UTC)
In this particular case, tooth disease (Q7824266) does not seem like a specific disease, but indeed a generic class of diseases, and so should be part of a hierarchy of subclass relationships to "disease". However, specific named diseases such as anodontia (Q771310) I might expect to be instance of (P31). This really revolves ontologically around whether "disease" (and relatives like "dental disease") are metaclasses of a different order than the specific named diseases - human language is subtle enough that this is often not obvious, but I think that is the case here. ArthurPSmith (talk) 16:05, 20 April 2017 (UTC)
I want to reiterate that this isn't an either/or situation. 'Tooth disease' can be an instance of the metaclass 'disease' while also being a subclass of 'mouth disease'. It is very useful to have an 'instance of' statement attached to wikidata items because it greatly simplifies and speeds up basic, common queries like 'show me all the diseases'. The alternative is to write queries that traverse class hierarchies which, within wikidata, is both time and resource intensive and prone to errors because of the dynamic nature of the knowledge base. I wonder if a solution here would be to create a new property 'hasMetaClass' that could be used to connect items that describe what people think are classes to metaclasses like 'disease' without raising the 'instanceOf' / 'subclassOf' argument? --I9606 (talk) 18:07, 20 April 2017 (UTC)
Actually a class is linked with its metaclass by "instance of", so there is no need for other properties. And I am OK with having both P31 and P279 in one item, but we should distinguish simple classes and metaclasses. And I don't see why disease (Q12136) should be a metaclass. Maybe create some "type of diseases" metaclass? --Infovarius (talk) 20:52, 21 April 2017 (UTC)
If creating a 'disease metaclass' item is well-aligned with other uses of this pattern on Wikidata I think that would be fine for me. We would then carry that pattern forward in this domain with additional metaclass items for: biological process, molecular function, cellular component and the roots of any other large biomedical hierarchies that are added into wikidata. I'm not sure the roots of the current metaclass hierarchy make a lot of sense to me though. It's odd to me that class or metaclass of Wikidata ontology (Q21522864) is a subclass of Wikidata internal entity (Q21281405) subclass of Wikimedia internal stuff (Q17442446) subclass of MediaWiki page (Q15474042) subclass of web page (Q36774) subclass of electronic page (Q5358404).. etc. Perhaps there is a reference you could point out that explains why that structure is in place and why it would make sense to connect e.g. 'disease class' up into it? --I9606 (talk) 16:55, 24 April 2017 (UTC)
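The trade-off argued about in this thread (a direct "instance of" lookup versus walking the "subclass of" hierarchy) can be shown on a toy graph. The sketch below is only an illustration: the statements are hypothetical sample data built from labels mentioned in the discussion, not an extract of Wikidata, and the function names are made up.

```python
from collections import deque

# Hypothetical toy statements: every item carries both kinds of statement ("punning").
SUBCLASS_OF = {
    "mouth disease": {"disease"},
    "tooth disease": {"mouth disease"},
    "anodontia": {"tooth disease"},
}
INSTANCE_OF = {
    "mouth disease": {"disease"},
    "tooth disease": {"disease"},
    "anodontia": {"disease"},
}


def by_instance_of(target):
    """One flat lookup: every item with a direct 'instance of target' statement."""
    return {item for item, classes in INSTANCE_OF.items() if target in classes}


def by_subclass_traversal(target):
    """Breadth-first walk down the 'subclass of' hierarchy below target."""
    children = {}
    for child, parents in SUBCLASS_OF.items():
        for parent in parents:
            children.setdefault(parent, set()).add(child)
    found, queue = set(), deque([target])
    while queue:
        for child in children.get(queue.popleft(), ()):
            if child not in found:  # also protects against subclass loops
                found.add(child)
                queue.append(child)
    return found
```

On this toy data both functions return the same three items; the difference is that the traversal has to follow every level of the hierarchy and breaks as soon as one intermediate link is removed or wrong, while the flat lookup depends only on each item's own statement.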

Instance of and subclass of on same Item

I just noticed that the bot is adding instance of and subclass of statements with the same value to at least certain items, see for example Q27543253 or Q27549581. What are the plans for that? Do you add both on purpose (why?)? Cheers, Hoo man (talk) 06:54, 21 April 2017 (UTC)

@Hoo man: Please see discussion thread immediately above. --I9606 (talk) 16:43, 21 April 2017 (UTC)
We're currently adding both so that at any one point in time all genes have at least a subclass of or instance of gene. Once all the instances are added, I will delete all of the subclass of gene claims. Gstupp (talk) 17:30, 21 April 2017 (UTC)

RNA processing (Q2673589) and Post-transcriptional modification (Q417379)

Hi, I'm not sure about the English, but these two items are the same, aren't they? Thank you, Tubezlob (🙋) 12:28, 20 May 2017 (UTC)

@Tubezlob: I don't think they're exactly the same concept. If anything, the second seems to be a subclass of the first. But truthfully, I'm not sure how to handle Post-transcriptional modification (Q417379) since it's not anchored into any common biomedical ontology. In contrast, RNA processing (Q2673589) is drawn from Gene Ontology (Q135085). In any case, I do not think they should be merged... Best, Andrew Su (talk) 05:54, 24 May 2017 (UTC)
OK thank you! Tubezlob (🙋) 12:21, 24 May 2017 (UTC)

Importing drug data

Given that we now have National Drug Code (P3640) and European Medicines Agency product number (P3637), is there still a problem that needs to be solved before the relevant drug data can be imported from the FDA and EMA? I think importing that data would be very valuable. ChristianKl (talk) 22:10, 24 May 2017 (UTC)

Hi ChristianKl. Are you talking about drug indications or other data about drug products such as active ingredients, concentrations, route of administration, dosage form, appearance, scheduling, etc.? For indications, we lack an accurate, up-to-date, structured, CC0 resource for this data, although we do have the less up to date NDF-RT indications already in. We've also added about 1k EMA pharmaceutical products, linking them to their active ingredients, EMA and RxNorm CUIs (where possible). For example: Abraxane (Q29003737). The drug indications from the EMA need more work to accurately model, and we probably need input from the community on how to do this. For other information about specific packages, such as dosage, appearance, count, I was under the impression we wouldn't be creating an item for each package and this wouldn't be added. Gstupp (talk) 00:34, 25 May 2017 (UTC)
The EMA provides on their website information about marketing approval for drugs that's worth importing. I did write them an email and they don't have problems with their data getting imported into Wikidata:
Dear Christian
Many thanks for wanting to share our data in Wikidata. We believe that opening up of public data can contribute to foster more transparency and better collaboration with our stakeholders, therefore we would be happy to ease the access to our medicines database.
Our website uses a controlled medical vocabulary called Medical Subject Headings (MeSH). Further information can be found below:
Let us know if you have any further questions.
In the example of Abraxane (Q29003737) EMA gives "Carcinoma, Non-Small-Cell Lung", "Pancreatic Neoplasms" and "Breast Neoplasms" as therapeutic areas for which the drug is approved. That information could be imported with medicine marketing authorization (P3464). Given that we have the MeSH names of the illnesses already in our database it should be possible to match them.
I think that OpenFDA has information about marketing authorizations in the US that's CC0 licensed. ChristianKl (talk) 10:57, 25 May 2017 (UTC)
For drug indications, while we can just import the MeSH terms (such as "Breast Neoplasms", etc.), as approved therapeutic areas, there are several issues with doing this.
1) These aren't 100% accurate and need some manual curation. The "Therapeutic area" from EMA does not mean it is approved for use for that condition. For example: Akynzeo has as therapeutic area: "Cancer", however it is actually used to prevent chemotherapy-induced nausea & vomiting. This would be pretty easy to go through and curate. But more importantly:
2) The therapeutic areas lack a lot of detail. For example, a therapeutic area for Abraxane, "Carcinoma, Non-Small-Cell Lung", does not capture the fact that it's indicated for "first-line treatment of locally advanced or metastatic non-small cell lung cancer, in combination with carboplatin, in patients who are not candidates for curative surgery or radiation therapy". I think it's important to capture at least some of this in a structured way too, but it will require effort both to develop a way of modelling it appropriately and to actually curate the unstructured text.
On the other hand, it's probably also useful to have the therapeutic areas by themselves. But I don't think they should go onto drug used for treatment (P2176) without handling the specifics. What do you think?
The OpenFDA API does not have structured indications; it's all free text.
Gstupp (talk) 17:44, 25 May 2017 (UTC)
@ChristianKl, Gstupp: National drug file reference terminology (NDF-RT) from veterans affairs/FDA also does not provide the staging info and other diagnostic criteria required for an MD to prescribe a medication. These things are very hard to model, require a lot of manual curation to even get them structured and they are also very hard to query, bc every drug has a bunch of these and for every disease type, these are somewhat different (e.g. cancer vs bacterial infection). So I vote for just importing them now instead of trying to capture all of that info. Bc if we wait for that, it's never going to happen. Sebotic (talk) 18:02, 25 May 2017 (UTC)
If something isn't accurate in the sense that the string can't be found in MeSH, we might tell the EMA about inaccuracies in their database. I would expect that they are happy to fix inaccuracies in their dataset.
Given the past experience with drug used for treatment (P2176), you are right that it might not be the best move to use it for this purpose. Maybe we should simply create a new property with the name "therapeutic area"? ChristianKl (talk) 19:14, 25 May 2017 (UTC)
I like the idea of adding these EMA "Therapeutic area"s under a new property: "therapeutic area" as opposed to using drug used for treatment (P2176). This would address the issues previously brought up with using this property, and is more applicable to this EMA data. I have the EMA data already mostly curated here. I can propose the property, if that makes sense to all in this discussion... Gstupp (talk) 21:45, 25 May 2017 (UTC)
I like this option too, though out of an abundance of caution, I'd suggest running this idea by the WD:Medicine folks. In fact, I think we can just do that by doing this... Andrew Su (talk) 21:54, 25 May 2017 (UTC)

Doc James
Daniel Mietchen
Andrew Su
Projekt ANA
Pavel Dušek
Was a bee
Chris Mungall
Dr. Abhijeet Safai
Sami Mlouhi
Netha Hussain
Abhijeet Safai
Notified participants of WikiProject Medicine

@Gstupp: I like the idea of you writing the proposal for the new property. ChristianKl (talk) 22:18, 25 May 2017 (UTC)
I got the medicine ping. Sorry, what is the question here? Please ping me again when you state it. Blue Rasberry (talk) 23:22, 25 May 2017 (UTC)
Hi @Bluerasberry: and others from wp:medicine. The question is this: The EMA has "Therapeutic Areas" listed for drugs that they authorize (example). These therapeutic areas do not mean the same thing as "drug used for treatment" because they lack a lot of specificity (see my 3rd level reply above for examples). Should we add these under a new property "Therapeutic Area"? (And reserve "drug used for treatment" for more specific, accurate, and curated statements) Gstupp (talk) 00:06, 26 May 2017 (UTC)
@Gstupp: I am not sure. I like the use in the example with Abraxane (Q29003737) above, and in that case, having a "therapeutic area" property seems useful. I am not familiar with this term "therapeutic area" though. Is this a defined term? How strict is it? If we made the property, would it be used for non drugs, like for example, can physiotherapy (Q186005) be matched with "therapeutic area" for sports injury (Q2093360), or counseling (Q4390239) for mental disorder (Q12135)? Blue Rasberry (talk) 18:48, 26 May 2017 (UTC)
I don't think therapeutic area would apply only to drugs, but could apply to procedures also. For some context on the term, there are some broader ones defined here. In the context of EMA drugs, the subject of the therapeutic area would be a drug, and the values would be mesh terms. But as constraints for the property, the subject would be some type of intervention (drug or procedure) and the value would be a disease, symptom, condition, or phenotype. Gstupp (talk) 21:15, 26 May 2017 (UTC)
Also, FYI, I have the information from EMA parsed/curated here: Gstupp (talk) 23:00, 26 May 2017 (UTC)
If we import some data by matching MeSH ID, I think the imported statement should carry not only the EMA URL (or ID?) but also the MeSH ID used to match its object item. That will help us fix the statement when the MeSH ID is not suitable for the Wikidata item. The same is true of other properties whose values are Wikidata items matched by external IDs. --Okkn (talk) 23:55, 25 May 2017 (UTC)
FYI I made a property proposal. @ChristianKl: @Bluerasberry: @Okkn:
According to the FDA, "The term “therapeutic area” also includes diagnostic and preventive areas. Some areas may represent a disease/domain area." (per footnotes on p.4). So it is somewhat ambiguous by design. LeadSongDog (talk) 18:15, 7 June 2017 (UTC)


Hi Andrawaag, how can one contribute translations for the descriptions the bot makes? Jon Harald Søby (talk) 21:35, 12 June 2017 (UTC)



Can we merge no label (Q22274764) and fermentation (Q41760) ?

Thank you. Tubezlob (🙋) 16:40, 6 July 2017 (UTC)

Done, thanks! Gstupp (talk) 17:20, 6 July 2017 (UTC)


Looking for descriptions starting with "a ..", I came across a few generated by this bot (sample correction). For future runs, maybe the initial "a", underscores in text and final dots can be removed directly.
--- Jura 14:16, 7 September 2017 (UTC)

Hi Jura, I've made an issue on our github to address this. Gstupp (talk) 23:30, 7 September 2017 (UTC)
Is there an official style guide for descriptions? The suggestions seem reasonable, but unless they were part of an official style guide I think keeping the status quo could be convincingly argued as well (to keep in sync with the source databases). Thoughts? Andrew Su (talk) 00:23, 8 September 2017 (UTC)
We have one: It's Help:Descriptions. --Izno (talk) 20:02, 8 September 2017 (UTC)
Thanks for the link, very helpful... Best, Andrew Su (talk) 20:28, 8 September 2017 (UTC)
  • Maybe I need to ping the operators as well @Andrawaag, Sebotic:.
    --- Jura 07:26, 8 September 2017 (UTC)
Both Andrew and Gstupp are also part of the team maintaining this bot. I concur with Andrew that maintaining the descriptions as they are in the original source makes sense. --Andrawaag (talk) 07:45, 8 September 2017 (UTC)
It would be good to have an operator for this bot who is knowledgeable about Wikidata rules. We already had to block it in the past and you should make sure that this does not recur.
--- Jura 08:03, 8 September 2017 (UTC)
I've updated the bot to take care of this: Gstupp (talk) 23:08, 2 October 2017 (UTC)
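A minimal sketch of the three clean-ups requested at the top of this thread (drop the initial "a", replace underscores, remove final dots) is given below. The function name is hypothetical and this is not the change that was actually made to the bot; Help:Descriptions remains the reference for what a description should look like.

```python
import re


def clean_description(text: str) -> str:
    """Apply the three clean-ups suggested in this thread to a source description."""
    text = text.replace("_", " ").strip()                   # underscores in text
    text = re.sub(r"^a\s+", "", text, flags=re.IGNORECASE)  # initial article "a"
    return text.rstrip(". ")                                # final dots
```

Only a leading "a" followed by whitespace is removed, so descriptions that merely start with the letter a are left untouched.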

ICD-10-CM and ICD-10 codes are not necessarily interchangeable.

I've reverted your change of the ICD-10 code over at REM sleep behavior disorder (Q2103933). The G47.52 code is not found in ICD-10. This can be double-checked by searching for it on the online version of ICD-10.[10] Whilst it is within ICD-10-CM, this now has its own property: ICD-10-CM (P4229) Little pob (talk) 20:17, 23 September 2017 (UTC)

I've edited the bot to use ICD-10-CM and switched over codes to the appropriate system (as determined by the Disease Ontology). Gstupp (talk) 23:29, 25 September 2017 (UTC)

Cerebrovascular disease and stroke

Hi! The en label of stroke (Q12202) should be "stroke", according to enwiki, and DOID:6713 (cerebrovascular disease) should be linked to cerebrovascular disease (Q3010352), instead of stroke (Q12202). Thanks. --Okkn (talk) 14:28, 22 October 2017 (UTC)

Hi Okkn. I think I've cleaned up the items and made an issue on DO. Gstupp (talk) 18:11, 2 November 2017 (UTC)

Thank you, Gstupp! --Okkn (talk) 10:18, 3 November 2017 (UTC)

Help with some ontology issues in diseases?

I've been cleaning up some ontology issues in wikidata, but I'm struggling to figure out what to do with diseases. One set of problems is cycles - x subclass y subclass x. The list of remaining cases is here and you'll see that all the ones left there are in the area of diseases. Looking at them, the relationships seem to be based on the Disease Ontology in ways that don't make sense. For example, aseptic meningitis (Q4804182) is stated as a subclass of viral meningitis (Q3301664) based on the DOID's, but logically the subclass relationship should be the other way round, if those definitions are correct (viruses are a subset of non-bacterial causes). Is this a problem in the Disease Ontology, or with the relationships that have been entered here? We really need an expert or two to help out on this, I'd appreciate if you aren't able to do it if you could point to somebody who can help. Thanks! ArthurPSmith (talk) 15:24, 2 November 2017 (UTC)

Thanks ArthurPSmith, I will look into this. Gstupp (talk) 18:12, 5 November 2017 (UTC)
I've gone through and fixed some of the ones that were added by users (not from Disease Ontology) that are wrong. There are others, like the virus one, that are possibly issues with DO, but are confusing to me. I've shared this with the DO team and will update. Gstupp (talk) 00:01, 8 November 2017 (UTC)
Thanks! There's also one three-level cycle in diseases maybe you could check out too? Wikidata:WikiProject Ontology/Problems/3rd-order subclass of self - I was trying to sort it out myself but got very confused between the different language wikipedias what was going on here regarding macular degeneration. ArthurPSmith (talk) 16:30, 9 November 2017 (UTC)
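For anyone who wants to reproduce this kind of check locally, here is a rough sketch of finding "x subclass of y subclass of x" loops (and longer ones) in a set of subclass statements. The sample data is hypothetical and only mimics the meningitis case described above; the function name is made up, and the simple recursive search is meant for small extracts, not for the whole of Wikidata.

```python
def find_subclass_cycles(subclass_of):
    """Return every subclass loop once, rotated to start at its smallest member."""
    cycles = set()

    def visit(node, path):
        for parent in subclass_of.get(node, ()):
            if parent in path:
                cycle = path[path.index(parent):]
                k = cycle.index(min(cycle))
                cycles.add(tuple(cycle[k:] + cycle[:k]))  # canonical rotation
            else:
                visit(parent, path + [parent])

    for start in subclass_of:
        visit(start, [start])
    return sorted(cycles)


# Hypothetical statements mimicking the two-level loop discussed in this thread.
SAMPLE = {
    "aseptic meningitis": {"viral meningitis"},
    "viral meningitis": {"aseptic meningitis", "meningitis"},
    "meningitis": {"disease"},
}
```

Running find_subclass_cycles(SAMPLE) reports the single loop between the two meningitis items; deciding which of the two statements is the wrong one still needs a domain expert.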

“remove deprecated statements”

You changed some items with the edit summary “remove deprecated statements” (example). What does that mean, and what shall we do with the remaining, almost empty items? —MisterSynergy (talk) 08:17, 3 November 2017 (UTC)

Hello MisterSynergy, Thanks for pointing this out. This is the result of inconsistencies in the external IDs between Robinow Syndrome and its subclasses. The DO bot failed to update the statements on this item because the external IDs conflicted with the external IDs in another item. You can see the error in the log file row 1337.
I've added back the DO ID on this, so when the bot runs again it will re-add the current statements. I've also created an issue in DO to fix the IDs.
There are 34 other items in the log that probably also have this issue, so I'll take a look at them as well.
Gstupp (talk) 00:14, 8 November 2017 (UTC)

Ok, I've fixed up the others. Found using this SPARQL query: link Gstupp (talk) 23:29, 8 November 2017 (UTC)

Thanks, looks good indeed! MisterSynergy (talk) 08:54, 9 November 2017 (UTC)
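
The kind of conflict described above, where an update is skipped because an external ID on one item is also claimed by another item, can be checked before a bot run. The sketch below is only an illustration under stated assumptions: the function name `find_conflicting_ids`, the input layout, and the sample identifiers are all hypothetical, and this is not the DO bot's actual logic.

```python
from collections import defaultdict


def find_conflicting_ids(items):
    """Map each external ID claimed by more than one item to those items.

    items: dict of item ID -> set of (database, value) external-ID pairs.
    """
    owners = defaultdict(set)
    for qid, ext_ids in items.items():
        for ext_id in ext_ids:
            owners[ext_id].add(qid)
    # Keep only the external IDs that appear on two or more items.
    return {ext_id: sorted(qids)
            for ext_id, qids in owners.items()
            if len(qids) > 1}


# Made-up data: Q1 and Q2 both claim the same ID in database "DB_A".
sample = {
    "Q1": {("DB_A", "100")},
    "Q2": {("DB_A", "100"), ("DB_B", "7")},
    "Q3": {("DB_B", "8")},
}
print(find_conflicting_ids(sample))
```

Items listed in the result would need a manual look, or a fix upstream, before the statements are rewritten.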

has listed ingredient (P4543) and drugs

We imported a lot of data about drugs from OpenFDA. As far as I remember, the reason we only store has active ingredient (P3781) was that at the time we had no good property for the other ingredients. ChristianKl () 19:46, 22 November 2017 (UTC)

Hi ChristianKl, Most of the active ingredient information actually came from the EMA. (Example) The inactive ingredients are sadly not available in a structured form (that I can find). Same story with OpenFDA. It looks like many drug labels have brand name and active ingredients (along with UNII, so we don't have to string match!!), but no other ingredients. Those only exist in the free-text package labeling and would be a lot of work to pull out and normalize. In addition, the indications are not structured in any way, which is why we grabbed them from EMA. See the openFDA field in this. Gstupp (talk) 20:02, 22 November 2017 (UTC)

This bot created a subclass loop!

Preprotein translocase subunit SecE (Q24738466) was just made a subclass of Protein translocase SEC61 complex, gamma subunit (Q24768152), which was just made a subclass of Preprotein translocase subunit SecE (Q24738466)! Something's gone wrong there.

Also, any progress on resolving the remaining subclass loops in diseases? See Wikidata:WikiProject Ontology/Problems/subclass of subclass of self - thanks! ArthurPSmith (talk) 15:02, 28 November 2017 (UTC)

Hello Arthur. You're too fast! The bot was in the middle of a bot run and hadn't finished updating all items when you posted. The run has completed and I don't see any loops.

I submitted two issues for the remaining subclass loops 1 2 Gstupp (talk) 20:27, 28 November 2017 (UTC)

@Gstupp: yes, it looks better, thanks, and thanks for posting those issues! ArthurPSmith (talk) 21:33, 28 November 2017 (UTC)

cell (Q7868) subclass of (P279) cellular component (Q5058355)?

Cell is a part of a cell? --Fractaler (talk) 07:22, 20 December 2017 (UTC)

Yes. According to the reference: Gene Ontology, cell is a cellular_component. Subclass does not mean "part of". Gstupp (talk) 18:22, 20 December 2017 (UTC)

Now cellular component (Q5058355) has the description "part of a cell". Right? Fractaler (talk) 06:07, 21 December 2017 (UTC)