User talk:Andrawaag

Welcome to Wikidata, Andrawaag!

Wikidata is a free knowledge base that you can edit! It can be read and edited by humans and machines alike and you can go to any item page now and add to this ever-growing database!

Need some help getting started? Here are some pages you can familiarize yourself with:

  • Introduction – An introduction to the project.
  • Wikidata tours – Interactive tutorials to show you how Wikidata works.
  • Community portal – The portal for community members.
  • User options – including the 'Babel' extension, to set your language preferences.
  • Contents – The main help page for editing and using the site.
  • Project chat – Discussions about the project.
  • Tools – A collection of user-developed tools to allow for easier completion of some tasks.

Please remember to sign your messages on talk pages by typing four tildes (~~~~); this will automatically insert your username and the date.

If you have any questions, please ask me on my talk page. If you want to try out editing, you can use the sandbox to try. Once again, welcome, and I hope you quickly feel comfortable here, and become an active editor for Wikidata.

Best regards! --Tobias1984 (talk) 18:10, 1 August 2014 (UTC)

ProteinBoxBot

I think your bot is editing while logged out, so I'm going to block the IP until your bot is logged back in. See here. --AmaryllisGardener talk 14:03, 11 October 2014 (UTC)

~ 1000 duplicates

Hello, looks like your bot created ~1000 duplicate items yesterday. Please see Wikidata:Database reports/Constraint violations/P351#"Unique value" violations. — Ivan A. Krestinin (talk) 17:34, 16 October 2014 (UTC)

@Ivan A. Krestinin: Thanks for pointing this out. I will work towards a fix. Andrawaag (talk) 19:56, 16 October 2014 (UTC)
Hey Andrawaag. You could use WikidataQuery to check whether an item with a certain claim already exists. For example, [1] returns all items with the claim Entrez Gene ID (P351)=1017. However, checking every record this way will probably slow down your bot significantly. Another option is to fetch all items and their P351 values in a single call with [2]. --Pasleim (talk) 07:09, 18 October 2014 (UTC)
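The second option above can be sketched in Python. This is only an illustration: it assumes the single call has already been made and its result turned into a plain dictionary from item ID to P351 values. The names `build_index` and `duplicates` and the item IDs `Q_A`, `Q_B`, `Q_C` are hypothetical and not part of ProteinBoxBot or WikidataQuery.

```python
from collections import defaultdict


def build_index(item_to_values):
    """Invert {item: [values]} into {value: {items}} for fast existence checks."""
    index = defaultdict(set)
    for item, values in item_to_values.items():
        for value in values:
            index[value].add(item)
    return index


def duplicates(index):
    """Return every identifier value that is claimed by more than one item."""
    return {value: sorted(items) for value, items in index.items() if len(items) > 1}


# Hypothetical snapshot of "all items and their P351 values" from one call.
snapshot = {"Q_A": ["1017"], "Q_B": ["1018"], "Q_C": ["1017"]}
index = build_index(snapshot)

# Before creating an item, the bot would look the identifier up locally.
already_exists = "1017" in index
```

Looking up an identifier in the local index avoids one remote query per record, at the cost of working from a snapshot that may be slightly out of date.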

OMIM ID (P492)

Hi, your bot imported some values yesterday that created some violations; I tried a few of them with just the number (not the MTHU part), but they all still seem invalid, so I'm wondering if maybe you used the wrong property for this import? Jon Harald Søby (talk) 15:35, 12 February 2015 (UTC)

User_talk:ProteinBoxBot#Given_names

Please see the comment above. --- Jura 09:47, 9 May 2015 (UTC)

@Jura: Thanks for noticing. The duplicates were unfortunately created in a bot effort on May 5th. I did respond to the comments by filing a bulk deletion request, but unfortunately that didn't go well. Since I have been travelling, I haven't been able to respond earlier. I am working towards a fix. Andrawaag (talk) 22:48, 10 May 2015 (UTC)

p2175 and p2176

medical condition treated (P2175) and drug used for treatment (P2176) are ready. Some discussion about labels and descriptions still needed. --Tobias1984 (talk) 12:29, 5 October 2015 (UTC)

ProteinBoxBot Mistake?

I think that this edit by ProteinBoxBot is a mistake because the formerly related article en:Huntingtin perfectly fits the description of the label. And the UniProt ID is also the same (P42858). --Sonabi (talk) 17:55, 8 October 2015 (UTC)

The bot is removing wikidata entries in many items just to add the entry on another item, whereby the first item describes the protein and the second the related gene, like here and for the gene describing item here. What is the sense of that? --Sonabi (talk) 22:58, 8 October 2015 (UTC)

@Sonabi: It is not a mistake; it is done on purpose and is motivated by the release of arbitrary access. The Wikidata model allows a given Wikipedia page to be linked from only one item. This is limiting in the sense that a protein page can contain a lot of gene information. Due to this one-page limit, a link to the appropriate Wikipedia page can't be set on the Wikidata item for that gene. However, with arbitrary access in place, gene information can be harvested from Wikidata items on genes to be used on Wikipedia pages on proteins. Moving the Wikipedia links from protein Wikidata items to the appropriate gene items is needed to start using arbitrary access in our workflows. Andrawaag (talk) 14:10, 9 October 2015 (UTC)
Yes, but the actual problem, which I forgot to mention, is that the bot is not transferring the entries of the other languages while transferring the English entries. The result is that the connection between the related articles in the different languages is going to be lost. Or will the bot also move these entries in the future? --Sonabi (talk) 14:53, 9 October 2015 (UTC)
@Sonabi: It's crucial for successful use of Wikidata content within gene articles on all the Wikipedias that we make use of a stable data structure. After quite some discussion on Wikidata 1 2 and on EN Wikipedia [3], the community decided on a model where the structured information that would typically be consolidated in a single textual article is distributed across multiple, interlinked Wikidata items corresponding to genes, proteins (or other gene products), and orthologous genes. To make use of that structure to build Wikipedia infoboxes, we need the interwiki links to originate in the gene. During an initial import of Wikipedia articles into Wikidata items, many of the gene pages were brought over and classified as protein articles. These edits are correcting that and improving the data. Note that nothing is being lost. All the data in the protein items that isn't more appropriate for the gene (e.g. the gene expression images) is still there and can be reached from the gene via the encodes property, including the labels in the other languages. We now have two test articles up in EN Wikipedia that use this structure to compose their infoboxes directly from Wikidata. See ARF6 and RREB1. It should be possible to re-use these patterns for any language Wikipedia; you would just need to set the interwiki link from the gene item to the correct page in that language's Wikipedia. We know what these should be in EN Wikipedia, but not necessarily in other language wikis. If you can help us identify the correct links in your languages, we could help with those mappings. --I9606 (talk) 16:10, 9 October 2015 (UTC)
@Sonabi: Just to emphasize that the connections are not lost. If a Wiki article is interwiki-linked to a protein data item and properties are moved from the protein item to a gene item, all of that information can still be retrieved by requesting it through the 'encoded by' connection between the protein and the gene. The conversion that Andrawaag is implementing is just adding content and giving it a better semantic structure that has been agreed upon by the wikidata molecular biology community.--I9606 (talk) 16:28, 9 October 2015 (UTC)

RE: Proteins and genes should not be merged

Hello. I didn't intend to merge proteins with genes; I'm well aware of their differences. I happened to be reading Tafazzina on it.wiki and noticed that the same article exists on en.wiki (Tafazzin), only unlinked, so I linked one with the other. Just compare the two links: these are the same protein, not the corresponding gene (called TAZ), so maybe the problem lies in the Wikidata information. But those are the same thing, that's for sure. Khruner (talk) 13:56, 17 November 2015 (UTC)
EDIT - For some reason the Italian article seems to encompass both the protein and the gene, so maybe the problem lies here. Khruner (talk) 14:02, 17 November 2015 (UTC)

Phenotype information

I am learning now how the pywikibot interface works. And then the first step that would seem interesting is to start adding taxon ids to most of the genomes as still many genomes do not have this information. How shall I begin with this? Make my own taxon script and dsmz script or should I merge this into the bot you have or are there other guidelines? I tried to find your email but was unsuccessful to that end.

--jjkoehorst (talk) 06:10, 8 January 2016 (UTC)

@Jjkoehorst: Our bot is based on a Python framework we call ProteinBoxBot. It is basically a two-layer framework consisting of a core layer which takes care of communicating with Wikidata and dealing with different Wikidata issues (e.g. duplicate resolution). On top of this core layer, called PBB_Core, resource-specific scripts are developed and maintained. @Sebotic has written a nice blog post that might get you started. Extensive documentation is maintained in our project's Bitbucket repository. However, having the bot written is only part of the solution. We typically follow this workflow with a new resource:
  1. Make sure the data license attached to the source allows distributing content on Wikidata (CC0)
  2. Check for existing records from your resource in Wikidata and make sure they are all correct and accurate
  3. Model 1 or 2 representative records from the resource under consideration
  4. During the modeling process it will become clear whether or not all needed properties exist in Wikidata. If not, you need to propose the required properties
  5. When all properties are in place either develop your bot or run your developed bot on your model items. These should not be broken by your bot.
  6. Run on 10 items
  7. Run on 100 items
  8. Run on 1000 items
  9. Once confident enough run on all.

I typically leave time in between the subsequent runs for possible issues to surface.
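The last steps of the workflow above (10, 100, 1000, then all) can be sketched as follows. This is a minimal illustration, not PBB_Core code: `staged_batches`, `run_staged`, `write` and `looks_ok` are hypothetical names, and the sketch assumes that each stage reprocesses a growing prefix of the records and that writes are idempotent.

```python
def staged_batches(records, stages=(10, 100, 1000)):
    """Yield progressively larger prefixes of `records`, ending with all of them."""
    for size in stages:
        if size >= len(records):
            break
        yield records[:size]
    yield records


def run_staged(records, write, looks_ok, stages=(10, 100, 1000)):
    """Write each stage, then pause for review; stop at the first stage that fails.

    Returns the size of the last stage that passed review (0 if none did).
    """
    completed = 0
    for batch in staged_batches(records, stages):
        for record in batch:
            write(record)
        # In practice this is where one would wait for community feedback.
        if not looks_ok(batch):
            break
        completed = len(batch)
    return completed


# Stage sizes for a hypothetical resource of 250 records.
sizes = [len(batch) for batch in staged_batches(list(range(250)))]
```

Keeping the review step as a callback makes the pause between runs explicit, so a problem found after 10 items never reaches the 100-item stage.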

I am a bit hesitant to share my email address here, but if wanted you can reach me with a DM on twitter (handle: @andrawaag) --Andrawaag (talk) 16:42, 8 January 2016 (UTC)

bot deleting and altering data

Hello Andrawaag.

The bot User:ProteinBoxBot made an enormous number of changes recently. Among these are edits with the tag "wikisyntax", where it deletes most of the data (for example [3]), and subsequent edits where it changes the type from enzyme to protein family (for example [4]), which doesn't seem to be correct. Can you halt and verify this activity? Hummingbird (talk) 23:18, 11 January 2016 (UTC)

It even undid one of my edits: https://www.wikidata.org/w/index.php?title=Q21149193&diff=prev&oldid=291208902 -- numbermaniac (talk) 01:22, 12 January 2016 (UTC)
@Numbermaniac: Just to quickly replicate my reply also here: Sorry for that, I tried to move the interwiki link to a new item by undoing earlier changes, because the Wikipedia page deals with the enzyme class and not with the human specific type of this enzyme. Will create a new item manually.
@Hummingbird: Hi, sorry for the confusion, I would like to explain what is going on right now. In recent days, I took care to clean up the Wikidata data model for genes and diseases in order to align it with what was discussed in the Wikidata project molecular biology [5]. According to this discussion, interwiki links from Wikipedias should go to Wikidata gene items (subclass gene), and only if the topic is really only about the species-specific protein should the link go to the protein. This data model is also required to be able to populate the Gene Wiki info boxes with our new 'info box gene' module in English Wikipedia [6], which will build the info box entirely from data fetched from Wikidata. See also our preprint here, explaining details: [7].
What specifically happened in recent days:
  • I merged all orphaned items which were "found in taxon": 'human' and 'subclass': 'protein' but did not have identifiers except their label, into Wikidata human gene items. This affected ~ 2,800 items and it also unified ~800 interwiki links of different Wikipedia languages on the human gene items. I did these merges via script supported manual curation, so most of them should be correct now.
  • There were also ~350 protein items which had interwiki links on them linking to Wikipedia Gene Wiki pages. Unfortunately, some of these links also went to enzyme classes or protein families, not to Gene Wiki pages. Some of these got hijacked recently and transformed the enzyme class/protein family items into human protein items. In the first case, I moved and unified the interwiki links onto the Wikidata human gene items. In the second case, I reverted the changes made earlier to reestablish the enzyme class/protein family. This is what you saw as deleted information in your example, but in total not more than ~100 items were affected by these deletions. The deleted protein information will be re-added to Wikidata as new items in the coming days and linked to the human genes accordingly. This seemed to be the best solution to untangle the protein family/human protein problem. For the enzyme classes and protein families, I will go through all of them and add the enzyme classification numbers and other useful information, so these can be used as subclass of and instance of values on Wikidata species specific protein items like human or mouse. I did these merges also by script assisted manual curation, so this should be quite reliable.
  • Gene ontology term cleanup: I also did extensive Gene Ontology term cleanup to remove wrong terms from Wikidata human and mouse protein items. You can see that because almost all constraint violations for Gene Ontology terms now are cleared [8][9][10][11]. In the coming days, we will also do a fresh import of proteins directly from Uniprot, also involving Gene ontology terms.
In summary, most Gene Wiki Wikipedia pages should now link to their correct Wikidata human gene item and most of the orphaned items, which would confuse users and do not make sense in a human gene/human protein and mouse gene/mouse protein data model, as described above, have now been unified and cleaned up. Except for the enzyme classes/protein families, no data has been deleted, and for those, we are about to re-add the data. I hope this gives you an overview of what I did, looking forward to your comments/suggestions. Best, Sebotic (talk) 07:34, 12 January 2016 (UTC)
Hello Sebotic. In the specific cases I had mentioned, it didn't make sense to me, so I just wanted to make sure that it wasn't a case of a bot that got out of control. As long as this is a planned action, I'm calm. Thanks for the reply. Hummingbird (talk) 10:19, 12 January 2016 (UTC)

Modification of items

Hello, I am transferring data from botulinum toxin (Q208413) to botulinum toxin type A (Q4095199) because there are different types of botulinum toxin. botulinum toxin (Q208413) will focus on the general features of all toxins (type A to G) and botulinum toxin type A (Q4095199) will focus on botulinum toxin type A. I don't know how this might affect your bot's drug data, so please take care if you later update the data. Thank you Snipre (talk) 13:36, 19 January 2016 (UTC)

@Snipre: Hi! Thanks, that's an important cleanup step to perform. ProteinBoxBot will not touch any item which does not have at least one of a set of unique core identifiers (either Drugbank ID, ChEBI, ChEMBL, Pubchem, UNII), so in the Botox case, it would only touch item Q4095199. If the identifiers cannot be mapped reliably, no data will be written, but a conflict will be logged for manual inspection of the item. In case no item with the appropriate identifiers can be found, a new item will be created.
I guess the ChEMBL ID on the general botox item Q208413 should also be transferred to Botox A or deleted? I did a similar cleanup recently, separating the generalized topic of Vitamin B from the actual chemical compound Cyanocobalamin. What also seems to require a lot of cleanup is stereoisomeric compounds, e.g. amino acids and sugars. Very frequently, I see Wikidata items containing a mix of identifiers for all 3 possible cases (e.g. the L-, D- and DL-mix forms). We will not get around manual cleanup here, I think. Best Sebotic (talk) 20:10, 19 January 2016 (UTC)
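The matching rule described in this reply (touch only items carrying a core identifier, log a conflict when the mapping is unreliable, create a new item when nothing matches) can be sketched roughly like this. It is not the actual ProteinBoxBot code: `resolve_item`, the lowercase property keys and the identifier values `DB_X`, `CH_X`, `CH_Y`, `U_X` are hypothetical placeholders, and "unreliable" is assumed here to mean either several matching items or a disagreement on another core identifier.

```python
# Core identifiers named in the reply, as hypothetical dictionary keys.
CORE_ID_PROPS = ("drugbank", "chebi", "chembl", "pubchem", "unii")


def resolve_item(record, items):
    """Decide what to do with `record` given existing `items` ({qid: {prop: value}}).

    Returns ("update", qid), ("conflict", [qids]) or ("create", None).
    """
    core = {p: v for p, v in record.items() if p in CORE_ID_PROPS}
    # Items without any core identifier can never match, so they are never touched.
    matches = sorted(
        qid for qid, ids in items.items()
        if any(ids.get(p) == v for p, v in core.items())
    )
    if not matches:
        return ("create", None)
    if len(matches) > 1:
        return ("conflict", matches)
    qid = matches[0]
    # Assumption: a match that disagrees on another core identifier is unreliable.
    if any(p in items[qid] and items[qid][p] != v for p, v in core.items()):
        return ("conflict", [qid])
    return ("update", qid)


# The two items from this thread, with placeholder identifier values.
items = {"Q208413": {}, "Q4095199": {"drugbank": "DB_X", "chembl": "CH_X"}}
decision = resolve_item({"drugbank": "DB_X"}, items)
```

Matching on identifiers rather than labels or aliases means the cleaned general item is simply invisible to the lookup, which is the behaviour the reply describes.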
Sebotic I cleaned the general item about Botulinum Toxin so you won't find any identifier about the type A there. Snipre (talk) 20:43, 19 January 2016 (UTC)
By the way, can you have a look at these 2 items, Neurotensin (Q419576) and NTS (Q14904891)? One is the gene and one is the protein, but they have the same PubChem ID 16129680. I am having trouble determining which is the correct molecule. Thanks. Snipre (talk) 21:01, 19 January 2016 (UTC)
@Snipre: I fixed that one, in addition to the fact that the PubChem ID was outdated, a Pubchem ID should certainly not be on a Wikidata gene item. This info was added by KrBot, which seems to take data from Wikipedia info boxes and add these to Wikidata. I am not sure if these kinds of imports are really useful anymore for domains like genes/proteins or drugs where we do the imports from primary, authoritative databases. Thx! Sebotic (talk) 22:03, 19 January 2016 (UTC)

Update item Q19856779

Hello, I transferred some data from mitomycin (Q417625) to mitomycin (Q19856779) but without the reference. Please add that item in your next update session. I added only the identifiers used to extract the other ones from external databases. Snipre (talk) 20:23, 24 January 2016 (UTC)

Please check if these items can be merged

Hello, can you check if your bot didn't create duplicates for the following cases:

Thanks Snipre (talk) 22:08, 2 February 2016 (UTC)

@Snipre: Hi! The bot can create duplicate items on purpose under clearly defined circumstances. The bot searches for items which have a certain set of unique IDs (Drugbank, Pubchem, ChEMBL, CHEBI, KEGG, Inchi_key). If it does not find an existing item on that basis, it creates a new item. This is what happened here. Item Ephedra (Q20817199) got created on 13th August 2015, but item Ephedra (Q13530468) only got added the Drugbank ID on 13th October, so before 13th October item Ephedra (Q13530468) could not have been recognized as the appropriate item by my bot. This is done, because matching labels or aliases can cause extensive problems by writing to the wrong item(s). For data consistency, it is better to create a new item and merge it later on, than to produce Wikidata items with the wrong data on them. Should at some point both items have one of the IDs mentioned above, they will be detected by my bot. You can either merge these items or I will do it in a few days, so you have time to recapitulate how these duplicate items came into existence. Sebotic (talk) 23:06, 2 February 2016 (UTC)
@Sebotic: OK, thanks for the explanation. For Ephedra we can keep both items separate, one for the drink (medicinal preparation) and one for the molecule. But I didn't find a clear definition of the molecule, so that's why I am wondering whether the molecule really exists. To me, DrugBank is not clear about the difference between a mixture of molecules and a single molecule. Snipre (talk) 08:20, 3 February 2016 (UTC)

P2888

exact match (P2888) is ready. --Tobias1984 (talk) 19:28, 5 June 2016 (UTC)

Problems with ProteinBoxBot

Hello, Andrawaag. I've just described a problem with ProteinBoxBot on its talk page. (It's topic #28). I don't know how often you check that page, but perhaps you'd like to take a look. Thanks for your attention. Akhooha (talk) 20:22, 4 July 2016 (UTC)

Hello Akhooha, I have just responded to you on that page. --Andrawaag (talk) 20:36, 4 July 2016 (UTC)

Share your experience and feedback as a Wikimedian in this global survey

  1. This survey is primarily meant to get feedback on the Wikimedia Foundation's current work, not long-term strategy.
  2. Legal stuff: No purchase necessary. Must be the age of majority to participate. Sponsored by the Wikimedia Foundation located at 149 New Montgomery, San Francisco, CA, USA, 94105. Ends January 31, 2017. Void where prohibited. Click here for contest rules.

Unused properties

This is a kind reminder that the following properties were created more than six months ago: MGI Gene Symbol (P2394), UCSC Genome Browser assembly ID (P2576). As of today, these properties are used on fewer than five items. As the proposer of these properties you probably want to change the unfortunate situation by adding a few statements to items. --Pasleim (talk) 19:08, 17 January 2017 (UTC)

@Pasleim: Thanks for the reminder. MGI Gene Symbol (P2394) has been added to the workflow and is used on more items now. UCSC Genome Browser assembly ID (P2576) points to reference genomes. For now we only cover 4. We plan to extend this in the near future, but for now I hope it is okay to have this small number of items. --Andrawaag (talk) 17:54, 19 January 2017 (UTC)

Your feedback matters: Final reminder to take the global Wikimedia survey

(Sorry to write in English)

Disambiguation pages standing in Dutch election

Great work on adding the election candidates. However, there are a few places where you've marked the details on a disambiguation page rather than the actual candidate:

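# Items with a candidacy in election (P3602) that are
# Wikimedia disambiguation pages (Q4167410).
SELECT ?item ?itemLabel
WHERE {
  ?item wdt:P3602 ?election; wdt:P31 wd:Q4167410 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "nl" . }
}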

--Oravrattas (talk) 07:06, 5 March 2017 (UTC)
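For anyone who wants to run the query above from a script rather than the query editor, here is a small Python sketch. It assumes the public Wikidata Query Service endpoint at https://query.wikidata.org/sparql and the standard SPARQL JSON results layout; the helper names `query_url` and `item_ids` are hypothetical, and the sketch only builds the request URL and parses a response, leaving the HTTP call itself to the reader.

```python
from urllib.parse import urlencode

ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """SELECT ?item ?itemLabel
WHERE {
  ?item wdt:P3602 ?election; wdt:P31 wd:Q4167410 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "nl" . }
}"""


def query_url(query, endpoint=ENDPOINT):
    """Build a GET URL that asks the endpoint for JSON results."""
    return endpoint + "?" + urlencode({"query": query, "format": "json"})


def item_ids(response_json):
    """Extract bare item IDs (e.g. 'Q42') from a SPARQL JSON result."""
    bindings = response_json["results"]["bindings"]
    return [row["item"]["value"].rsplit("/", 1)[-1] for row in bindings]


url = query_url(QUERY)
```

An empty list from `item_ids` would mean no candidacy statements remain on disambiguation pages.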

Thanks for reporting these. They are fixed now.

--Andrawaag (talk) 14:15, 5 March 2017 (UTC)

Descriptions

Hi Andrawaag, descriptions are not the start of a sentence, so they normally do not begin with a capital letter. For items such as Jan Baas (Q28872044) you are probably better off using something like "Nederlands politicus" ("Dutch politician"). Multichill (talk) 18:02, 13 March 2017 (UTC)

I have put an overview at User:Sjoerddebruin/Dutch politics/Tweede Kamerverkiezingen 2017. It is starting to look good! Multichill (talk) 19:54, 13 March 2017 (UTC)
And in case you have got a taste for it: 2012. Multichill (talk) 20:31, 13 March 2017 (UTC)
@Multichill: To be honest, I have rather got a taste for it. But it may be more interesting to first add the results per candidate, as soon as they are available. I also still need to find some time. Right now it is relatively labour-intensive because I do everything via QuickStatements. The bot accounts I have access to do not have task permission for election data. That said, shall we create a bot account together, specifically for election data? It would in any case be interesting to see whether the Wikidata Integrator platform that we now use for genes, proteins and diseases is also generically applicable to other domains. --Andrawaag (talk) 21:59, 14 March 2017 (UTC)
Politicians who are active at the national level in the Netherlands are a domain where quite a lot of work has already been done. That is why I had filed it under User:Sjoerddebruin/Dutch politics.
You can create a bot account for your own small projects, such as the politicians. By now I think I have about a dozen bots for all kinds of different tasks.
I am active in all sorts of domains myself, but lately mainly paintings. We now have more than 200,000 of those as well :-) Multichill (talk) 17:57, 15 March 2017 (UTC)
Do you have a handy way to add [12] to the right person? Multichill (talk) 15:16, 1 April 2017 (UTC)

stated in: Banque-Carrefour des Entreprises

This reference lacks precision: is it a work with a page number, a URL? --Jmh2o (talk) 13:45, 14 May 2017 (UTC)

Absolutely. I have added the link to Q16626729. --Andrawaag (talk) 10:33, 15 May 2017 (UTC)

any reason to use P31 and P279 for the same value?

tinyurl.com/lardh6y

Is it true that every Q14912958-gene is indistinguishable from every other Q14912958-gene?

I would treat grain of sand as a class (P279 of some other class), even though we cannot distinguish the vast majority of them, because we can isolate any single grain and claim P31 grain of sand for it. d1g (talk) 10:06, 15 May 2017 (UTC)

biological pathway

I'm not sure: can we merge no label (Q28864279) (that you created) and biological pathway (Q4915012)? Thank you, Tubezlob (🙋) 19:49, 16 May 2017 (UTC)

Belgian post code

Hi,

There is a discussion right now on Wikidata:Bistro (Topic:Trdng8v51r0sq785) about the unusual duplicate post codes you added to several Belgian communes. Could you explain why you did this (in French or in English, as you prefer)?

Regards, VIGNERON (talk) 11:20, 28 May 2017 (UTC)