I created a new tool to simplify adding main subjects to our datasets. Feel free to try it out 😃 https://www.wikidata.org/wiki/Wikidata:Tools/ItemSubjector
Topic on User talk:Jsamwrites
@So9q Thanks. I will test it.
Hi, I see you did a huge batch with 8000+. I usually try to keep batches to 2-3000 at most for now, because with anything larger I usually suspect that there is a subclass that I have missed. In this case there seem to be 5000+ articles with "aspiration cytology", which is currently missing an item. Now that the batch has already finished, we need https://github.com/dpriskorn/ItemSubjector/issues/5 implemented to replace the subject.
I created Q108601075 :)
This looks good :) https://editgroups.toolforge.org/b/CB/c8f12a880f1f/
@So9q Thanks for the tool and for finding 'aspiration cytology'. https://github.com/dpriskorn/ItemSubjector/issues/5 will be very useful.
Another interesting feature could be to disable "alias" https://github.com/dpriskorn/ItemSubjector/issues/7.
Hi! I published a pre-release with many improvements, including ignoring aliases that you requested above.
See https://github.com/dpriskorn/ItemSubjector/releases/tag/v0.2-alpha0
@So9q Thanks. I will use this pre-release version
New pre-release out with new features :) WDYT?
@So9q I see a lot of interesting features in this version. Great job!! But there are a couple of issues.
- I think the DEBUG mode is switched on by default. I see a lot of messages on the screen.
- I want to try the JOBS on Toolforge, but it's not very clear how I can run prepared jobs on Toolforge/PAWS.
- I tried the -l option. It doesn't work anymore. I think this option has been replaced by the -a option.
But with the -a option, I have to press Enter for every new scholarly article. I am not sure whether you are introducing it as a feature replacing the previous batch option.
@So9q Updated the above reply. I think I checked out the wrong branch. Things are working fine. I struck out some previous comments.
Here is a gift for you :) https://github.com/dpriskorn/ItemSubjector/blob/prepare-batch-improved-structure/Kubernetes_HOWTO.md
@So9q Thanks for this tutorial.
After doing ssh, I ran the following command (assuming that itemsubjector already exists) and got the following error:
$ become itemsubjector
You are not a member of the group tools.itemsubjector.
Any existing member of the tool's group can add you to that.
Yeah, you have got to create your own tool in the web interface and name it whatever e.g. "itemsubjector-jsam" and use that.
I updated the guide with links.
@So9q Thanks for the updated guide. I am now able to ssh and run.
Great! I see you are editing a lot. Counting all your edits for September and October until now, you have 1M edits! If all of them are main subject statements, then you have single-handedly taken Wikidata from 14M to 15M total main subjects on the 37M articles. Wow!
See https://qlever.cs.uni-freiburg.de/wikidata/zZAhrs which times out in Blazegraph.
@So9q Thanks. You are right. I am checking the statistics here to see the change: https://scholia.toolforge.org/statistics
According to this query we had 27M articles without any main subject before. I'm curious to see how many it is after our effort. At best, it is 1M less, so still 26M to go! :D
We are now almost down to 26M lacking P921! Nice work. 25M is our next target. How many weeks do you think it will take? 3?
@So9q thanks. I have been monitoring it here: https://scholia.toolforge.org/statistics Today, the value is at 17,064,544 and on September 25, it was 15,831,145. So, I would say around 3 weeks.
That is a different measurement 😃 My search lists all articles without any P921.
How many reverts have you got per 100,000 edits? I got a few when I matched Canada, so I stopped matching countries.
@So9q I recall a couple of them from some weeks ago. The first one was 'Systemic therapy', which was ambiguous since it occurs in two different fields: psychology and cancer therapy.
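To make the "articles without any P921" measurement concrete, here is a minimal sketch of how such a count could be phrased in SPARQL, wrapped in a small Python helper. It is not necessarily the query behind the QLever link above. The helper name `count_articles_without` is hypothetical, and the class Q13442814 (scholarly article) is an assumption about which items count as "articles".

```python
# Standard Wikidata prefixes, declared explicitly so the query does not
# depend on an endpoint predefining them.
PREFIXES = (
    "PREFIX wd: <http://www.wikidata.org/entity/>\n"
    "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
)


def count_articles_without(prop: str = "P921", article_class: str = "Q13442814") -> str:
    """Build a SPARQL query counting items of a class that have no statement for prop.

    Hypothetical helper; the defaults (P921 = main subject,
    Q13442814 = scholarly article) are assumptions for illustration.
    """
    return (
        PREFIXES
        + "SELECT (COUNT(?article) AS ?count) WHERE {\n"
        + f"  ?article wdt:P31 wd:{article_class} .\n"
        + f"  FILTER NOT EXISTS {{ ?article wdt:{prop} ?value . }}\n"
        + "}"
    )


query = count_articles_without()
print(query)
```

Note that this counts articles lacking the property entirely, which is a different figure from the number of "main subject" values shown on the Scholia statistics page.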
- Q108744083 (Newly created one)
- Q1929812
One possible suggestion for improvement in ItemSubjector:
possible warnings when there is use of the property P1889 (different from), i.e., there are two items with the same label.
The second one was alcoholism vs. alcoholism treatment. I think that I added the former to an item which already had the latter. So, I am now careful with single-word labels.
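The P1889 suggestion above could be sketched roughly as follows. This is only an illustration under stated assumptions, not how ItemSubjector implements it: the function name `different_from_targets` is hypothetical, and the sample is hand-written in the shape of Wikidata's entity JSON (the statement linking the two items is assumed for the example, not fetched from Wikidata).

```python
def different_from_targets(entity: dict) -> list:
    """Return the QIDs an item is marked 'different from' (P1889).

    Expects a dict shaped like Wikidata's entity JSON, where
    entity["claims"]["P1889"] is a list of statements.
    """
    targets = []
    for statement in entity.get("claims", {}).get("P1889", []):
        datavalue = statement.get("mainsnak", {}).get("datavalue")
        if datavalue:  # 'no value' / 'unknown value' snaks carry no datavalue
            targets.append(datavalue["value"]["id"])
    return targets


# Hand-written sample (assumed for illustration, not real fetched data).
sample = {
    "id": "Q108744083",
    "claims": {
        "P1889": [
            {
                "mainsnak": {
                    "snaktype": "value",
                    "property": "P1889",
                    "datavalue": {
                        "type": "wikibase-entityid",
                        "value": {"entity-type": "item", "id": "Q1929812"},
                    },
                }
            }
        ]
    },
}

targets = different_from_targets(sample)
if targets:
    print(f"Warning: {sample['id']} is marked as different from {', '.join(targets)}")
```

A tool could show such a warning before a batch is approved, so that an ambiguous label like 'Systemic therapy' gets a second look.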
Ok, on it. Tracked here https://github.com/dpriskorn/ItemSubjector/issues/26
Fixed and working on master :)
@So9q Thanks. I will test it.
Are you participating in WikidataCon 2021? I will be sharing my personal experiences here: https://pretalx.com/wdcon21/talk/CUQXWE/ Will you be able to attend this session?
Thanks for creating that and for inviting me. I feel honored. I signed up for WikidataCon and I will do my very best to attend. Would you like me to prepare something?
I just created a new query that might interest you/the participants: Wikidata:SPARQL query service/queries/examples#Galaxies ordered by the ones that are most linked from scientific articles
Also this query now times out on Blazegraph. A similar query works on QLever on older data from a few months back. I'm planning to set up a QLever instance in Toolforge in the near future.
Also: Did you try the new feature I pushed on master yesterday? :D
@So9q I tested the newest feature. It will be quite useful for certain subjects. I may need to test it more.
I did not yet test the Issue 26 fix.
Thanks for your reply concerning WikidataCon. I feel that the participants may have questions about ItemSubjector. If possible, you could present some of its key features.
Alright, I'll try to prepare a little demo video of the tool in action so they get a feeling for the interface.
People might want a QS similar front end, which could be a fun project.
I personally would like to make multiple batches and not have to wait for one to end before the next can be started.
I pushed QuickStatements export support to master (first step towards a WebUI) :) I did not find time to prepare a video, unfortunately.
If I don't make it into the meeting, please ask the participants whether they would like a Web UI for the tool that integrates with QS. See https://github.com/dpriskorn/ItemSubjector/issues/29
There has been very little traffic in the gitrepo so far so I wonder if a WebUI is worth the effort.
@So9q No worries. I am talking about ItemSubjector in my presentation. Please find the slides here https://figshare.com/articles/presentation/Making_Scholarly_Articles_Findable_Towards_Ensuring_F_of_FAIR_Data_Principles/16910203
I will try the latest changes in upcoming days. Once again thanks.
@So9q Tested this today: https://github.com/dpriskorn/ItemSubjector/issues/26. It's working. Thanks for the update.
@So9q I tested the QS export option. It's working well. Great option. Thanks.
One observation: the generated CSV file is missing the 'inferred from' reference (maybe it's a desired feature).
Oh, that's a bug. I'll open an issue
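For readers unfamiliar with QuickStatements, here is a minimal sketch of what one exported main subject statement with an optional reference could look like. It uses the tab-separated V1 command syntax, where a reference property is written with an `S` prefix instead of `P`; this is not necessarily the CSV layout the tool produces. The function name `qs_v1_line` is hypothetical, and the reference pair is passed in by the caller because the thread does not spell out which property and value the export should carry.

```python
from typing import Optional


def qs_v1_line(article: str, subject: str,
               ref_prop: Optional[str] = None,
               ref_value: Optional[str] = None) -> str:
    """Build one QuickStatements V1 command adding P921 (main subject).

    Hypothetical helper for illustration only.
    """
    fields = [article, "P921", subject]
    if ref_prop and ref_value:
        # In V1 syntax a reference property is written as S<number>.
        fields += ["S" + ref_prop[1:], ref_value]
    return "\t".join(fields)


# Q4115189 is the Wikidata Sandbox item, used here as a stand-in article.
print(qs_v1_line("Q4115189", "Q108601075"))
```

Passing `ref_prop` and `ref_value` appends the reference to the same line, which is the part reported as missing from the export above.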
We are almost down to 25M articles missing P921 now. :) I'm working on medicine and there are thousands of subjects to cover still...
@So9q Yes, 25.16M :)
@So9q < 25M articles now :)
Update and new version!
@So9q Thanks for the update. I will take a look. Scholia statistics show that 20 M "main subject" values are now available.
Do you know how many P921 statements there were before I made the tool?
I have been working on Q35456 lately and will work on all of the medicines for the foreseeable future. Using --sparql with --limit in the newest version makes the tool way better IMO, because it first fetches (while I walk the dogs), and then I can approve/disapprove in one go at the end, start the job in k8s and leave it running until it finishes.
My biggest k8s job has been around 30k items so far, running for 8 hours or more. Now I'm thinking 100k jobs and a runtime of 24h might be easy to achieve.
@So9q Based on what I remember, I think the number was somewhere between 15M and 16M (Scholia Statistics). I may have to dig through my tweets. I had shared some screen captures in the beginning. Check this. It shows the difference: around 15M to 17M (October 2021). I think that a majority of them come from ItemSubjector.
Documenting essential medicines (Q35456) will be useful. Wow, 100k jobs will be great.
I just started an 88k batch. Easy, it took a few minutes to run all the queries and a few minutes to review. 😀
@So9q I am not able to run the latest version or the 0.3-alpha2.
I created the following issue: https://github.com/dpriskorn/ItemSubjector/issues/49