Wikidata:Requests for permissions/Bot/AitalvivemBot
- The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
- Approved. I normally let all the issues settle down, but here the main issues have been solved, the remaining side issues can be sorted out (please do), and I am travelling starting tomorrow, so I am boldly approving the bot now.--Ymblanter (talk) 20:32, 30 July 2019 (UTC)[reply]
AitalvivemBot (talk • contribs • new items • new lexemes • SUL • Block log • User rights log • User rights • xtools)
Operator: AitalvivemBot (talk • contribs • logs)
Task/s: Move the lexicographical data from an XML file (exported from the Lo Congrès dictionary database) to Wikidata. These data will later be used as a foundation for building natural language processing (NLP) tools for the Occitan languages (Lengadocian and Gascon).
Code: You can find my code here https://github.com/aitalvivem/AitalvivemBot. This must be used with the pywikibot framework.
Function details:
- The bot reads the lexicographical data from an .xml file containing, for each lexeme, a lemma, some forms, a lexical category, some grammatical features, some claims and a language.
- To add a form to the Wikidata database, the bot first checks whether the relevant lexeme exists and gets its id. If the lexeme does not exist, the bot creates it.
- The bot adds the form to the relevant lexeme.
- The bot adds the claims if needed.
--Aitalvivem (talk) 12:37, 29 July 2019 (UTC)[reply]
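The function details above can be sketched as follows. This is only an illustrative outline, not the code from the linked repository: the XML layout, the attribute names, the choice to attach claims to forms, and the `InMemoryStore` class are all assumptions made up for this example. The store stands in for Wikidata so that the sketch runs offline, without pywikibot.

```python
import xml.etree.ElementTree as ET

# Hypothetical input layout; the real Lo Congrès file may look different.
SAMPLE_XML = """
<lexemes>
  <lexeme lemma="camin" language="oc" category="noun">
    <form representation="camin" features="masculine singular">
      <claim property="P276" value="gasconha"/>
      <claim property="P276" value="lengadoc"/>
    </form>
    <form representation="camins" features="masculine plural">
      <claim property="P276" value="gasconha"/>
    </form>
  </lexeme>
</lexemes>
"""


class InMemoryStore:
    """Offline stand-in for Wikidata (hypothetical, not a real API)."""

    def __init__(self):
        self.lexemes = {}  # (lemma, language, category) -> lexeme dict
        self._next_id = 1

    def get_or_create_lexeme(self, lemma, language, category):
        """Return the existing lexeme for this key, creating it if missing."""
        key = (lemma, language, category)
        if key not in self.lexemes:
            self.lexemes[key] = {"id": f"L{self._next_id}", "forms": []}
            self._next_id += 1
        return self.lexemes[key]

    def add_form(self, lexeme, representation, features, claims):
        """Add a form with its claims, unless the same form is already there."""
        for form in lexeme["forms"]:
            if (form["representation"], form["features"]) == (representation, features):
                return form
        form = {
            "representation": representation,
            "features": features,
            "claims": claims,
        }
        lexeme["forms"].append(form)
        return form


def import_lexemes(xml_text, store):
    """Read every lexeme from the XML text and push it into the store."""
    root = ET.fromstring(xml_text)
    for lexeme_el in root.iter("lexeme"):
        lexeme = store.get_or_create_lexeme(
            lexeme_el.get("lemma"),
            lexeme_el.get("language"),
            lexeme_el.get("category"),
        )
        for form_el in lexeme_el.iter("form"):
            claims = [
                (claim_el.get("property"), claim_el.get("value"))
                for claim_el in form_el.iter("claim")
            ]
            store.add_form(
                lexeme,
                form_el.get("representation"),
                tuple(form_el.get("features").split()),
                claims,
            )
    return store
```

Running `import_lexemes(SAMPLE_XML, InMemoryStore())` twice on the same store leaves a single lexeme with two forms, which is the "check first, create only if missing" behaviour described in the list.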
- @User:AitalvivemBot: What kind of „usefull datas“ would you like to introduce here? --Succu (talk) 20:56, 1 July 2019 (UTC)[reply]
- @Succu: We want to introduce all the lexemes from 3 languages (French, Occitan Lengadocian and Occitan Gascon) used by our online dictionary. At first this will cover the lemmas, the forms, the translation relationships between words, and the variant relationships. Later we may add definition relationships.--Aitalvivem (talk) 09:49, 2 July 2019 (UTC)[reply]
- @Aitalvivem: I'm Nicolas (from the meeting we just had). When you are ready, it would be good to do a test run, like importing 5-10 lexemes, so we can check everything is ok.
- Hi Nicolas, of course I will do a test run, but the bot is far from ready for now. I will let you know when I am ready, if you are interested!--Aitalvivem (talk) 18:41, 3 July 2019 (UTC)[reply]
- @Succu: for now, this is just the first step of a long project and the bot is not ready yet; Aitalvivem is making first contact and connecting with the Wikidata community. The „usefull datas“ are lexicographical data for the Lexeme namespace.
- Cheers, VIGNERON (talk) 13:55, 2 July 2019 (UTC)[reply]
- Please use the sandbox items or test environment for testing. Unless you add actual data, don't do this by creating new items in the live environment. I asked for Q65028409 (label "en-value"@en) and Q65010246/Q65028365 to be deleted. --- Jura 16:01, 4 July 2019 (UTC)[reply]
- @Jura: Sorry for this mistake, I thought that the live environment would not accept my requests as long as I did not have the bot permission.--Aitalvivem (talk) 09:12, 5 July 2019 (UTC)[reply]
- As I wrote on Lexicographical data, please check carefully that all the data you will import are licenced under CC0. Pamputt (talk)
- Of course, I work for Lo Congrès and we have already split our data to be sure that all the data that will be imported are licensed under CC0.--Aitalvivem (talk) 08:16, 11 July 2019 (UTC)[reply]
Hello everyone, on the advice of @VIGNERON I am going for a test run of my bot AitalvivemBot with 6 lexemes.
Here are the expected results :
Lexeme : acomiar, nature : verb, language : occitan
- form : acomiar, grammatical feature : infinitive, claim : P276 -> gasconha
Lexeme : faunic, nature : adjective, language : occitan
- form : faunic, grammatical feature : adjective masculine singular, claim : P276 -> gasconha
- form : faunica, grammatical feature : adjective feminine singular, claim : P276 -> gasconha
Lexeme : camin, nature : Common noun, language : occitan
- form : camin, grammatical feature : masculine singular, claim : P276 -> gasconha, claim : P276 -> lengadoc
- form : camins, grammatical feature : masculine plural, claim : P276 -> gasconha, claim : P276 -> lengadoc
Lexeme : passejada, nature : Common noun, language : occitan
- form : passejada, grammatical feature : feminine singular, claim : P276 -> lengadoc
- form : passejadas, grammatical feature : feminine plural, claim : P276 -> lengadoc
Lexeme : penós, nature : adjective, language : occitan
- form : penós, grammatical feature : masculine singular, claim : P276 -> lengadoc
Lexeme : rétroflexe, nature : adjective, language : french
- form : rétroflexe, grammatical feature : masculine singular
- form : rétroflexe, grammatical feature : feminine singular
- form : rétroflexes, grammatical feature : masculine plural
- form : rétroflexes, grammatical feature : feminine plural
--Aitalvivem (talk) 13:19, 29 July 2019 (UTC)[reply]
And here are the results: acomiar (L57833), faunic (L57834), camin (L57835), passejada (L57836), penós (L57837), rétroflexe (L57838). There was an error in the feminine form of faunic (L57834), but it came from the XML file (I corrected it manually). Another error occurred in the claims of the lexeme camin (L57835), but it is now fixed. To be sure that the error was fixed I added another lexeme, polit (L57839), and everything worked just fine.--Aitalvivem (talk) 14:32, 29 July 2019 (UTC)[reply]
- @Aitalvivem: it sounds good and the lexemes look good too. Maybe run one more, bigger test (maybe 20-30 lexemes), just to be sure and to spot any last errors. Cheers, VIGNERON (talk) 16:05, 29 July 2019 (UTC)[reply]
- These seem fine. I was a little surprised to see location (P276) used to indicate the region - @VIGNERON: is that what we're doing for lexemes now? I haven't tried expressing that myself yet. ArthurPSmith (talk) 17:36, 29 July 2019 (UTC)[reply]
- @ArthurPSmith: sorry, we talked about it off-wiki. I'm not sure it is « what we're doing for lexemes now », since we have never really stored this kind of data, but it was the most logical way to do it. It was decided to do it like that instead of just discarding this information (especially as it is very important). As this is structured data, it can always be changed if a better structure is found afterwards. And of course, if you have a better idea right now, don't hesitate to share it ;) Cdlt, VIGNERON (talk) 17:46, 29 July 2019 (UTC)[reply]
- I can't think of anything better right now! Go ahead, thanks! ArthurPSmith (talk) 17:48, 29 July 2019 (UTC)[reply]
- @ArthurPSmith: As VIGNERON said, we thought that this was the best way to give a dialect for a lexeme. But if there is a better property, or if a new one is created (like I asked here), this would not change anything for the bot: we will just have to replace one property with another in the XML file. Aitalvivem (talk) 08:23, 30 July 2019 (UTC)[reply]
- Any objections left here?--Ymblanter (talk) 19:18, 29 July 2019 (UTC)[reply]
- there is still the pending issue about the CC0 licence. Has an OTRS ticket been opened on this topic? Pamputt (talk) 02:44, 30 July 2019 (UTC)[reply]
- @Pamputt: Sorry, but I don't know what an OTRS ticket is. I read the Wikipedia page about OTRS but I didn't really understand how it works. Aitalvivem (talk) 08:23, 30 July 2019 (UTC)[reply]
- (Originally written in French, since all three of us speak French.) An OTRS ticket is simply an email sent to a system (also called OTRS) that is used to manage and archive emails. On Wikimedia projects this system is used in many ways, but here the point is to keep a record of the permission to publish under the CC0 licence. But honestly, @Pamputt: is that really necessary here? It seems excessive to me. On the one hand, the data are being imported by their owner (OTRS is rather meant for imports by third parties; if we do not trust an institution to state the right licence through its user account, why would we trust it to state the same licence by email?), and on the other hand the data concerned are not really "copyrightable" (see in particular meta:Wikilegal/Lexicographical Data). A better solution might be to cite the sources (since, as I understand it, these are public-domain dictionaries, aren't they?). Cdlt, VIGNERON (talk) 08:40, 30 July 2019 (UTC)[reply]
- @VIGNERON: Thanks for the clarification! @Pamputt: The dictionary from which we will upload the data will be free of rights, but it does not exist yet. We are currently collecting rights-free data from various dictionaries in order to create a new one. Once it has been created, we can go through OTRS if that is safer for you (you will just have to explain to us how to do it ;) ). Aitalvivem (talk) 09:19, 30 July 2019 (UTC)[reply]
- @Pamputt: Hi! I'm the one who asked Aitalvivem to program the bot, as I work at Lo Congrès (the institution that wants to contribute its data). About the licence: the data we want to contribute (which have not been created yet) will be under an open licence (I just have to confirm CC0 with my director). As soon as I have his answer, I will confirm this to Wikidata by e-mail using the standard authentication protocol. The data we intend for Wikidata are only data we built ourselves, so there won't be any rights restrictions. Unuaiga (talk) 15:13, 30 July 2019 (UTC)[reply]
- @VIGNERON, Unuaiga: regarding the licence, it is more a matter of principle, but since Lo Congrès has so far published these data under a CC BY licence, this raises the question of the change of licence. And of course I am talking about data that can be protected by copyright. I have not seen anything of the kind in the tests carried out so far, but since importing senses was mentioned, it seems safer to me to settle this question now, because once the bot has been authorised to run it will be harder to check compliance with any licence. Pamputt (talk) 16:28, 30 July 2019 (UTC)[reply]
- @Pamputt: I will let Unuaiga and/or Aitalvivem confirm or deny, but my understanding was that importing senses was never planned (indeed, this point matters for the question of copyright protection). Cdlt, VIGNERON (talk) 16:47, 30 July 2019 (UTC)[reply]
- @Pamputt, VIGNERON: At the beginning we considered importing the data from the Basic, a dictionary that we created ourselves. But the fact that it goes in the Occitan-French direction complicated things, so we have set the idea aside for now. We may later build another bot, with a different username, to import this kind of data. Unuaiga (talk) 16:56, 30 July 2019 (UTC)[reply]
- OK, in that case there is no licence problem, as long as only grammatical information is touched and not senses. Pamputt (talk) 17:09, 30 July 2019 (UTC)[reply]
- Could we see 10 lexeme creations for French (not all of the same lexical category). --- Jura 04:39, 30 July 2019 (UTC)[reply]
- @VIGNERON: @Jura: Ok, I am going for a bigger test run with something like 30 lexemes (including 10 lexemes for French), with as much diversity in the categories as possible. Aitalvivem (talk) 08:23, 30 July 2019 (UTC)[reply]
I ran the second test with 30 Lexemes and here are the results : abrutit (L57921) to simetricament (L57940) for the Lexemes in Occitan and absent (L57941) to dérober (L57950) for the Lexemes in French. Everything worked as expected. There were two little errors I corrected in de tria (L57926) and bon (L57931) but, as for the first test, those errors came from errors in my xml file. Aitalvivem (talk) 13:15, 30 July 2019 (UTC)[reply]
- Looks ok. Would you change abstinente (L57942) and similar to use lexical category noun (Q1084)? For à ce point (L57944), à ce soir (L57945), à charge de (L57946), we should probably be using the ones from Wikidata:Lists/locutions/types, but we could refine this later. --- Jura 13:36, 30 July 2019 (UTC)[reply]
- @Jura: Why do you want to use noun (Q1084) instead of proper noun (Q147276) and common name (Q56753314)? It looks like we will lose a piece of information. Anyway, the nature of a lexeme is defined in the XML file, so the bot won't need any changes whether you want to use noun (Q1084) or proper noun (Q147276) and common name (Q56753314): you just have to use the one you want in the XML file. And thank you for the list of locutions, I think we will use them! Aitalvivem (talk) 15:18, 30 July 2019 (UTC)[reply]
- It's only abstinente (L57942)/common name (Q56753314) that should be changed. common name (Q56753314) is used mainly for taxa, not lexemes. noun (Q1084) is what we use for these, these can be seen as parent for proper noun (Q147276). proper noun (Q147276) used on Accous (L57943) is fine. --- Jura 15:26, 30 July 2019 (UTC)[reply]
- @Jura: Ok, we will use noun (Q1084) for common nouns and proper noun (Q147276) for proper nouns. Where can we find documentation about the items that are usually used to indicate PoS or grammatical information? It would help us to format our data with the values that are usually used. Unuaiga (talk) 15:33, 30 July 2019 (UTC)[reply]
- For French, there are some stats on Wikidata:Lists/lexemes/fr. Wikidata:Lexicographical_data/Documentation#Data_Model has some basic ones. We had some discussion on nouns on Wikidata talk:Lexicographical data. I have no opinion about what you might want to use in 'oc'. --- Jura 15:37, 30 July 2019 (UTC)[reply]
- @Jura: Ok thanks, we will look at that. Unuaiga (talk) 16:57, 30 July 2019 (UTC)[reply]
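As a small illustration of the outcome of this exchange, the choice of lexical category items could be kept in a lookup table next to the data, so that switching items never requires touching the bot itself. This is only a sketch: the two Q-ids are the ones named in this discussion, while the input labels, the alias table, and the function name are hypothetical.

```python
# Items named in the discussion above; other categories are deliberately left out.
LEXICAL_CATEGORY_ITEMS = {
    "noun": "Q1084",
    "proper noun": "Q147276",
}

# Hypothetical input labels that should be rewritten before the lookup,
# so that common nouns end up on noun (Q1084) rather than common name (Q56753314).
CATEGORY_ALIASES = {
    "common noun": "noun",
    "common name": "noun",
}


def resolve_category(label):
    """Map a category label from the input file to a Wikidata item id."""
    normalized = label.strip().lower()
    normalized = CATEGORY_ALIASES.get(normalized, normalized)
    try:
        return LEXICAL_CATEGORY_ITEMS[normalized]
    except KeyError:
        # Fail loudly rather than guessing an item for an unknown category.
        raise ValueError(f"no lexical category item configured for {label!r}") from None
```

With this table, `resolve_category("Common noun")` gives the item for noun, and an unlisted label raises an error instead of silently creating a lexeme with the wrong category.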
- Looks good to me too. One issue though which I think was because you hand-corrected something: de tria (L57926) has the lexeme with language 'oc', but the form with language 'fr'! ArthurPSmith (talk) 13:47, 30 July 2019 (UTC)[reply]
- By the way I think this will be the first bot approved to add lexemes in Wikidata - congratulations!! ArthurPSmith (talk) 13:49, 30 July 2019 (UTC)[reply]
- @ArthurPSmith: Oops, yes, I didn't see the error on the form of de tria (L57926). I am going to correct it! Thank you! Aitalvivem (talk) 15:04, 30 July 2019 (UTC)[reply]
- Concerning the forms of the verbs, is it planned to add all the forms or only the indicative forms (for French and Occitan)? If only the indicative, it is a pity not to take the opportunity to add all the verb forms by bot, but it is not a big deal (it could be done later). Pamputt (talk) 17:09, 30 July 2019 (UTC)[reply]
- @Pamputt: We plan to add all of the forms of the verbs. We didn't do this in the test dataset because it would have taken too long to add a complete verb to our test file by hand. But in our real data we have many hundreds of conjugated Occitan verbs, and maybe some conjugated French verbs. Unuaiga (talk) 17:58, 30 July 2019 (UTC)[reply]
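A mismatch like the one spotted on de tria (L57926) could be caught before upload with a simple consistency check on the input. The sketch below is hypothetical: the dictionary layout and the function name are assumptions for illustration and do not reflect the bot's actual data structures.

```python
def find_language_mismatches(lexemes):
    """Return (lemma, form, lexeme language, form language) for every form
    whose language differs from the language of its lexeme."""
    mismatches = []
    for lexeme in lexemes:
        for form in lexeme["forms"]:
            if form["language"] != lexeme["language"]:
                mismatches.append(
                    (
                        lexeme["lemma"],
                        form["representation"],
                        lexeme["language"],
                        form["language"],
                    )
                )
    return mismatches


# Made-up sample mirroring the reported case: an 'oc' lexeme with an 'fr' form.
SAMPLE = [
    {
        "lemma": "de tria",
        "language": "oc",
        "forms": [{"representation": "de tria", "language": "fr"}],
    },
    {
        "lemma": "bon",
        "language": "oc",
        "forms": [{"representation": "bon", "language": "oc"}],
    },
]
```

Calling `find_language_mismatches(SAMPLE)` reports only the de tria entry, so the file can be corrected before the bot runs.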