Wikidata:Requests for permissions/Bot/Gabrabot
- The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
- Approved --Lymantria (talk) 16:59, 27 March 2023 (UTC)[reply]
Gabrabot (talk • contribs • new items • new lexemes • SUL • Block log • User rights log • User rights • xtools)
Operator: Gabrabot (talk • contribs • logs) / Mtanti (talk • contribs • logs)
Task/s: Bulk upload Maltese lexeme data to Wikidata from Gabra.
Code: (in progress)
Function details: --Gabrabot (talk) 21:57, 14 February 2023 (UTC)[reply]
This bot will be used to seed Wikidata lexemes with Maltese words from the Maltese online dictionary Gabra. This dictionary contains approximately 20 000 lexemes, each with many wordforms, which were initially generated automatically and later fixed manually. In collaboration with Wiki Community Malta, a project was set up to populate Wikidata lexemes with this data. The plan is to upload only a few lexemes at a time to allow volunteers to enrich and correct these lexemes.
The bot will be uploading the following data for each lexeme:
- lemma
- part of speech tag
- gabra reference id
- root of lexeme
- glosses of lexeme as senses
- wordforms of lexeme as forms together with grammatical features of each form (there can be hundreds of different forms)
Changes:
- The glosses part has been removed to avoid possible copyright issues.
- The wordforms have been limited to just 16, which are those that are seen in the first page when opening a lexeme (in the case of verbs, only forms without an object and with a positive polarity are included).
- If a lemma with a given Gabra ID already exists, then it will be skipped completely to avoid causing disruptive changes. The current Maltese lexemes in Wikidata have been checked to make sure they all have a Gabra ID included.
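The selection rules in the list above can be sketched as follows. This is only an illustrative sketch, not the bot's actual code (which had not been published at the time of the request): the class names, the `features` keys (`object`, `polarity`) and the function names are all hypothetical, and truncating to the first 16 forms is an assumption standing in for "the forms seen on the first page" in Gabra.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

MAX_FORMS = 16  # limit stated in the request


@dataclass
class WordForm:
    surface: str
    # Hypothetical feature keys; Gabra's real field names may differ.
    features: Dict[str, str] = field(default_factory=dict)


@dataclass
class GabraLexeme:
    gabra_id: str
    lemma: str
    pos: str
    root: Optional[str] = None
    forms: List[WordForm] = field(default_factory=list)


def select_forms(lexeme: GabraLexeme) -> List[WordForm]:
    """Keep at most MAX_FORMS forms; for verbs, drop forms that carry
    an object or a non-positive polarity before truncating."""
    forms = lexeme.forms
    if lexeme.pos == "VERB":
        forms = [
            f for f in forms
            if "object" not in f.features
            and f.features.get("polarity", "pos") == "pos"
        ]
    return forms[:MAX_FORMS]


def plan_upload(lexemes: List[GabraLexeme], existing_ids: Set[str]) -> List[dict]:
    """Build upload payloads, skipping any lexeme whose Gabra ID is
    already present on Wikidata. Glosses are deliberately left out."""
    plan = []
    for lex in lexemes:
        if lex.gabra_id in existing_ids:
            continue  # already on Wikidata: skip completely
        plan.append({
            "gabra_id": lex.gabra_id,
            "lemma": lex.lemma,
            "pos": lex.pos,
            "root": lex.root,
            "forms": [(f.surface, f.features) for f in select_forms(lex)],
        })
    return plan
```

Here `existing_ids` would be the set of Gabra IDs already attached to Maltese lexemes on Wikidata; how that set is collected is outside this sketch.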
Comments
The license of the Gabra project is listed as creative commons. Don't we require public domain licensing? BrokenSegue (talk) 22:33, 14 February 2023 (UTC)[reply]
- My understanding is that we require public domain licensing when no attribution is possible or desired. In this case, each lexeme imported from il-Ġabra will evidently be attributed to the Ġabra reference ID (P5928), which clearly satisfies the CC-by attribution requirement. In any case, you cannot copyright a word or lexeme, which makes individual words and lexemes de facto public domain. That's standard and basic copyright law. Anything beyond individual words / lexemes brought to Wikidata from Ġabra will be referenced through P5928. -- ToniSant (talk) 13:45, 15 February 2023 (UTC)[reply]
- Source on your claim about public domain licensing on Wikidata? Wikidata:Licensing says "All structured data in the main, property and lexeme namespaces is made available under the Creative Commons CC0 License". CC0 is not compatible with Gabra's license. Your second point amounts to "this information is not copyrightable" but if that's the case why does the database list a copyright at all? BrokenSegue (talk) 16:24, 15 February 2023 (UTC)[reply]
- Fair point. You are right that there are two points here. Thank you very much for following this through. Your line of discussion is clearly rooted in the way Wikidata:Licensing is presented. As an active Maltese-language Wikimedian, I can certainly see the great benefits that can be derived from the use of this Ġabra database to create Maltese lexemes on Wikidata. This is my primary motivation to engage in this discussion.
- The first point in my earlier comment related to my impression of how we (i.e. Wikidata) operate, which is why I said what I understood rather than quoted a source. I'm happy to retract this point. Still, the key point there is that the database that this bot will work on is CC-by because they wanted attribution not because they're reserving any other rights. This is why I mentioned addressing this through P5928. In terms of attribution, it may also be practical to add a reference to the source URL for the sense statement, thus satisfying the attribution license. My other point about how it is indeed not possible to copyright words, is very evident from reading any version of copyright legislation endorsed by WIPO. Although certain licenses are claimed it doesn't mean that they're actually valid or enforceable. Moreover, it can be argued that the CC-by license is for the database rights rather than individual items in the dataset, especially since ND is not reserved as a right. To my understanding (so this is just an opinion, even if a considered one) the license on this database is a simple CC-by because the creators wanted the project to be credited and not because they didn't want aspects of it to be adopted by Wikidata. Then again, I recognize that this may be seen as a convoluted argument, even with all the good faith in the world.
- Interestingly, the requested/required attribution is for a project that appears to no longer be active.[1] Perhaps @Mtanti can shed some more light on this.
- I make this new point because the bot could ensure that the necessary attribution is made through P5928 and/or a reference URL on the relevant imported statements, mostly for Senses, but also for Forms, if necessary. Ultimately, if this is a sticking point, there may also be the option for this bot's operator to reach out to the CC-by license holder to consider switching to CC-0 or PD. It's likely that this may indeed be possible in a similar way to how some GLAM institutions have opted for a CC-by-SA license on Commons for images they previously held under an all-rights-reserved type of copyright, after outreach work by Wikimedians. -- ToniSant (talk) 18:25, 15 February 2023 (UTC)[reply]
- The problem with your argument that we are providing attribution and so are free to use the data is that people who reuse the data from Wikidata believe our content to be licensed under CC0. This means they are free to redistribute our content without attribution. But that would not be the case. If you want to argue the content isn't copyrightable I fear I am not a lawyer and cannot comment on that. If we were doing a very limited import then I wouldn't worry but this is suggesting an import of most of the catalog. Is there a process by which we can ask the WMF to answer this? I do fear that the solution here is going to be to reach out to the creators and ask them to re-license the work. BrokenSegue (talk) 04:29, 16 February 2023 (UTC)[reply]
- All the Wiki lexemes with Maltese words I've seen cite Ġabra as their source. Are you saying that these are violating copyright?
- Also, note that we do not intend to copy all the content from Ġabra. We are only going to copy a subset of lexemes and only a subset of the information from each lexeme, probably only a small subset of the word forms. Mtanti (talk) 20:17, 15 February 2023 (UTC)[reply]
- They may very well be violating copyright. I am not a lawyer. I'm merely pointing out that you are planning to copy large amounts of data from a CC-BY source to a CC0 repository, which is incompatible. Questions like these are why we have a bot approval process. BrokenSegue (talk) 04:31, 16 February 2023 (UTC)[reply]
- I was pointed towards meta:Wikilegal/Lexicographical_Data which has some more context. I'm not clear if things like "part of speech tag" or "root of lexeme" are copyrightable though. BrokenSegue (talk) 05:30, 16 February 2023 (UTC)[reply]
- Thanks @BrokenSegue - this is very helpful. I'm not a lawyer either, but I'd say that getting a CC0 license on the source will avoid any ambiguity and potential issues later on, even if it's fairly clear that individual words are definitely not copyrightable. It's trademarks that can be applied to individual words but that's a very different scenario than the one we're discussing here. I'm wondering whether the idea of "doing a very limited import" (as you aptly put it) would help @Mtanti make the case for CC0 to the creator. -- ToniSant (talk) 08:36, 16 February 2023 (UTC)[reply]
- I believe that the only copyrightable part of the data is the glosses. If I leave out the glosses and just copy over everything else I mentioned, would that be acceptable? Mtanti (talk) 19:44, 16 February 2023 (UTC)[reply]
- sorry I don't know what a "glosses" is. I'm a real newb here. BrokenSegue (talk) 01:37, 17 February 2023 (UTC)[reply]
- The glosses are the dictionary definitions. In Gabra they're just very short English translations of the Maltese words, like "to steal". Mtanti (talk) 07:09, 17 February 2023 (UTC)[reply]
- yeah those definitely seem copyrightable. BrokenSegue (talk) 18:48, 18 February 2023 (UTC)[reply]
- But the rest aren't. Can I upload those without the glosses? 193.188.47.91 19:34, 18 February 2023 (UTC)[reply]
- Neutral I'll say that I won't oppose this request for permission. I'd support if we got the original source to explicitly indicate we are in compliance with the copyright. I don't have the ability to approve this request anyways as I'm not a bcrat. BrokenSegue (talk) 20:40, 18 February 2023 (UTC)[reply]
- Here is my understanding of what is copyrightable and so must not be added on Wikidata: the definition (Wikidata sense), the etymology (I guess it corresponds to "root of lexeme"). All the rest of what you have listed (part of speech, form, etc.) is ok to be added here.
- Before approving the bot status, could you run your bot for a few lexemes (fewer than 10) so that we can check with real examples? Pamputt (talk) 17:33, 20 February 2023 (UTC)[reply]
- @Mtanti, Gabrabot, ToniSant: globally it looks good.
- For me licence is not really an issue (it was discussed multiple times and we even have meta:Wikilegal/Lexicographical_Data).
- Not adding the glosses is a bit problematic, as we usually require senses on Lexemes. Definitions are copyrightable but glosses are not (by definition, a gloss should be too short to be copyrightable). That said, I guess that if we have a way to identify the lexemes (with Ġabra lexeme ID (P5928) for instance, do you plan to add it?) it could be ok to do the import without the gloss, but it's a bit of a shame.
- I'm curious: if I understand correctly, the root is the stem/radical here (as in Semitic languages), not the etymology (per Pamputt's comment). Does it go in word stem (P5187)?
- I'm also wondering how the bot will deal with existing lexemes or with complex cases (two lexemes with the same lemma, for instance)?
- Publishing the code is important too; we don't have many bots for Lexemes and every piece of code is precious now.
- Lastly, +1 with Pamputt, could you run a small import? Like 10 lexemes, to see exactly what it will look like.
- Cheers, VIGNERON (talk) 17:45, 20 February 2023 (UTC)[reply]
- The root is the radicals, yes. There's a wiki item called 'root'.
- The bot will be focusing on the Ġabra ID in order to avoid duplicates. At the moment there are some 100 lexemes in Maltese on wiki lexemes. We can check them all manually to make sure that they have an ID set in order to avoid duplicates.
- I will be publishing the code on github, don't worry. Just need to polish the code first.
- Yes I will be running it on a few lexemes as soon as I can. Mtanti (talk) 18:47, 20 February 2023 (UTC)[reply]
- Lexemes are being uploaded as we speak. Mtanti (talk) 20:07, 23 February 2023 (UTC)[reply]
- Yes I'll run it on a few lexemes and let you know. The root refers to the radicals of the word, not the etymology. Mtanti (talk) 18:43, 20 February 2023 (UTC)[reply]
- @Mtanti: for kemm (L1036706) and kief (L1036702) it seems good (caveat: I don't know Maltese).
For the root lexemes like (L1036705) or k-j-f (L1036701) it's a bit more strange: I understand they are needed for the other lexemes, but almost empty lexemes still don't seem ideal (same caveat). Could someone who has knowledge of Semitic root (Q266273) have a look and tell us what they think? After some thinking and talking (including on Telegram), no more objections. Cheers, VIGNERON (talk) 18:47, 26 February 2023 (UTC)[reply]
- Update: Support this request now (not everything is perfect of course, but clearly good enough). Cheers, VIGNERON (talk) 10:12, 27 February 2023 (UTC)[reply]
- I will approve the request in a couple of days, provided that no objections will be raised. --Lymantria (talk) 17:46, 9 March 2023 (UTC)[reply]