Wikidata:Mastodon tags with QIDs

This is a proposal to add QIDs to Mastodon tags (also known as hashtags).

Use cases / Advantages / The why?

Adding QIDs to tags makes it possible to use a controlled vocabulary for tags.

It allows tags to be "normalized" to meanings, which helps both to join synonymous tags and to distinguish homonymous ones.

This also makes it possible to follow a single tag across different languages.

It may also eventually allow a followed tag to include its more specific sub-hashtags, if one so wants (say, seeing #SiberianHusky posts when following #dog). That's probably out of scope for this initial proposal.

Limitations / Disadvantages

  • Danger of centralization both on the side of ActivityPub client/server and controlled vocabulary: The Fediverse is currently a decentralized network in which you can choose between different instances and services to publish your things. Mastodon is only one of them; there are many other services like Pleroma, Misskey or Bonfire, and also instances for purposes other than Mastodon's (Pixelfed for image sharing, Peertube, Bookwyrm etc.). Enabling such a feature just for Mastodon could further the centralization/consolidation of types of ActivityPub servers in the Fediverse. Also, only supporting Wikidata for tagging with controlled vocabularies would centralize things on the side of the controlled vocabularies to choose from. Thus, the approach should rather be protocol-driven to enable usage in all ActivityPub-conformant servers and clients and should also enable integration of controlled vocabularies that are not Wikidata.
    Wikidata itself is federated, so it should not be difficult to recognize other "semantic" tags drawn from different vocabularies. It may be wise not to assume that all semantic items have names starting with "Q", though.
  • Applications and user interfaces not supporting QIDs should always have a fallback to something that doesn't break existing interfaces, hashtags preferably.
  • It is not the intent to replace the hashtag system, but rather to allow certain hashtags to have a QID in order to connect them to each other. Speaking hashtags that don't have QIDs such as #IToldYouSo or #HeIsDeadJim would and should still be possible and not be affected by this proposal, the same for new hashtags created for stuff that doesn't have a QID yet. The QID extension of Mastodon tags should not interfere with current use cases.
    Not sure if a transfer of knowledge can be applied here, but it seems like most word processing tools are nowadays very capable of handling text with Unicode characters unknown to them. Does this make sense?
  • The naive user experience will probably be that this is a "multilingual" hashtag, in so far as clicking it will result in a search which will pull up posts in many languages (incidentally also with many different, hopefully synonymous, names for the tag). For monolingual or bilingual users, search results that include "unreadable" posts may not be perceived as desirable.
  • The UX for adding QID annotations may be perceived as adding unnecessary clutter or complexity.
  • Multiple QIDs may map to the same hashtag. Similar and identical hashtags with similar or distinct meaning are nothing new and part of the versatility of hashtags. #openaccess can refer to a publishing model and to rules for how ISPs may use copper or FTTH infrastructure. In the long run, a consolidation happens by popularity or separation. Tying a particular hashtag string to a QID may be resisted by those who see the ambiguity as desirable.
  • QIDs may not be specific enough, at least without a more robust means of combining concepts. For example, #BrooklineTownMeeting does not exist as a QID, although "Brookline, MA" is Q455752 and "Town Meeting" is Q49142. Humorously specific hashtags such as #RainbowUnicornHorseRambo are part of the culture of social media, and in addition to being engaging also provide a useful mechanism to disambiguate posts from a particular transient community (e.g., #SMWCon2022).
  • Some QIDs may be too specific to allow easy use by an unskilled user, and a search for (eg) the QID for "dog" (Q144) may not turn up tweets tagged with the QID for "greyhound" Q38571 (which is a "dog breed" Q39367 not a "dog") -- although a clever search engine might mitigate this.
    Especially regarding that "a clever search engine" part: I see this as an advantage. Under current hashtag customs, more precise niche tags will be missed in a sea of more generic tags. With this system, there is now really an incentive for people to use Q38571 instead of Q144 and still be found.

Possible vectors for misuse

  • If the presentation of a QID is freely chosen by the author (i.e., the user is free to use #Dog or #Dogs ... or #Cat to represent Q144), then malicious actors may abuse this to redirect the user to objectionable content (#kittens to Q695677, or #GeorgeBush to Q64943914, to echo an old Google bomb).
    • Similarly, the P2572 property could be abused to add misleading or objectionable content. E.g., vandalism on the P2572 property of Q144 could insert "#DogsSuckCatsRock" retroactively into all posts using the #Q144 hashtag.
  • Obscure QIDs could be targeted for vandalism of this sort, and/or taken over. If I can get P2572 of some arbitrary little-noticed Q item to read #cscott without a patroller noticing, then User:cscott can use that Q item for vanity hashtag purposes indefinitely without any further edits to Wikidata, and with little concern that any other poster is going to use that Q item as a hashtag.
  • The treatment of edited QIDs may open up more options for abuse: if the presentation of Q144 as #Dog is a lookup done dynamically by the client, then edits to the P2572 property of Q144 may have surprising retroactive effects on "old" posts. But on the other hand, if the presentation is fixed at the time the post is authored, editing QIDs may cause the apparent effect of the hashtag to change when the QID is edited. (I.e., vandalism changes the name of the hashtag associated with Q144 from #Dog to #Cat, a post is written, the vandalism is quickly reverted, but then anyone stumbling across the viral post which apparently has #Cat in it will (a) find that clicks take them to #Dog instead, and (b) find that the post is "invisible" to searches for #Cat, since it is indexed under Q144.)
  • If the "wikidata hashtag" for Q1 is conflated with the "ordinary" hashtag #Q1, the results may be surprising to some users if #Q1 is widely used by naive users, either in the ordinary course of events or as a result of a news event (for example, a controversial "Question 1" on the ballot, or some government policy coming into effect in "Quarter 1".)

Possible proposals

Extend the data model

The current data model for tags is here:

This is based on ActivityPub tags. (As per the suggestion above, the data model extension should be considered to be at the ActivityPub level, so as to be appropriate for all services in the fediverse, not to the Mastodon-specific data model.)

The suggestion is to extend the data model with

  • an optional field id (or @id) to specify the URI of a tag to identify its semantics. This can be used to store the Wikidata URI (http://www.wikidata.org/entity/Q???).
    • Some care should be taken to honor the semantics implied by JSON-LD and use an appropriate type property if possible. As per JSON-LD Best Practices (see best practice 6), there is some inherent type confusion: is the tag #BarackObama a Person or a Tag or a schema:Dataset? ActivityPub allows using a Person as a tag (see the example in the spec), but it's probably fair to say that Mastodon's tag spec treats the Tag object as something separate from the Person; the Tag contains usage history in its properties, for example. As such, it may be best to use a field name other than id, with its special semantics, and clarify that the JSON-LD link is to a concept described by the Tag, and is not the Tag itself.

The data model may further be extended with

Note that as per the discussion in the ActivityPub specification, the information in the data model is completely independent from the particular microsyntax for tags used in the content, name, and summary properties of a post.
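As an illustration of the extension suggested above, here is a minimal sketch of a tag object carrying an optional link to a Wikidata entity. The `type`/`name`/`href` layout is assumed to resemble the Hashtag objects seen in ActivityPub posts; the field name `conceptUri`, the helper `make_tag`, and the domain `example.social` are hypothetical and not part of any specification.

```python
import json

# Hypothetical field name; the proposal leaves the choice open and only
# suggests avoiding the special JSON-LD "id" keyword.
CONCEPT_FIELD = "conceptUri"


def make_tag(name, domain, qid=None):
    """Build a Hashtag object, optionally linked to a Wikidata entity."""
    tag = {
        "type": "Hashtag",
        "name": f"#{name}",
        "href": f"https://{domain}/tags/{name}",
    }
    if qid is not None:
        # The URI identifies the concept the tag describes, not the tag itself.
        tag[CONCEPT_FIELD] = f"http://www.wikidata.org/entity/{qid}"
    return tag


# A tag backed by a QID, and an ordinary hashtag that stays unchanged.
print(json.dumps(make_tag("DouglasAdams", "example.social", "Q42"), indent=2))
print(json.dumps(make_tag("IToldYouSo", "example.social"), indent=2))
```

Because the extra field is optional, a consumer that does not know it can ignore it and still read `name` and `href` as before.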

UX for adding a label

Whenever a hashtag is being entered, there should be a drop-down that allows selecting an entity from Wikidata. This could be considered obtrusive if not implemented well, and the prefix search necessary to populate the drop-down could be costly in terms of load on Wikidata servers. It's possible a unique typographical marker could/should be used to indicate unambiguously that the author intends to add a Wikidata-based tag, for example ## or #Q .

The "name" on the tag would remain the usual, say #DouglasAdams, but the "url" will be stored as https://[domain]/tags/Q42. (Note: this is exactly the URL used for #Q42, and which returns the indexed results for the tag named #Q42 from the server at domain. I think we'd want to either use a unique domain to indicate QID tags, or replace the /tags/ part of the URL with an alternate API path, say /qid/Q42. Alternatively, we may also want to add a new property to the tag object, linked via a JSON-LD context to the appropriate QID, but for interoperability the url field should still point to something which will return the search results for the given tag when clicked.)

You may want to visually distinguish tags backed by Q-items: perhaps by an alternate color or other styling, perhaps by a typographical marker such as a double-hash, ##DouglasAdams, or a prefix #Q Douglas Adams.

Possible design decision:

  • 1) The author of the post can use what name they want for the hashtag (see "vectors for misuse" above)
  • 2) The hashtag gets set when selecting the QID to the label of the QID in Wikidata (Alternatively: Property:P2572 exists and could be used to set the label)
  • 3) The QIDs can be completely 'hidden' and set by a mechanism not backed by a microsyntax in the post content. That is, the Wikidata-backed tags don't "appear" as an explicit #tag but instead are set and displayed separately, outside the content box. (In a variant, they could be added by writing ##Tag in the content area, as a shortcut, but once added "disappear" and are added to the "wikidata tag" list outside the box.) See https://social.wxcafe.net/@wxcafe/109359919813425386 for a bit more discussion about "out of band" tags ("Tumblr style") and spaces.

Problems with 1: one could write one text, but link to a completely different QID. This can be done for humor, but it may also be misleading. There is historical precedent for this. The French Encyclopédie used links in order to hide criticism that would otherwise not have passed state censorship. The article on sacramental bread was linked with cannibalism and vice versa. Writing a text linked to a completely different QID could be considered permissible in order to allow for subtle remarks, humor, criticism and satire; but see "vectors for misuse" above.

Problem with 2: labels have spaces, and hashtags don't support spaces. Property:P2572 may be used to mitigate this, but perhaps there will not be good consensus on what the single hashtag for a QID "should" be. In either case, drawing from Wikidata might invite vandalism (see "vectors for misuse" above).
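Where no P2572 value is available, a client could derive a space-free hashtag from the label itself. The following is only a sketch of one possible convention (CamelCase joining); the function name `label_to_hashtag` is hypothetical, and nothing in this proposal settles on this rule.

```python
import re


def label_to_hashtag(label):
    """Join the words of a label into a single CamelCase hashtag."""
    # \w+ is Unicode-aware for str patterns, so accented letters are kept.
    words = re.findall(r"\w+", label)
    return "#" + "".join(word[0].upper() + word[1:] for word in words)


print(label_to_hashtag("Douglas Adams"))  # prints "#DouglasAdams"
print(label_to_hashtag("town meeting"))   # prints "#TownMeeting"
```

Two different labels can collapse to the same string under such a rule, which is one more reason the index should rely on the QID rather than on the derived name.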

If selecting strictly from Wikidata, this might invite vandalism to Wikidata. Changes to labels may not get reflected on older posts, so it would get out of date anyway. (See discussion about label presentation under "vectors for misuse".)

Alternative 3 is perhaps more compatible with wikidata-unaware clients, who will ignore this content entirely, and because the tags are not considered "content" it is perhaps more acceptable if their names and presentation change after the post is made (for example, because the P2572 property on the QID was edited).

UX for displaying a label

Wikidata-unaware clients will display the "name" of the tag as the hashtag. It may be missing the styling indicating it is a "wikidata-enabled" hashtag. Clicking on the tag will probably display an index by the "name" on the local server.

Wikidata-aware clients may display the "name" but may also elect to annotate it with additional information from Wikidata (icon image, short description as title text, etc) as well as style it to indicate it is "wikidata-enabled". Clicking on the tag /may/ produce an index by QID, but might have to account for a wikidata-unaware server instance which may not support this search, in which case it /should/ fall back to an index by "name". (This depends somewhat on the implementation decision discussed below under "indexing on an instance" -- if the decision is to index Q1 as #Q1 instead of maintaining a separate index, it may not be possible for the client to determine whether the server supports "indexing by QID" and the results presented for "#Q1" may not actually contain any results from wikidata tags.)
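The fallback behaviour described above could be sketched as follows. This assumes the tag carries a hypothetical `conceptUri` field holding the Wikidata entity URI, that QID indexes live under the `/qid/` path floated earlier on this page, and that the client already knows (by some unspecified means, here the flag `server_indexes_qids`) whether the server supports QID indexing.

```python
WIKIDATA_ENTITY = "http://www.wikidata.org/entity/"


def click_target(tag, domain, server_indexes_qids):
    """Pick the index URL a Wikidata-aware client opens for a clicked tag."""
    concept = tag.get("conceptUri", "")  # hypothetical field name
    if server_indexes_qids and concept.startswith(WIKIDATA_ENTITY):
        qid = concept[len(WIKIDATA_ENTITY):]
        return f"https://{domain}/qid/{qid}"
    # Fall back to the ordinary index by name.
    return f"https://{domain}/tags/{tag['name'].lstrip('#')}"


tag = {"name": "#DouglasAdams", "conceptUri": WIKIDATA_ENTITY + "Q42"}
print(click_target(tag, "example.social", True))
print(click_target(tag, "example.social", False))
```

How a client would actually detect server support is left open here, for the reason given in the paragraph above.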

Indexing on an instance

Wikidata-unaware clients will index on the name as usual.

Wikidata-aware clients should index on the QID, not on the name. This can be done either by substituting #Q42 as the name of the tag, or by maintaining a separate index by QID (to avoid namespace conflicts with hashtags beginning with Q).
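The second option, a separate index by QID, might look roughly like the sketch below. The class `TagIndex` and the `qid` key on a tag are hypothetical; the point is only that an ordinary hashtag named #Q42 and the entity Q42 end up in different indexes.

```python
from collections import defaultdict


class TagIndex:
    """Keeps ordinary hashtags and QID-backed tags in separate indexes."""

    def __init__(self):
        self.by_name = defaultdict(set)
        self.by_qid = defaultdict(set)

    def add(self, post_id, tags):
        for tag in tags:
            qid = tag.get("qid")  # hypothetical key holding e.g. "Q42"
            if qid:
                # QID-backed tags are indexed by meaning, not by display name.
                self.by_qid[qid].add(post_id)
            else:
                self.by_name[tag["name"].lower()].add(post_id)


index = TagIndex()
index.add("post-1", [{"name": "DouglasAdams", "qid": "Q42"}])
index.add("post-2", [{"name": "Q42"}])  # ordinary hashtag, no QID attached
print(dict(index.by_qid), dict(index.by_name))
```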

UX only change, use raw QIDs as hashtags

Not necessarily incompatible with adding URL support to the data model. Although both approaches could be pursued in parallel, it would be best if the results were ultimately consistent.

In this approach, hashtags in the form of #Q42 are not represented by any data model change (or not /necessarily/ represented by a data model change) and the focus is on special /client side/ support for hashtags of this form.

UX could be the same as above: a drop-down suggests auto-completions based on Wikidata entries, and inserts #Qnnn if they are selected. All #Qnnn tags are *rendered* using localized names taken from P2572 of the QID, both during post composition and when browsing a feed. Some additional styling might be added to further distinguish these.

Indexing will be done on the name of the hashtag as usual, which just happens to be a QID. The search results page will probably want the same UX treatment as the feed browser, so that a search for #Q42 shows up as a search for #DouglasAdams (or ##DouglasAdams, if that's the styling preferred).
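A client-side renderer for this approach could be sketched as below. The mapping `hashtag_for_qid` stands in for names a real client would fetch from Wikidata, and the double-hash styling is just one of the options mentioned on this page; both are assumptions for illustration.

```python
import re

# A hashtag that looks like a QID: "#Q" followed by a positive integer.
QID_TAG = re.compile(r"#(Q[1-9][0-9]*)\b")


def render(content, hashtag_for_qid):
    """Replace raw #Qnnn hashtags with a display name where one is known."""
    def replace(match):
        name = hashtag_for_qid.get(match.group(1))
        # Unknown QIDs are left as typed, so nothing breaks on a miss.
        return f"##{name}" if name else match.group(0)

    return QID_TAG.sub(replace, content)


names = {"Q42": "DouglasAdams"}  # stand-in for a Wikidata lookup
print(render("Rereading #Q42 this week, also voting on #Q1", names))
```

The stored post still contains #Q42, so servers and clients that know nothing about this convention index and search it like any other hashtag.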

Advantages

No change to the data model or to servers. Opt-in support in clients.

Not incompatible with server-side support: the presence of a URL field in the tag can be used as additional evidence that the #Q tag should be rendered as a wikidata reference, not as an 'ordinary' hashtag. From the user perspective then, server side support /when present/ adds the ability for a "custom hashtag name" for QIDs -- although indexing might then be affected and malicious uses enabled; see discussion above.

Hashtag searches work on all clients, even those which don't "know" about QIDs.

Disadvantages

There might be valid uses of the hashtag #Q1 #Q2 etc., e.g. for quarters, or questions on a ballot during elections etc, and these are hard to disambiguate without data model support.

Users on instances that do not understand this convention would see raw QIDs, which should be avoided (however, the posts are always indexed and searched correctly).

Search

A search for a specific QID may logically want to be expanded to also include QIDs which are instances of the original QID or subclasses of the original QID. For example, searching for dog (Q144) may want to include a result which is tagged greyhound (Q38571) but that requires following the chain Q38571 greyhound instanceof dog breed (Q39367) subclass of dog (Q144).

This broadening can either be done on the client or server side, and both can be done whether #Qxxx tags are located "opportunistically" or unambiguously in the data model. The server side approach is much less resource-intensive, but the client side approach does not require wikidata-aware servers.

Client side

Query the server for the results for Q144 but then expand the results with those returned by a query for Q39367 and subsequently for Q38571 etc. This is potentially costly, but could be done as multiple incremental searches and does not require any special support from the server.
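A rough sketch of this incremental expansion follows. The `CHILDREN` table hard-codes the example chain from this page (a real client would obtain it from Wikidata), and `fetch` is a hypothetical callable standing in for one tag query against the server.

```python
from collections import deque

# Example chain from this page: dog <- dog breed <- greyhound.
CHILDREN = {"Q144": ["Q39367"], "Q39367": ["Q38571"]}


def expanded_search(qid, children, fetch):
    """Yield posts for qid, then for each narrower QID, without duplicates."""
    seen_posts = set()
    visited = {qid}
    queue = deque([qid])
    while queue:
        current = queue.popleft()
        for post in fetch(current):  # one server query per QID
            if post not in seen_posts:
                seen_posts.add(post)
                yield post
        for child in children.get(current, []):
            if child not in visited:
                visited.add(child)
                queue.append(child)


POSTS = {"Q144": ["post-1"], "Q38571": ["post-2", "post-1"]}
print(list(expanded_search("Q144", CHILDREN, lambda q: POSTS.get(q, []))))
```

Because the function is a generator, results for the original QID can be shown before the narrower queries have been sent.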

Server side

When processing a post containing the hashtag #Q38571 (either in the datamodel or opportunistically), also add this post to the index under #Q39367 and #Q144 etc.
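The server-side variant could be sketched like this. The `PARENTS` table again hard-codes the example chain from this page, and `index_post` is a hypothetical helper; a real server would need to obtain and refresh the hierarchy from Wikidata.

```python
from collections import defaultdict

# Example chain from this page: greyhound -> dog breed -> dog.
PARENTS = {"Q38571": ["Q39367"], "Q39367": ["Q144"]}


def index_post(index, post_id, qid, parents):
    """File the post under its QID and under every broader QID."""
    pending = [qid]
    done = set()
    while pending:
        current = pending.pop()
        if current in done:
            continue  # guards against cycles in the hierarchy data
        done.add(current)
        index[current].add(post_id)
        pending.extend(parents.get(current, []))


index = defaultdict(set)
index_post(index, "post-2", "Q38571", PARENTS)
print(sorted(index))
```

The work is done once at indexing time, so a later search for Q144 is a single lookup.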