Topic "magic summary" on User talk:Pyfisch

MisterSynergy (talkcontribs)

Hey Pyfisch, in User:Pyfisch/Counter-Vandalism you refer to the "magic summary", which Wikibase adds to (almost) all ns0 edits and which contains valuable information about the edit. It is clear what you mean by this term (the part between /* and */), but how do you split it up into the action and parameter(s) such as language codes or wiki projects? Did you find documentation for this feature somewhere, or did you just gather the parameters heuristically?

Cescolino89 (talkcontribs)

Hello, I don't know why you corrected that data of mine; any person can commit theft, and that human being has been one of the greatest in Spanish refereeing. I ask you to modify your correction.

Pyfisch (talkcontribs)

Hello MisterSynergy, … describes the magic summaries in Wikidata. I used a regular expression in Python to parse the summaries: "/\* ([\w-]+):(?:(\d+)(?:\|([\w-]+))?)?.*? \*/.*".
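
For illustration, a minimal Python sketch of how that regular expression could be applied; the example summary and the helper name are made up, only the regex itself is taken from the reply above:

    import re

    # Regex from the reply above: captures the action name, an optional numeric
    # parameter, and an optional language/site code from the "magic summary".
    MAGIC_SUMMARY = re.compile(r"/\* ([\w-]+):(?:(\d+)(?:\|([\w-]+))?)?.*? \*/.*")

    def parse_magic_summary(summary):
        """Return (action, number, code), or None if no magic summary is present."""
        match = MAGIC_SUMMARY.match(summary)
        return match.groups() if match else None

    # Hypothetical edit summary of the kind Wikibase generates for a label change:
    print(parse_magic_summary("/* wbsetlabel-set:1|de */ Beispiel"))
    # -> ('wbsetlabel-set', '1', 'de')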

Hello Cescolino89, this is a summary of your modifications: you entered that Juan Martínez Munuera is a thief and not a football referee, that he is not a man but a thief, and that he is a film and not a man.

MisterSynergy (talkcontribs)

Thanks, this is what I was looking for, although unfortunately it does not seem to be as complete as one would wish.

MisterSynergy (talkcontribs)

FYI: I am currently using https://public.paws.wmcloud.org/User:MisterSynergy/misc/2020%2010%20unpatrolled%20changes/unpatrolled%20changes%20dashboard.ipynb to filter unpatrolled recent changes, in order to get an idea of the situation and to see what I can patrol. It was originally meant for myself only and not for presentation to others, which is why I did not present it in the 24hr-meeting. The Python code has become a bit lengthy by now, but it is not much more than some SQL querying (~2-5 min), pandas acrobatics, and graph decoration (both on the fly).
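
Not the actual notebook, but a minimal sketch of the SQL-querying part, assuming it runs on PAWS against the wiki replicas; the replica host name and the row limit are assumptions:

    import pandas as pd
    import pymysql

    # Assumed replica host/database as available from PAWS; credentials from ~/.my.cnf.
    conn = pymysql.connect(
        host="wikidatawiki.analytics.db.svc.wikimedia.cloud",
        database="wikidatawiki_p",
        read_default_file="~/.my.cnf",
        charset="utf8mb4",
    )

    # Unpatrolled, non-bot edits to items (ns0) from the recentchanges table.
    query = """
        SELECT rc_title, rc_timestamp, actor_name
        FROM recentchanges
        JOIN actor ON rc_actor = actor_id
        WHERE rc_patrolled = 0 AND rc_namespace = 0 AND rc_bot = 0
        ORDER BY rc_timestamp DESC
        LIMIT 1000
    """

    unpatrolled = pd.read_sql(query, conn)  # the pandas acrobatics start here
    print(unpatrolled.head())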

Yet, it helps me to find larger sets of items which can be batch-patrolled so that these revisions do not consume the time of other patrollers. We have currently patrolled roughly 40% of all changes that require patrolling, and I still find it difficult to filter the really problematic ones out of the remaining 100k unpatrolled edits. Any idea where to look, or how to filter? I have not figured out how to filter unpatrolled changes so that the actual vandalism shows up in reasonable numbers. I do see occasional instances of vandalism, but it really isn't much.

I also think that we could offer plenty more reports for other users to engage in this field. Whether they would do so, however, … I'm not sure. :-)

Pyfisch (talkcontribs)

Very interesting evaluation. I'm glad it helps you to batch patrol some changes, so others don't have to.

I have some ideas for how to filter the remaining changes, but they are usually a bit more complicated:

  1. Look for changed statements where the value was changed, but the references weren't. This is almost never correct and should be fixed. Sometimes this is done by vandals or in test edits.
  2. There are certain facts that never change, and therefore the statements shouldn't change either. For people these are, for example, birth name, date of birth, and place of birth.
  3. In theory constraint violation reports or ShEx could be used to check and filter changes. But due to the huge number of inconsistencies found in Wikidata there are already too many problems to fix for human editors.
  4. One strategy to prioritize unpatrolled changes could be to count the number of sitelinks to Wikipedia (see the sketch after this list). Vandalism occurs most often on items for important topics or people, which will have Wikipedia articles; in addition, these items are on average more complete, meaning new editors will have less information to add.
  5. Filter changes that were made by globally locked users or users that are blocked on other Wikimedia projects. I sometimes find users that are already blocked as sockpuppets on English Wikipedia, but their bad edits remain on Wikidata.
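
A minimal sketch for point 4, assuming a requests-based script and using the standard wbgetentities API to rank items by their number of sitelinks; the item IDs and the helper name are placeholders:

    import requests

    API = "https://www.wikidata.org/w/api.php"

    def sitelink_count(qid):
        """Number of sitelinks of an item, as a rough proxy for its visibility."""
        params = {"action": "wbgetentities", "ids": qid,
                  "props": "sitelinks", "format": "json"}
        entity = requests.get(API, params=params).json()["entities"][qid]
        return len(entity.get("sitelinks", {}))

    # Placeholder item IDs, e.g. taken from unpatrolled recent changes:
    candidates = ["Q42", "Q64"]
    for qid in sorted(candidates, key=sitelink_count, reverse=True):
        print(qid, sitelink_count(qid))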

I think the data quality issues and the vandalism on Wikidata are linked. As long as Wikidata tolerates statements without sources, it will be easier to add wrong information than to remove it. This especially applies to contentious claims, but also to facts like the elevation of mountains, for which there is always some source. When no source is given for a new or changed claim, I usually either try to verify it, which takes me much longer than adding the claim took, or I don't bother patrolling the change.

MisterSynergy (talkcontribs)

Thanks for the comment. Let me try to address all of your points:

  1. No clue how to query this right now. WMDE is working on a feature that somehow records these cases (value changed, but reference not) and warns users in case this happens unintentionally, but I don't know where they store this data or how to query it. Users can also confirm that they did this intentionally.
  2. There can always be mistakes that are being corrected by IPs. In fact, I have seen quite a lot of valuable contributions by non-autoconfirmed users in the past weeks, and I think that the vast majority of unpatrolled changes were made in good faith and with relatively good knowledge of this project.
  3. ShEx is useful to compare an item to a desired data model (i.e. which properties to use, and so on), but IMO it is not helpful for detecting vandalism. Additionally, it has not really gained traction yet, and it does not perform well at project scale since you compare single items against the schema. Constraint violation (covi) reports, on the other hand, at least the ones by KrBot, suffer from KrBot's somewhat outdated and incomplete implementation that does not always comply with the official constraint definitions. I'd rather see them go away completely or be taken over by someone else, but KrBot's operator, as the inventor of the covi system, is still on it.
  4. Have a look at User:MisterSynergy/patrol/highly used items. Those are items which are used on more than 500 Wikimedia pages in some way and have unpatrolled changes. This problem will be gone soon: my admin bot User:MsynABot has been approved, and it will soon implement the RfC outcome that all items with 500 or more uses on Wikimedia pages are to be semi-protected. The code just needs some minor modifications.
  5. This is an aspect which I have not yet considered in my script. We discussed those API calls earlier, and I found the database where part of the information comes from, but unfortunately some data, such as local edit counts and local blocks, is apparently queried directly from the individual wikis' servers on the fly. In other words: I don't think we can retrieve all of that information for several users in a single query; from the central users database we can only see when the global account was created. (A per-user lookup is sketched after this list.)
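
For point 5, a minimal sketch of such a per-user lookup (one API call per user, as discussed above) using the CentralAuth globaluserinfo module; the user name is a placeholder:

    import requests

    API = "https://www.wikidata.org/w/api.php"

    def global_user_status(username):
        """Return whether an account is globally locked and on which wikis it is blocked."""
        params = {"action": "query", "meta": "globaluserinfo", "guiuser": username,
                  "guiprop": "merged", "format": "json"}
        info = requests.get(API, params=params).json()["query"]["globaluserinfo"]
        locked = "locked" in info
        blocked_on = [acc["wiki"] for acc in info.get("merged", []) if "blocked" in acc]
        return locked, blocked_on

    # Placeholder user name:
    print(global_user_status("Example"))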

Generally, I am not overly convinced by the narrative that Wikidata has a serious vandalism problem. Yes, there is vandalism, and some of it stays much longer than acceptable, but this is not a problem of a scale that should objectively threaten Wikidata's reputation. I know that in some of the larger Wikipedia projects (enwiki, dewiki, etc.) there are influential editors of the core community propagating this notion, but they usually only show a few instances of Wikidata vandalism and then claim that this is a serious problem, without much knowledge of the actual situation. If we considerably increased our counter-vandalism activities, they would either not believe it to be effective, or find some other reason why they do not like (and do not want) Wikidata.

Pyfisch (talkcontribs)

  1. It is not possible to query this right now. You'd have to get the old and the new revision via the API and compare them (see the sketch after this list). The warning that is shown to users is only temporary right now and is not stored anywhere. However, there is a Phabricator ticket to change this.
  2. Certainly. But you asked for ways to filter changes that are likely to be vandalism, and these changes are comparatively likely to be vandalism. It still needs to be verified whether they actually are.
  3. An additional challenge is that the constraints of properties are sometimes incomplete or need improvement. However, vandals commonly set values of the wrong type; for example, a vandal replaced the first name of a person with "shit".
  4. There are many items on Wikidata that don't have sitelinks at all, and I don't think they are targeted by vandals that often. However, the item for the author of an unpopular book that is read in schools may be targeted frequently.
  5. That's a pity. I don't know whether this data is important enough to query from the API regularly.
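
For point 1, a rough sketch of the comparison mentioned above: fetch the entity JSON stored in the old and the new revision and flag statements whose value changed while the references stayed exactly the same (revision IDs and function names are placeholders, not an existing tool):

    import json
    import requests

    API = "https://www.wikidata.org/w/api.php"

    def entity_at_revision(revid):
        """Entity JSON stored in a given item revision."""
        params = {"action": "query", "prop": "revisions", "revids": revid,
                  "rvprop": "content", "rvslots": "main", "format": "json"}
        pages = requests.get(API, params=params).json()["query"]["pages"]
        revision = next(iter(pages.values()))["revisions"][0]
        return json.loads(revision["slots"]["main"]["*"])

    def changed_value_same_references(old_revid, new_revid):
        """Statement IDs whose main value changed but whose references did not."""
        old, new = entity_at_revision(old_revid), entity_at_revision(new_revid)
        suspicious = []
        for prop, claims in new.get("claims", {}).items():
            old_by_id = {c["id"]: c for c in old.get("claims", {}).get(prop, [])}
            for claim in claims:
                before = old_by_id.get(claim["id"])
                if (before is not None
                        and claim.get("mainsnak") != before.get("mainsnak")
                        and claim.get("references")
                        and claim.get("references") == before.get("references")):
                    suspicious.append(claim["id"])
        return suspicious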

It's true that the vandalism problem is exaggerated by these Wikidata critics. Right now I use a filtered display of recent changes: high risk changes (remove the unpatrolled filter to see more). I patrol these changes regularly; the majority are either vandalism or test edits and need to be reverted. The issue is that I have the impression I am the only person who really patrols changes with these tags. Don't misunderstand me, a lot of vandalism in these categories is reverted by other people before I have a chance, but right now I seem to be the only person specifically patrolling changes with these tags. Before I started, this list was a lot longer and contained changes that weren't patrolled for 10, 20, 30 days. And I don't think this vandalism would usually be caught on Wikidata if there weren't a motivated person around at the time.
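
The same kind of list can also be pulled via the API instead of the Special:RecentChanges UI; a minimal sketch using list=recentchanges, where the tag name is a placeholder for whatever abuse-filter tag marks the high-risk changes (filtering by patrol status additionally requires patrol rights and a logged-in session):

    import requests

    API = "https://www.wikidata.org/w/api.php"

    params = {"action": "query", "list": "recentchanges", "rcshow": "!bot",
              "rcprop": "title|user|timestamp|comment|tags",
              "rclimit": 100, "format": "json"}
    changes = requests.get(API, params=params).json()["query"]["recentchanges"]

    # Placeholder tag name; substitute the tag actually applied by the abuse filter.
    high_risk = [c for c in changes if "possible vandalism" in c.get("tags", [])]
    for change in high_risk:
        print(change["timestamp"], change["title"], change["user"])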

MisterSynergy (talkcontribs)

Thanks, quite a helpful filter. I think I could replicate most of it in my script as well. Might be an idea to search for activity from similar IPs then.

As of now, I am trying to patrol with a per-user approach, i.e. judge whether a user edits in good faith or not, and then either patrol everything or undo whatever is vandalism. Most of the current patrol tools instead take a per-revision approach, where it is often difficult to judge an edit due to the lack of context; the per-revision approach is also relatively expensive in terms of required patrol effort.

My hope is that we might eventually get to a point where (with some more editors) all edits, or at least those with select characteristics, are patrolled by an experienced editor. "Terms in German" are, for instance, very doable, and there is no real backlog despite German being one of the most used languages here. However, I am not sure whether we can get there on a significantly larger scale, to be honest.

We also need to consider that after 30 days all edits are practically patrolled anyway, since the patrol information resides in the recentchanges table, which is just a rolling snapshot of the activity of the past 30 days. I'd rather patrol good-faith yet imperfect edits than let them obscure our view of the actually problematic edits.

Reply to ""magic summary""