User talk:ASarabadani (WMF)/Growth of databases of Wikidata


Hardware

What kind of assumptions does this make about future hardware? For example, I believe it's pretty easy nowadays to get a server with 8 TB of RAM. Nemo 19:19, 4 September 2024 (UTC)

Proposed criteria for merging revisions

I believe that consecutive edits matching one of the following criteria may be safely merged into a single revision (see the SQL sketch below):

  • The two edits come from the same run of a particular tool, irrespective of the difference between their save times.
    • The run identifier is generally discernible from the edit summary if it comes from a tool like QuickStatements or OpenRefine.
    • Other tools, such as the Author Disambiguator or SourceMD, may need different heuristics (one edit adds a statement and the following one adds a reference to that statement).
  • The two edits are termbox edits (some combination of label, description, or alias) made by the same user within 10 seconds.
    • When changing multiple fields in the Wikidata user interface termbox, for some reason each of those changes manifests as a separate edit in sequence.
  • The two edits are made by one of a number of bot accounts that often separate each change into its own edit (such a list of accounts will likely not be exhaustive, as more cases may be discovered).
    • Cewbot, KrBot, and Emijrpbot are just a few examples.
    • The time difference between such edits could be somewhat larger than that for termbox edits, such as 30 seconds or a minute.

I am sure there are other useful criteria that could be applied, but I believe these are some of the most prominent ones. Mahir256 (talk) 20:45, 6 September 2024 (UTC)
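
A rough sketch of how the termbox criterion could be detected in SQL (untested; it assumes the standard MediaWiki revision and comment tables, and it assumes Wikibase autosummaries for termbox edits start with prefixes like /* wbsetlabel, which would need verifying):

-- Candidate pairs of consecutive termbox edits by the same user within
-- 10 seconds. rev_parent_id links each revision to its predecessor, and
-- rev_timestamp is MediaWiki's binary(14) YYYYMMDDHHMMSS string.
SELECT child.rev_page,
       parent.rev_id AS first_rev,
       child.rev_id  AS second_rev
FROM revision AS child
JOIN revision AS parent ON parent.rev_id = child.rev_parent_id
JOIN comment  AS pc ON pc.comment_id = parent.rev_comment_id
JOIN comment  AS cc ON cc.comment_id = child.rev_comment_id
WHERE child.rev_actor = parent.rev_actor
  AND TIMESTAMPDIFF(SECOND,
        STR_TO_DATE(parent.rev_timestamp, '%Y%m%d%H%i%s'),
        STR_TO_DATE(child.rev_timestamp,  '%Y%m%d%H%i%s')) <= 10
  AND (pc.comment_text LIKE '/* wbsetlabel%'
    OR pc.comment_text LIKE '/* wbsetdescription%'
    OR pc.comment_text LIKE '/* wbsetaliases%')
  AND (cc.comment_text LIKE '/* wbsetlabel%'
    OR cc.comment_text LIKE '/* wbsetdescription%'
    OR cc.comment_text LIKE '/* wbsetaliases%');

On a table of this size the self-join would of course have to be batched by rev_id range rather than run in one go.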

Emijrpbot isn't even in the top 100 of accounts that make successive modifications to the same page (known as reedits in Wikiscan). Emijrp (talk) 15:42, 7 September 2024 (UTC)
I second this. Squashing/merging of revisions might save us a lot of space; of the current 349 GB, perhaps a quarter can be shaved off. That would give us some room to find more long-term solutions. So9q (talk) 13:59, 14 September 2024 (UTC)
When it comes to merging revisions, there are two kinds of merging. On the one hand, you can merge them at the time the edits are made. On the other hand, you could look back and merge edits within the history.
If storage space is the main bottleneck, editing the history might allow us to merge edits by the same user on the same item even if they are within an hour of each other, instead of just 10 seconds. ChristianKl 11:46, 24 September 2024 (UTC)
I agree; the timespan is not what matters, edits in between are. As long as there are no edits in between and they fall within the same date, I'm good with merging. So9q (talk) 10:16, 26 September 2024 (UTC)

FWIW, the idea would be to remove revisions, not to merge them. Let's say a bot makes 100 edits in one minute on an item; the idea would be to remove the 98 edits between the first and the last from our databases. ASarabadani (WMF) (talk) 12:38, 24 September 2024 (UTC)
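
Purely as an illustration of what such a removal could look like at the database level, here is a sketch with hypothetical revision ids 1111 and 9999 bounding a single bot run on a hypothetical page 123 (a real cleanup would also have to deal with the content, slots, text, and comment rows, and with dumps):

-- Re-point the run's last revision at the run's first revision...
UPDATE revision
SET rev_parent_id = 1111
WHERE rev_id = 9999;

-- ...then drop the 98 revisions in between (assumes every revision of
-- this page in the id range belongs to the run, which must be checked).
DELETE FROM revision
WHERE rev_page = 123
  AND rev_id > 1111
  AND rev_id < 9999;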

That is a terrible idea if it means a user looking at the history can no longer ascertain who made what edit. How would this affect the edit history? Can someone make an example item on test.wikidata and delete a few revisions from the database so we can see how it plays out in the interface afterwards? So9q (talk) 10:19, 26 September 2024 (UTC)

More timeframe details wanted

@ASarabadani (WMF) wrote "With the current projections and without interventions, it will soon become unstable and later completely unusable rendering all of Wikidata without a way forward."

Could someone provide more details about what "soon" means? 1 month? 1 year? 3 years? Essentially, I want to know how critical it is. So9q (talk) 13:57, 14 September 2024 (UTC)

The answer to your question is a massive "it depends": on how much growth it's going to have, on how many interventions we do, and on what you mean by "unusable". ASarabadani (WMF) (talk) 12:47, 24 September 2024 (UTC)

Revision archive

Would it be feasible to offload revisions that are older than x days to an archive table that does not need to be kept in memory? If we were to offload everything older than 60 days, how big would the online, in-memory tables be?

How many of the queries to the database that you specified are for old revisions?

I'm willing to wait 30-60 seconds for old revisions if that is needed to keep the project up in the long term. So9q (talk) 14:12, 14 September 2024 (UTC)

If you mean the revision table, I fear not: whenever one clicks on the History tab, that table is queried; the first page of page history loading for 30-60 seconds (for pages that haven’t been edited more than 50 times in the past 60 days, i.e. the vast majority of pages) wouldn’t be acceptable IMO. If you mean the text table, it’s already offloaded (on all WMF wikis) via mw:Manual:External storage. —Tacsipacsi (talk) 22:40, 22 September 2024 (UTC)
Okay, then what about 5-10 seconds with a clear loading indicator on screen, for all revisions except the last 50 on each item (without time considerations)?
How many revisions could be moved then, @ASarabadani (WMF)?
How would that affect the tables?
How much effort would it be to implement in MediaWiki? So9q (talk) 10:22, 26 September 2024 (UTC)
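
A minimal sketch of the offloading idea, assuming a hypothetical revision_archive table with the same schema as revision (window functions need MariaDB 10.2+, and a production migration would run in batches inside transactions, followed by deleting the moved rows):

-- Move everything except each page's 50 newest revisions to the archive.
INSERT INTO revision_archive
SELECT r.*
FROM revision AS r
JOIN (
  SELECT rev_id,
         ROW_NUMBER() OVER (PARTITION BY rev_page
                            ORDER BY rev_timestamp DESC) AS rn
  FROM revision
) AS ranked ON ranked.rev_id = r.rev_id
WHERE ranked.rn > 50;

MediaWiki itself would then need to know to fall back to the archive table for old history pages, which is presumably where most of the implementation effort would go.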

Is the problem with the edit engine or the query engine?

I understand that WDQS uses a version of WD without history, and often has 60-second timeouts on queries that QLever has no problem with, while the GUI and update APIs have a full revision database. Which is under more stress, and what barriers are there to changing the query engine? I don't observe problems with the GUI or QuickStatements myself. Vicarage (talk) 15:03, 23 September 2024 (UTC)

Yes, having an idea of how much WDQS is the bottleneck and how much Wikidata's core database is the bottleneck is important for understanding where the focus should be. ChristianKl 13:23, 25 September 2024 (UTC)

I started a discussion in project chat about this situation

See Wikidata:Project chat#New ticket about making Wikidata horizontally scalable. I invite you all to join the discussion about this critical issue, which affects all of WMF, and about the consequences this dire situation has for the goal of sharing in the sum of all human knowledge. So9q (talk) 11:18, 26 September 2024 (UTC)

Let's compare with OSM

WMF "official" details:

  • The revision table is covered, but Nikki's observation that the UI generates a lot of unnecessary edits (e.g. when editing a label and a description and clicking publish) is not mentioned.
  • I have not been able to find information about individual machines in the s8 cluster or relevant statistics.

OSM database details:

  • According to this tableinfo dump from 2016, the OSM tables were huge compared to wikidatawiki.
  • I have not heard a single "impending doom" message like this one from the OSMF or the system administrators.
  • I found a list of the hardware used by OSM to run their main PostgreSQL database.

Amsterdam

The server they are currently using for hosting the main DB in Amsterdam is a Supermicro SYS-2029U-TN24R4T, which has a 6 TB RAM capacity. They only have ~500 GB of RAM in this server as of 2024-10-10.

They have a read-only mirror DB in Amsterdam with a 3 TB RAM capacity but only 256 GB installed.

Dublin

In Dublin they have an even faster machine, a Supermicro SYS-120U-TNR, acting as a read-only mirror, with support for up to 8 TB of RAM.

London

In London they have two read-only mirrors, also with a 6 TB RAM capacity each.

Conclusions

The database structure of MediaWiki seems reasonably optimized, with relatively little duplication, when I looked into the tables.

In short, I conclude:

  • we don't seem to have major issues with the database architecture compared to OSM
  • we don't seem to have major issues with the database software compared to OSM (we use MariaDB, they use PostgreSQL; both can be clustered if they ever outgrow a single fast state-of-the-art machine)
  • we seem to have capacity issues caused by old hardware no longer being able to support the size of wikidatawiki. --So9q (talk) 20:36, 10 October 2024 (UTC)
  1. Please don't insult the Wikimedia developers by claiming that they are inexperienced or not sufficiently qualified, when you don't have any relevant expertise to assess their qualifications. Take this as an admin warning.
    The primary reason MediaWiki gets developed is hosting Wikipedia. To motivate the WMF to spend resources updating MediaWiki to handle more data for the sake of Wikidata, it's likely useful to write up a document like the one ASarabadani wrote, and using terms like "Impending Doom" might help someone give ASarabadani the resources to make the changes that ASarabadani considers necessary.
    Saying that ASarabadani bothers the community by writing this article on his user profile seems like a strange complaint. Transparency is a virtue. ChristianKl 21:46, 10 October 2024 (UTC)
    Thanks for the heads up. I pruned a bit. I agree transparency is fantastic to have; it's one of the few things you cannot get enough of when it comes to organizations. :)
    I'm aware that Wikipedia is the flagship, but if wikidatawiki goes down, so does Wikipedia, since it is needed in the background. Hence, focusing only on Wikipedia when provisioning hardware seems like a bad idea. I was not aware that it was an organizational memo not intended for the community to read or base actions on.
    Unfortunately, we have a severe lack of information about the operations of wikidatawiki.
    I actually got a response this evening in the Telegram Wikidata chat from @ASarabadani (WMF), who said that my ideas for buying a beefier server had already been considered and rejected. Unfortunately, this information does not seem to be on Wikitech. I also don't see any public reports from WMDE's technical team anywhere about their considerations, wishes, or recommendations.
    At this moment I cannot find even basic information, e.g.:
    1. who is even responsible for the s8 cluster, WMDE or WMF?
    2. what is the budget for the s8 cluster for 2024?
    3. what are the relevant machine statistics for the s8 wikidatawiki cluster?
    4. what are the specs on the individual machines in the cluster (e.g. can they be upgraded to house more RAM or are new servers needed?)
    It seems worth investigating further. So9q (talk) 22:52, 10 October 2024 (UTC)
    1. Certainly the WMF is responsible for the s8 cluster; however, it is naturally wikidata.org that primarily makes use of this cluster (though all clients (Wikipedias and such) also query the databases on s8).
    2. That would be a question for WMF staff; I don't believe any public numbers are split up like this.
    3. At https://noc.wikimedia.org/db.php you can see the information about which hosts make up the cluster. However, I can't see a way to see what hardware these hosts have. I'm going to go do some poking, though.
    4. See above.
    ·addshore· talk to me! 14:37, 3 November 2024 (UTC)
    All s8 machines have 503 GB `MemTotal_bytes`, it would seem. ·addshore· talk to me! 16:41, 3 November 2024 (UTC)

`git` ideas re revisions

`squash` sounds like the `git` approach to the multiple-revisions problem. And we could effectively do that to historical revisions with some ease, as we don't have to recompute hashes etc. for "commits", since we don't actually have commits. Arguably you could a) do some sort of automatic squashing of historical revisions, b) provide an API that squashes some revisions, which users can trigger, or c) allow some situations where real-time edits are squashed (or amended)?

Another `git`-style thing, which would likely be more work, would be introducing some idea of branching content (specifically for the period in which edits are made), but this sounds like a lot of work. ·addshore· talk to me! 10:56, 26 October 2024 (UTC)

Though I also see the other side: changing history is bad / not nice, and losing the granular info that already exists is also bad and not nice. And of course, the classic: edit summaries are hard... ·addshore· talk to me! 07:58, 30 October 2024 (UTC)
How much would it help (any guesses?) if the "squashing" were limited to consecutive edits within some limited time period (1 hour, 1 day?) by a single user with the same non-automated edit summary? I.e. ignore all the auto-generated "Changed claim/Created claim/Removed claim" etc. parts of the edit summaries when squashing, replacing them with something generic ("Edits squashed"?). It seems to me this wouldn't be at all harmful, as the full details of any changes can be reviewed easily enough. ArthurPSmith (talk) 21:03, 4 November 2024 (UTC)
It should be the sort of number that is easy enough to figure out in SQL, and I might give it a go at some point. ·addshore· talk to me! 08:16, 5 November 2024 (UTC)
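
For what it's worth, a simplified version of that number (same actor within an hour of the parent revision, ignoring the edit-summary condition) could be estimated with something like this untested sketch, which in practice would have to be batched by rev_id range:

-- Count revisions that a simple squash rule would fold into their
-- predecessor: parent revision by the same actor, no more than an
-- hour older.
SELECT COUNT(*) AS squashable
FROM revision AS child
JOIN revision AS parent ON parent.rev_id = child.rev_parent_id
WHERE child.rev_actor = parent.rev_actor
  AND TIMESTAMPDIFF(SECOND,
        STR_TO_DATE(parent.rev_timestamp, '%Y%m%d%H%i%s'),
        STR_TO_DATE(child.rev_timestamp,  '%Y%m%d%H%i%s')) <= 3600;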

67 GB of revision sha1s?

@ASarabadani_(WMF): So, based on 2267515456 rows (max id) and 32 bytes for each sha1, 67.5 GB of the table is currently just revision hashes. Sure, they aren't used in indexes, but that's still ~20% of the table content made up of hashes? And for Wikidata these are also just duplicates of the hashes of the content, as currently only one content slot is used at a time? ·addshore· talk to me! 08:16, 30 October 2024 (UTC)
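
That back-of-the-envelope figure checks out:

-- 2,267,515,456 rows x 32 bytes per sha1, converted to GiB.
SELECT 2267515456 * 32 / POW(1024, 3) AS sha1_gib;  -- ~67.57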

@Addshore I have thought of different ways to improve this. One is to move hashes to a dedicated table (and normalize it), but that sort of defeats the whole purpose. The other one I have been considering is to trim them to 16 characters and compare only that part. Certainly we don't need cryptographically secure hashing for the revisions. ASarabadani (WMF) (talk) 17:18, 2 November 2024 (UTC)
Though, ultimately, at the level of only 67 GB, it's probably not worth the engineering time to bother touching this in any way? ·addshore· talk to me! 14:18, 3 November 2024 (UTC)

Problem statement clarification

> Wikidata's core database has been growing extremely fast since its inception.

Do we have a measurement of this growth over time? Is the growth rate at the DB table level increasing? Decreasing? Do we have projections? I am assuming yes, as you say "With the current projections", but I can't see any such projections linked?

> For example, it's 1.5x bigger than database of English Wikipedia while being a decade younger.

This in itself doesn't sound like a "problem" to me, just a comparison to some other known sizes.

> it will soon become unstable and later completely unusable

This does sound bad; however, it isn't directly the problem you are stating, I believe?

> we can't expand the memory further

This sounds like the core issue that you are identifying? And/or the fact that too many queries are happening? But I expect it is primarily the former?

If so, it could be useful to look at the index sizes of the wikidatawiki DB, rather than just the collective table sizes. ·addshore· talk to me! 14:28, 3 November 2024 (UTC)

I'm going to go ahead and immediately answer one of my own questions:

-- Per-table data and index sizes (GB) for wikidatawiki, largest first,
-- skipping tables under 1 GB.
SELECT
  TABLE_NAME AS `Table`,
  ROUND((DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024 / 1024) AS `Total Size (GB)`,
  ROUND((DATA_LENGTH) / 1024 / 1024 / 1024) AS `Data Size (GB)`,
  ROUND((INDEX_LENGTH) / 1024 / 1024 / 1024) AS `Index Size (GB)`
FROM
  information_schema.TABLES
WHERE
  TABLE_SCHEMA = "wikidatawiki"
AND
  ROUND((DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024 / 1024) > 1
ORDER BY
  (DATA_LENGTH + INDEX_LENGTH)
DESC;

Resulting in...

+-------------------------+-----------------+----------------+-----------------+
| Table                   | Total Size (GB) | Data Size (GB) | Index Size (GB) |
+-------------------------+-----------------+----------------+-----------------+
| revision                |             340 |            124 |             216 |
| wbt_item_terms          |             243 |            109 |             134 |
| pagelinks               |             107 |             53 |              53 |
| content                 |              99 |             99 |               0 |
| slots                   |              81 |             53 |              28 |
| text                    |              74 |             74 |               0 |
| change_tag              |              73 |             22 |              51 |
| comment                 |              67 |             55 |              13 |
| wbt_text_in_lang        |              50 |             26 |              24 |
| wbt_term_in_lang        |              46 |             25 |              21 |
| wbt_text                |              22 |             12 |              10 |
| page_props              |              22 |             10 |              12 |
| page                    |              14 |              6 |               7 |
| externallinks           |              13 |              7 |               7 |
| cu_changes              |              10 |              5 |               5 |
| watchlist               |              10 |              4 |               6 |
| wb_changes_subscription |               9 |              4 |               6 |
| wb_items_per_site       |               7 |              3 |               4 |
| abuse_filter_log        |               6 |              2 |               4 |
| recentchanges           |               5 |              1 |               4 |
| linktarget              |               4 |              2 |               2 |
| archive                 |               2 |              1 |               1 |
| logging                 |               2 |              1 |               1 |
+-------------------------+-----------------+----------------+-----------------+
23 rows in set (0.003 sec)

So:

  • Total index size is ~608 GB
  • Index size of revision is ~216 GB, which is around a third of the total index size
  • Terms storage is primarily indexes, and certainly uses a large(ish) chunk of index space.
  • It looks like I can't query `mysql.innodb_index_stats` to get more data, but if @ASarabadani_(WMF): would be up for that, that would be great!

·addshore· talk to me! 14:52, 3 November 2024 (UTC)

The index sizes (note the update times):

cumin2024@db1172.eqiad.wmnet[mysql]> select * from innodb_index_stats where stat_name = 'size' order by stat_value desc limit 50;
+---------------+-------------------------+------------------------------------------+---------------------+-----------+------------+-------------+------------------------------+
| database_name | table_name              | index_name                               | last_update         | stat_name | stat_value | sample_size | stat_description             |
+---------------+-------------------------+------------------------------------------+---------------------+-----------+------------+-------------+------------------------------+
| wikidatawiki  | revision                | PRIMARY                                  | 2024-09-14 08:41:39 | size      |   16303296 |        NULL | Number of pages in the index |
| wikidatawiki  | wbt_item_terms          | PRIMARY                                  | 2024-10-16 19:49:24 | size      |   14334912 |        NULL | Number of pages in the index |
| wikidatawiki  | content                 | PRIMARY                                  | 2023-05-07 19:57:24 | size      |   12960512 |        NULL | Number of pages in the index |
| wikidatawiki  | revision                | rev_page_actor_timestamp                 | 2024-09-14 08:41:39 | size      |    9922496 |        NULL | Number of pages in the index |
| wikidatawiki  | text                    | PRIMARY                                  | 2023-04-28 13:17:10 | size      |    9733568 |        NULL | Number of pages in the index |
| wikidatawiki  | wbt_item_terms          | wbt_item_terms_term_in_lang_id_item_id   | 2024-10-16 19:49:24 | size      |    9443157 |        NULL | Number of pages in the index |
| wikidatawiki  | wbt_item_terms          | wbt_item_terms_item_id                   | 2024-10-16 19:49:24 | size      |    8115685 |        NULL | Number of pages in the index |
| wikidatawiki  | revision                | rev_page_timestamp                       | 2024-09-14 08:41:39 | size      |    7958269 |        NULL | Number of pages in the index |
| wikidatawiki  | comment                 | PRIMARY                                  | 2024-04-11 14:33:44 | size      |    7146432 |        NULL | Number of pages in the index |
| wikidatawiki  | pagelinks               | PRIMARY                                  | 2024-06-25 16:25:08 | size      |    6968384 |        NULL | Number of pages in the index |
| wikidatawiki  | slots                   | PRIMARY                                  | 2024-03-12 06:34:00 | size      |    6937664 |        NULL | Number of pages in the index |
| wikidatawiki  | revision                | rev_actor_timestamp                      | 2024-09-14 08:41:39 | size      |    5683836 |        NULL | Number of pages in the index |
| wikidatawiki  | revision                | rev_timestamp                            | 2024-09-14 08:41:39 | size      |    4715008 |        NULL | Number of pages in the index |
| wikidatawiki  | pagelinks               | pl_backlinks_namespace_target_id         | 2024-06-25 16:25:08 | size      |    3858944 |        NULL | Number of pages in the index |
| wikidatawiki  | slots                   | slot_revision_origin_role                | 2024-03-12 06:34:00 | size      |    3666368 |        NULL | Number of pages in the index |
| wikidatawiki  | wbt_text_in_lang        | PRIMARY                                  | 2024-10-03 07:53:06 | size      |    3343168 |        NULL | Number of pages in the index |
| wikidatawiki  | wbt_term_in_lang        | PRIMARY                                  | 2024-10-04 08:45:40 | size      |    3214701 |        NULL | Number of pages in the index |
| wikidatawiki  | pagelinks               | pl_target_id                             | 2024-06-25 16:25:08 | size      |    3139648 |        NULL | Number of pages in the index |
| wikidatawiki  | change_tag              | PRIMARY                                  | 2024-05-20 13:05:17 | size      |    2830464 |        NULL | Number of pages in the index |
| wikidatawiki  | wbt_text_in_lang        | wbt_text_in_lang_text_id_text_id         | 2024-10-03 07:53:06 | size      |    2121329 |        NULL | Number of pages in the index |
| wikidatawiki  | change_tag              | ct_rc_tag_id                             | 2024-05-20 13:05:17 | size      |    2094464 |        NULL | Number of pages in the index |
| wikidatawiki  | change_tag              | ct_rev_tag_id                            | 2024-05-20 13:05:17 | size      |    1917376 |        NULL | Number of pages in the index |
| wikidatawiki  | change_tag              | ct_tag_id_id                             | 2024-05-20 13:05:17 | size      |    1836864 |        NULL | Number of pages in the index |
| wikidatawiki  | wbt_term_in_lang        | wbt_term_in_lang_text_in_lang_id_lang_id | 2024-10-04 08:45:40 | size      |    1680236 |        NULL | Number of pages in the index |

Hope that helps. For the size of the database over time, we have graphs based on backup sizes. I posted one in s4: phab:F37157040 ASarabadani (WMF) (talk) 16:59, 5 November 2024 (UTC)

Many thanks! ·addshore· talk to me! 16:17, 10 December 2024 (UTC)