Progress in percents: 0 1 2 3 …

Well, servers usually don’t crash ( our English Wikipedia master is running for 800 days, on white-box hardware, RAID0, 4.0 ;-), but when they do (like some kernel bugs on our big big boxes), one of most painful experiences is InnoDB log recovery.

Usually people will reduce the innodb-log-file-size to ease up with the recovery (it helps, in a way :), but the real problem is somewhere else.

See, when InnoDB does crash recovery, it applies the log changes in memory, and builds a flush list. It doesn’t flush any pages during the recovery process, so the flush list grows big, thousands, tens of thousands, maybe millions kind of big, anyway, big-number big.

Oh, did I mention? The flush list is actually a linked list, not some kind of hippy tree stuff. Every time a log record is read from a log and something gets updated, the flush list will be traversed, thousands, tens of thousands, maybe millions of entries.

The expensive code looks something like this:

while (b && (ut_dulint_cmp(b->oldest_modification,
             block->oldest_modification) > 0)) {
       prev_b = b;
       b = UT_LIST_GET_NEXT(flush_list, b);
}

Then your profile starts looking like this, and you wish your systems didn’t crash:

%        symbol name
87.6978  buf_flush_insert_sorted_into_flush_list
 5.8571  -kernel
 1.9793  recv_apply_hashed_log_recs
 0.8059  buf_calc_page_new_checksum

So, the recovery process cost is exponential, and people work around it by reducing the log file size, by reducing performance of their system, while the actual fix is right there, in optimizing the data structure. Current model is outdated for anything built in last 5 years anyway.

Oh, and of course, I’d like systems not to crash at all, like that database master on whitebox raid0 running for 800 days.

Update: this is old stuff. Peter wrote about it, Heikki opened a bug, then thought it would need more than five minutes to fix it and classified it as a feature request, so Peter could write more about it. That makes it even more sad. We’d probably change the synopsis for the feature request, “make crash recovery work”.

Update 2: get the patch at Percona (Yasufumi is god :)

3 thoughts on “Progress in percents: 0 1 2 3 …”

  1. Wow… that is REALLY REALLY stupid and NEEDS to be fixed.

    It’s also not scalable across cores (obviously) which is why my InnoDB recoveries always use one core.

    Dumb dumb dumb.

  2. Domas,

    I wrote about it over a year ago
    http://www.mysqlperformanceblog.com/2007/07/17/innodb-recovery-is-large-buffer-pool-always-better/

    And there is MySQL bug filed on it http://bugs.mysql.com/bug.php?id=29847 which is of course classified as Feature request.

    The workaround one can often use in this case is to REDUCE innodb_buffer_pool_size and remove innodb_flush_method=O_DIRECT so OS cache can be used for caching. This can speed up recovery 10x in some cases.

Comments are closed.