ACID in HBase


By Lars Hofhansl

As we know, ACID stands for Atomicity, Consistency, Isolation, and Durability.

HBase supports ACID in limited ways, namely Puts to the same row provide all ACID guarantees. (HBASE-3584 adds multi-op transactions and HBASE-5229 adds multi-row transactions, but the principle remains the same.)

So how does ACID work in HBase?

HBase employs a kind of MVCC, and HBase has no mixed read/write transactions.

The nomenclature in HBase is a bit strange for historical reasons. In a nutshell, each RegionServer maintains what I will call "strictly monotonically increasing transaction numbers".

When a write transaction (a set of Puts or Deletes) starts, it retrieves the next highest transaction number. In HBase this is called a WriteNumber.
When a read transaction (a Scan or Get) starts, it retrieves the transaction number of the last committed transaction. HBase calls this the ReadPoint.

Each created KeyValue is tagged with its transaction's WriteNumber (this tag, for historical reasons, is called the memstore timestamp in HBase. Note that this is separate from the application-visible timestamp).

The high-level flow of a write transaction in HBase looks like this:
  • lock the row(s), to guard against concurrent writes to the same row(s)
  • retrieve the current WriteNumber
  • apply the changes to the WAL (Write Ahead Log)
  • apply the changes to the Memstore (using the acquired WriteNumber to tag the KeyValues)
  • commit the transaction, i.e. attempt to roll the ReadPoint forward to the acquired WriteNumber
  • unlock the row(s)

The high-level flow of a read transaction looks like this:
  • open the scanner
  • get the current ReadPoint
  • filter out all scanned KeyValues with memstore timestamp > the ReadPoint
  • close the scanner (this is initiated by the client)

In reality it is a bit more complicated, but this is enough to illustrate the point. Note that a reader acquires no locks at all, but we still get all of ACID.
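
The two flows can be sketched in a few lines of Python. This is only a toy model, not HBase code: `ToyRegion`, its fields, and the `before_commit` hook are hypothetical names introduced for illustration, a single lock stands in for per-row locks, and the WAL is just an in-memory list.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyValue:
    row: str
    column: str
    value: str
    memstore_ts: int  # the writer's WriteNumber, not the application timestamp


class ToyRegion:
    """Toy model of one region (hypothetical, heavily simplified)."""

    def __init__(self):
        self.row_lock = threading.Lock()  # assumption: one lock for all rows
        self.next_write_number = 0
        self.read_point = 0
        self.wal = []        # stand-in for the write ahead log
        self.memstore = []   # KeyValues in write order

    def put(self, row, cells, before_commit=None):
        with self.row_lock:                          # lock the row(s)
            self.next_write_number += 1              # retrieve a WriteNumber
            wn = self.next_write_number
            self.wal.append((wn, row, dict(cells)))  # apply changes to the WAL
            for column, value in cells.items():      # apply changes to the memstore
                self.memstore.append(KeyValue(row, column, value, wn))
            if before_commit is not None:
                before_commit()                      # illustration-only hook
            self.read_point = wn                     # commit: roll the ReadPoint forward
        return wn                                    # lock released on leaving the block

    def scan(self, row):
        read_point = self.read_point                 # get the current ReadPoint, no locks
        visible = {}
        for kv in self.memstore:
            if kv.row != row or kv.memstore_ts > read_point:
                continue                             # filter out what is not yet committed
            visible[kv.column] = kv.value            # later writes overwrite earlier ones
        return visible


region = ToyRegion()
region.put("r1", {"a": "1", "b": "1"})
seen_mid_write = []
region.put("r1", {"a": "2", "b": "2"},
           before_commit=lambda: seen_mid_write.append(region.scan("r1")))
print(seen_mid_write[0])   # the row as seen while the second put was uncommitted
print(region.scan("r1"))   # the row after the commit
```

The scan that runs inside the hook starts before the second put commits, so it filters out both new KeyValues and returns the old version of the whole row rather than a mix.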

It is important to realize that this only works if transactions are committed strictly serially; otherwise an earlier uncommitted transaction could become visible when one that started later commits first. In HBase transactions are typically short, so this is not a problem.

HBase does exactly that: all transactions are committed serially.

Committing a transaction in HBase means setting the current ReadPoint to the transaction's WriteNumber, and hence making its changes visible to all new Scans.
HBase keeps a list of all unfinished transactions. A transaction's commit is delayed until all prior transactions have committed. Note that HBase can still make all changes immediately and concurrently; only the commits are serial.
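
A minimal sketch of that bookkeeping, assuming a simple in-memory queue: `ToyMvcc`, `begin` and `complete` are hypothetical names, and where the description above says a commit is delayed, this toy version simply returns without advancing the ReadPoint instead of blocking the caller.

```python
from collections import deque


class ToyMvcc:
    """Toy tracker for unfinished write transactions (hypothetical names)."""

    def __init__(self):
        self.next_write_number = 0
        self.read_point = 0
        self.pending = deque()  # [write_number, completed] entries in start order

    def begin(self):
        # Hand out the next WriteNumber and remember the transaction as unfinished.
        self.next_write_number += 1
        self.pending.append([self.next_write_number, False])
        return self.next_write_number

    def complete(self, write_number):
        # Mark this transaction as done.
        for entry in self.pending:
            if entry[0] == write_number:
                entry[1] = True
                break
        # The ReadPoint only advances over a prefix of completed transactions,
        # so a later transaction can never expose an earlier unfinished one.
        while self.pending and self.pending[0][1]:
            self.read_point = self.pending.popleft()[0]
        return self.read_point


mvcc = ToyMvcc()
t1 = mvcc.begin()
t2 = mvcc.begin()
print(mvcc.complete(t2))  # t1 is still unfinished, so the ReadPoint stays put
print(mvcc.complete(t1))  # both are done, so the ReadPoint moves past t2
```

Setting the ReadPoint directly to `t2` in the first `complete` call would have made the changes of the still unfinished `t1` visible, which is exactly the hazard that serial commits avoid.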

Since HBase does not guarantee any consistency between regions (and each region is hosted at exactly one RegionServer), all MVCC data structures only need to be kept in memory on every RegionServer.

The next interesting question is what happens during compactions.

In HBase, compactions are used to join multiple small store files (created by flushes of the Memstore to disk) into larger ones and also to remove "garbage" in the process.
Garbage here means KeyValues that either expired due to a column family's TTL or VERSION settings or were marked for deletion. See here and here for more details.

Now imagine a compaction happening while a scanner is still scanning through the KeyValues. It would now be possible to see a partial row (see here for how HBase defines a "row"), i.e. a row comprised of versions of KeyValues that do not reflect the outcome of any serializable transaction schedule.

The solution in HBase is to keep track of the earliest ReadPoint used by any open scanner and never collect any KeyValues with a memstore timestamp larger than that ReadPoint. That logic was, among other enhancements, added with HBASE-2856, which allowed HBase to support ACID guarantees even with concurrent flushes.
HBASE-5569 finally enables the same logic for the delete markers (and hence deleted KeyValues).

Lastly, note that a KeyValue's memstore timestamp can be cleared (set to 0) when it is older than the oldest scanner, i.e. it is known to be visible to every scanner, since all earlier scanners are finished.
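
To make the compaction rule concrete, here is a toy sketch. It reflects one reading of the rule and is not HBase's compaction code: the `compact` and `scan` functions and the tuple layout are hypothetical, and it is assumed that KeyValues newer than the smallest ReadPoint are kept and do not count toward the version limit.

```python
from collections import namedtuple

KeyValue = namedtuple("KeyValue", "row column timestamp value memstore_ts")


def compact(keyvalues, smallest_read_point, max_versions=1):
    """Toy compaction: enforce max_versions without breaking open scanners."""
    kept = []
    counted = {}  # (row, column) -> versions kept that count toward the limit
    # Visit each column's versions from newest to oldest application timestamp.
    ordered = sorted(keyvalues, key=lambda kv: (kv.row, kv.column, -kv.timestamp))
    for kv in ordered:
        if kv.memstore_ts > smallest_read_point:
            # Some open scanner cannot see this KeyValue yet. Keep it with its
            # memstore timestamp, and do not let it displace older versions.
            kept.append(kv)
            continue
        key = (kv.row, kv.column)
        if counted.get(key, 0) < max_versions:
            counted[key] = counted.get(key, 0) + 1
            # Visible to every open scanner: the memstore timestamp can be cleared.
            kept.append(kv._replace(memstore_ts=0))
    return kept


def scan(keyvalues, read_point):
    """Return the newest visible value per (row, column) for a given ReadPoint."""
    visible = {}
    for kv in sorted(keyvalues, key=lambda kv: kv.timestamp):
        if kv.memstore_ts <= read_point:
            visible[(kv.row, kv.column)] = kv.value
    return visible


store = [
    KeyValue("r1", "a", 50, "oldest", 1),
    KeyValue("r1", "a", 100, "old", 3),
    KeyValue("r1", "a", 200, "new", 6),
]
compacted = compact(store, smallest_read_point=5)
print(scan(compacted, read_point=5))  # a scanner opened before "new" was committed
print(scan(compacted, read_point=6))  # a scanner opened afterwards
```

Without the check against the smallest ReadPoint, the version limit of 1 would have dropped "old" in favor of "new", and the scanner with ReadPoint 5 would have found no value for the column at all.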

Update Thursday, March 22:
A couple of extra points:
  • The ReadPoint is rolled forward even if the transaction failed, in order not to stall later transactions that are waiting to be committed (since this is all in the same process, that just means the roll forward happens in a Java finally block).
  • When updates are written to the WAL, a single record is created for all the changes. There is no separate commit record.
  • When a RegionServer crashes, all in-flight transactions are eventually replayed on another RegionServer if the WAL record was written completely, or discarded otherwise.
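
The last two points can be illustrated with a toy log. The record layout below (a length and CRC32 header followed by a JSON payload) is invented for this sketch and is not HBase's actual WAL format; `encode_record` and `replay` are hypothetical helpers. The point is only that with one record per transaction and no separate commit record, replay is all-or-nothing per record.

```python
import json
import struct
import zlib

# Hypothetical record header: payload length and CRC32 of the payload.
HEADER = struct.Struct(">II")


def encode_record(write_number, changes):
    """Serialize all changes of one transaction into a single log record."""
    payload = json.dumps({"wn": write_number, "changes": changes}).encode("utf-8")
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def replay(log_bytes):
    """Return the transactions whose record was written completely."""
    recovered = []
    offset = 0
    while offset + HEADER.size <= len(log_bytes):
        length, crc = HEADER.unpack_from(log_bytes, offset)
        start = offset + HEADER.size
        payload = log_bytes[start:start + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # incomplete or corrupt record: the transaction is discarded
        recovered.append(json.loads(payload))
        offset = start + length
    return recovered


complete = encode_record(1, [["r1", "a", "1"], ["r1", "b", "1"]])
torn = encode_record(2, [["r1", "a", "2"], ["r1", "b", "2"]])[:-3]  # simulated crash
print(replay(complete + torn))  # only the first transaction is recovered
```

Because both cells of a transaction live in the same record, a crash in the middle of the write can never leave one cell replayed and the other one lost.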

