June 20, 2014, 2:54 p.m.

    Fast Blockchain Scanning

    (Another post about bitcoin RPC from Python.)

    For certain bitcoin applications you'll need to perform some kind of processing on all of the transactions coming through on the bitcoin block chain. In a bitcoin wallet application, for example, you need to check each new transaction to identify any outputs which are spendable by the wallet, and add the corresponding amounts to the wallet balance.

    In this post we'll look at how to do this with bitcoin RPC calls.

    Theres one obvious way to achieve this, but the obvious approach requires a full transaction index, and also works out to be quite inefficient. I'll show you a less obvious, and more efficient way to do this, without the transaction index requirement.

    The 'obvious' approach

    We'll use the RPCHost class from this previous post.

    And let's then assume that we have a block hash for the block we need to process, and just want to process all the transactions in this block. (I'm not going to look at logic for choosing which blocks to add, and when, but may come back to talk about this in a future post.)

    So one way to approach this is to make a call to the 'getblock' RPC command, first of all, and then iterate over the transaction IDs returned by this command to submit additional queries for information about each of these transactions:

    blockHash = '0000000000000000046e61b042e9eae6eb0dde3417c4299f8bc22b2bdbf7c1c4' # for example
    block = rpcHost.call('getblock', blockHash)
    for txid in block['tx']:
        details = rpcHost.call('gettransaction', txid)
        # .. do something with these transaction details

    This is nice and straightforward, but there are some problems with this approach.

    Full transaction index

    The first problem is that calling gettransaction for arbitrary transactions requires a full transaction index.

    Bitcoind accepts a txindex option to control whether or not a full transaction index is created and maintained, and by default this will be turned off. (Some more information about this here.)

    If you trying some arbitrary RPC queries, without the txindex turned on, you're likely to see something like the following:

    >bitcoin-cli getblock 00000000373403049c5fff2cd653590e8cbe6f7ac639db270e7d1a7503d698df
        "hash" : "00000000373403049c5fff2cd653590e8cbe6f7ac639db270e7d1a7503d698df",
        "confirmations" : 78055,
        "size" : 207,
        "height" : 1000,
        "version" : 1,
        "merkleroot" : "3bf8c518a7a1187287516da67cb96733697b1d83eb937e68ae39bd4c08e563b7",
        "tx" : [
        "time" : 1337966144,
        "nonce" : 1029134858,
        "bits" : "1d00ffff",
        "difficulty" : 1.00000000,
        "chainwork" : "000000000000000000000000000000000000000000000000000003e903e903e9",
        "previousblockhash" : "0000000036f7b90238ac6b6026be5e121ac3055f19fffd69f28310a76aa4a5bc",
        "nextblockhash" : "00000000923dac28eba985c5ae67e023361aa91c90e433aa992590d27629cead"
    >bitcoin-cli gettransaction 3bf8c518a7a1187287516da67cb96733697b1d83eb937e68ae39bd4c08e563b7
    error: {"code":-5,"message":"Invalid or non-wallet transaction id"}

    (This block hash and transaction is from bitcoin testnet.)

    Transaction indexing is required in order for bitcoind to get from the transaction hash to the actual transaction data in the blockchain, but bitcoind doesn't actually normally need to be able to look up all of the transactions in the blockchain history. Once a transaction's outputs have been spent, bitcoind can pretty much forget about that transaction, and indexing all of those old, spent transactions has non-negligeable cost, so it's good to be able to discard that information.

    As a bitcoin user I don't really want to have to turn on txindex, incurring a reindexing operation and additional disk and memory footprint, unless this is really necessary.

    Hidden costs

    The second problem is that there's a lot going on behind the scenes for those 3 lines of python script.

    There's a network query and response required for each transaction, for starters, and then bitcoind is also being forced to fully decode all of the transaction details provided by the gettransaction RPC command regardless of whether or not we actually use these details, and everything passed in and out of these queries is getting encoded to and decoded from JSON, in each directions, all of which adds to the time overhead.

    This can get work out pretty slow, then.

    There are a lot of blocks in the bitcoin blockchain and we'd like our application to be able to get through these blocks as quickly as possible, so let's look at how we can improve on this.

    DIY transaction decoding

    A good first step is to actually take over the job of decoding each transaction ourselves.

    This gives us a some extra work to do, but the bitcoin transaction data format is pretty well specified, and it turns out that this is not so hard to do in practice.

    We'll be looking at some basic transaction decoding a bit further down in the post, but for now lets just replace the call to gettransaction with a call to getrawtransaction, and assume we can work with the resulting raw transaction data:

    blockHash = '0000000000000000046e61b042e9eae6eb0dde3417c4299f8bc22b2bdbf7c1c4' # for example
    block = rpcHost.call('getblock', blockHash)
    for txid in block['tx']:
        rawTransactionHex = rpcHost.call('getrawtransaction', txid)
        # .. do something with this raw transaction data

    This will help out to a certain extent, since bitcoind no longer has to decode the whole transaction, and pass all of the elements through JSON and over the network. And we can also choose to only decode the transaction elements that are actually relevant to us. But there's still a network query and response for every single transaction, and there's still that full transaction index requirement (since getrawtransaction needs to find the transaction data, just as with gettransaction).

    Locating transactions

    The transaction index requirement is kind of irritating in our case, because we actually know the block hash containing the transaction, and so we could theoretically just tell this to bitcoind.

    At the time of writing, however, there's no blockhash parameter to either gettransaction or getrawtransaction, and no other way to tell these commands where to find transactions that are not indexed internally, so we'll need to find some other way to work around the issue.

    Raw block data

    The solution, as it turns out, is to go one step deeper in our decoding efforts, and actually decode the whole block ourselves.

    A key piece of information is that, although currently undocumented in the wiki, the getblock RPC command actually accepts a second 'verbose' parameter.

    (In general, it seems like the best place to look for RPC command documentation is in the corresponding source code, not the wiki! Look for source files starting with an 'rpc' prefix. In this case we want rpcblockchain.cpp.)

    The verbose parameter defaults to True, but can be set to False, and then getblock just returns raw bytes for the block.

    The raw bytes for each transaction are included directly in each block, and if we're already decoding raw transaction data then decoding the block itself in order to locate these transactions isn't a lot harder.

    The following specification of the bitcoin block format can be found in the bitcoin wiki::

    Magic novalue always 0xD9B4BEF94 bytes
    Blocksizenumber of bytes following up to end of block4 bytes
    Blockheaderconsists of 6 items80 bytes
    Transaction counterpositive integer VI = VarInt1 - 9 bytes
    transactionsthe (non empty) list of transactionsTransaction counter-many transactions

    This is mostly straightforward, but one detail to note is that transaction byte sizes are not explicitly included, and we have to at least minimally parse each transaction in order to figure out where individual transactions end.

    Here's a concrete block query (from bitcoin testnet, lines split for readability):

    >bitcoin-cli getblock 00000000373403049c5fff2cd653590e8cbe6f7ac639db270e7d1a7503d698df
        "hash" : "00000000373403049c5fff2cd653590e8cbe6f7ac639db270e7d1a7503d698df",
        "confirmations" : 262564,
        "size" : 207,
        "height" : 1000,
        "version" : 1,
        "merkleroot" : "3bf8c518a7a1187287516da67cb96733697b1d83eb937e68ae39bd4c08e563b7",
        "tx" : [
        "time" : 1337966144,
        "nonce" : 1029134858,
        "bits" : "1d00ffff",
        "difficulty" : 1.00000000,
        "chainwork" : "000000000000000000000000000000000000000000000000000003e903e903e9",
        "previousblockhash" : "0000000036f7b90238ac6b6026be5e121ac3055f19fffd69f28310a76aa4a5bc",
        "nextblockhash" : "00000000923dac28eba985c5ae67e023361aa91c90e433aa992590d27629cead"
    >bitcoin-cli getblock 00000000373403049c5fff2cd653590e8cbe6f7ac639db270e7d1a7503d698df false

    It looks like bitcoind omits the magic number and blocksize entries, and if we look at the block header specification (which can be found here) then we can see that this all appears to make sense, and breaks down as follows:

    previous block hash:
    merkle root:
    number of transactions:
    transaction data (for single transaction):

    Decoding variable integers

    Most of this data is fixed size, and we don't care about a lot of the contents, so we can skip right up to the number of transactions entry.

    This is stored in a variable length integer format, and won't always be a single byte.

    The specification for bitcoin variable length integers can be found here, and we can write python code to decode this as follows:

    def _intFromBytes_LittleByteOrder(data):
        multiplier = 1
        result = 0
        for i in range(len(data)):
            byteValue = struct.unpack('<B', data[i:i + 1])[0]
            result += byteValue * multiplier
            multiplier = multiplier << 8
        return result
    def DecodeVarInt(data, pos):
        if pos >= len(data):
            raise RanOutOfData()
        result = _intFromBytes_LittleByteOrder(data[pos:pos + 1])
        pos += 1
        if result < 253:
            return pos, result
        byteSize = 2 ** (result - 252)
        if pos + byteSize > len(data):
            raise RanOutOfData()
        return pos + byteSize, _intFromBytes_LittleByteOrder(data[pos:pos + byteSize])

    DecodeVarInt() accepts a buffer and a position in that buffer, and returns a tuple with new position after decoding the variable integer record and the variable integer value.

    The _intFromBytes_LittleByteOrder() helper is a bit scruffy, but designed to work with both python 2 and 3. If you only have to support Python 3.2 you can replace this with a call to the built-in method int.from_bytes(data, byteorder='little').

    Decoding transactions

    After the number of transactions entry we need to parse the actual transaction data itself, at least to the extent of working out where each transaction ends.

    We can find the transaction data specificiation here, and there's also some useful additional discussion about transaction format in that Ken Shirriff blog post.

    Based on this information, our final block decode method looks like this:

    def GetTransactionsInBlock(data):
            pos = 80 # skip block header
            pos, numberOfTransactions = DecodeVarInt(data, pos)
            result = []
            for i in range(numberOfTransactions):
                startPos = pos
                pos += 4 # skip version
                pos, numberOfInputs = DecodeVarInt(data, pos)
                for i in range(numberOfInputs):
                    pos += 32 # txid
                    pos += 4 # vout
                    pos, scriptLen = DecodeVarInt(data, pos)
                    pos += scriptLen
                    pos += 4 # sequence
                pos, numberOfOutputs = DecodeVarInt(data, pos)
                for i in range(numberOfOutputs):
                    pos += 8 # output amount
                    pos, scriptLen = DecodeVarInt(data, pos)
                    pos += scriptLen
                pos += 4 # lock time
            if pos != len(data):
                raise Exception('invalid block data')
            return result
        except RanOutOfData:
            raise Exception('invalid block data')

    Putting it together

    Note that we'll need to convert from hex encoded string data (as returned from the getblock RPC call) to bytes. Putting this all together we end up with the following:

        blockData_InHex = self._rpcHost.call('getblock', blockHash, False)
        blockData = binascii.unhexlify(blockData_InHex.encode('ascii'))
        rawTransactions = GetTransactionsInBlock(blockData)
        for rawTransaction in rawTransactions:
            # .. do something with this raw transaction data


    It's interesting, I think, to see how this stuff works, at the byte level.

    Decoding bitcoin blocks and transactions isn't so hard in practice, and it can potentially make sense to get your hands a bit dirty and get stuck into these details, even in situations where you're using RPC commands and generally writing 'higher level' code.

    It's nice to let bitcoind take care of a bunch of messy details relating to peer to peer network connectivity and blockchain validity, but if you're writing an application that needs to scan the blockchain, and you're getting transactions through individual gettransaction (or getrawtransaction) RPC calls, you should look at changing this right away!


    This approach to scanning the blockchain is used in the SwapBill embedded protocol and results in a particularly big speedup in this specific case. With SwapBill potentially relevant transactions can be trivially identified from raw transaction data, and most non-SwapBill transactions are then quickly discarded without any further decoding.

    blog comments powered by Disqus