June 20, 2014, 2:54 p.m.

Fast Blockchain Scanning

(Another post about bitcoin RPC from Python.)

For certain bitcoin applications you'll need to perform some kind of processing on all of the transactions coming through on the bitcoin block chain. In a bitcoin wallet application, for example, you need to check each new transaction to identify any outputs which are spendable by the wallet, and add the corresponding amounts to the wallet balance.

In this post we'll look at how to do this with bitcoin RPC calls.

Theres one obvious way to achieve this, but the obvious approach requires a full transaction index, and also works out to be quite inefficient. I'll show you a less obvious, and more efficient way to do this, without the transaction index requirement.

The 'obvious' approach

We'll use the RPCHost class from this previous post.

And let's then assume that we have a block hash for the block we need to process, and just want to process all the transactions in this block. (I'm not going to look at logic for choosing which blocks to add, and when, but may come back to talk about this in a future post.)

So one way to approach this is to make a call to the 'getblock' RPC command, first of all, and then iterate over the transaction IDs returned by this command to submit additional queries for information about each of these transactions:

blockHash = '0000000000000000046e61b042e9eae6eb0dde3417c4299f8bc22b2bdbf7c1c4' # for example
block = rpcHost.call('getblock', blockHash)
for txid in block['tx']:
    details = rpcHost.call('gettransaction', txid)
    # .. do something with these transaction details

This is nice and straightforward, but there are some problems with this approach.

Full transaction index

The first problem is that calling gettransaction for arbitrary transactions requires a full transaction index.

Bitcoind accepts a txindex option to control whether or not a full transaction index is created and maintained, and by default this will be turned off. (Some more information about this here.)

If you trying some arbitrary RPC queries, without the txindex turned on, you're likely to see something like the following:

>bitcoin-cli getblock 00000000373403049c5fff2cd653590e8cbe6f7ac639db270e7d1a7503d698df
{
    "hash" : "00000000373403049c5fff2cd653590e8cbe6f7ac639db270e7d1a7503d698df",
    "confirmations" : 78055,
    "size" : 207,
    "height" : 1000,
    "version" : 1,
    "merkleroot" : "3bf8c518a7a1187287516da67cb96733697b1d83eb937e68ae39bd4c08e563b7",
    "tx" : [
        "3bf8c518a7a1187287516da67cb96733697b1d83eb937e68ae39bd4c08e563b7"
    ],
    "time" : 1337966144,
    "nonce" : 1029134858,
    "bits" : "1d00ffff",
    "difficulty" : 1.00000000,
    "chainwork" : "000000000000000000000000000000000000000000000000000003e903e903e9",
    "previousblockhash" : "0000000036f7b90238ac6b6026be5e121ac3055f19fffd69f28310a76aa4a5bc",
    "nextblockhash" : "00000000923dac28eba985c5ae67e023361aa91c90e433aa992590d27629cead"
}
>bitcoin-cli gettransaction 3bf8c518a7a1187287516da67cb96733697b1d83eb937e68ae39bd4c08e563b7
error: {"code":-5,"message":"Invalid or non-wallet transaction id"}

(This block hash and transaction is from bitcoin testnet.)

Transaction indexing is required in order for bitcoind to get from the transaction hash to the actual transaction data in the blockchain, but bitcoind doesn't actually normally need to be able to look up all of the transactions in the blockchain history. Once a transaction's outputs have been spent, bitcoind can pretty much forget about that transaction, and indexing all of those old, spent transactions has non-negligeable cost, so it's good to be able to discard that information.

As a bitcoin user I don't really want to have to turn on txindex, incurring a reindexing operation and additional disk and memory footprint, unless this is really necessary.

Hidden costs

The second problem is that there's a lot going on behind the scenes for those 3 lines of python script.

There's a network query and response required for each transaction, for starters, and then bitcoind is also being forced to fully decode all of the transaction details provided by the gettransaction RPC command regardless of whether or not we actually use these details, and everything passed in and out of these queries is getting encoded to and decoded from JSON, in each directions, all of which adds to the time overhead.

This can get work out pretty slow, then.

There are a lot of blocks in the bitcoin blockchain and we'd like our application to be able to get through these blocks as quickly as possible, so let's look at how we can improve on this.

DIY transaction decoding

A good first step is to actually take over the job of decoding each transaction ourselves.

This gives us a some extra work to do, but the bitcoin transaction data format is pretty well specified, and it turns out that this is not so hard to do in practice.

We'll be looking at some basic transaction decoding a bit further down in the post, but for now lets just replace the call to gettransaction with a call to getrawtransaction, and assume we can work with the resulting raw transaction data:

blockHash = '0000000000000000046e61b042e9eae6eb0dde3417c4299f8bc22b2bdbf7c1c4' # for example
block = rpcHost.call('getblock', blockHash)
for txid in block['tx']:
    rawTransactionHex = rpcHost.call('getrawtransaction', txid)
    # .. do something with this raw transaction data

This will help out to a certain extent, since bitcoind no longer has to decode the whole transaction, and pass all of the elements through JSON and over the network. And we can also choose to only decode the transaction elements that are actually relevant to us. But there's still a network query and response for every single transaction, and there's still that full transaction index requirement (since getrawtransaction needs to find the transaction data, just as with gettransaction).

Locating transactions

The transaction index requirement is kind of irritating in our case, because we actually know the block hash containing the transaction, and so we could theoretically just tell this to bitcoind.

At the time of writing, however, there's no blockhash parameter to either gettransaction or getrawtransaction, and no other way to tell these commands where to find transactions that are not indexed internally, so we'll need to find some other way to work around the issue.

Raw block data

The solution, as it turns out, is to go one step deeper in our decoding efforts, and actually decode the whole block ourselves.

A key piece of information is that, although currently undocumented in the wiki, the getblock RPC command actually accepts a second 'verbose' parameter.

(In general, it seems like the best place to look for RPC command documentation is in the corresponding source code, not the wiki! Look for source files starting with an 'rpc' prefix. In this case we want rpcblockchain.cpp.)

The verbose parameter defaults to True, but can be set to False, and then getblock just returns raw bytes for the block.

The raw bytes for each transaction are included directly in each block, and if we're already decoding raw transaction data then decoding the block itself in order to locate these transactions isn't a lot harder.

The following specification of the bitcoin block format can be found in the bitcoin wiki::

Field	Description	Size
Magic no	value always 0xD9B4BEF9	4 bytes
Blocksize	number of bytes following up to end of block	4 bytes
Blockheader	consists of 6 items	80 bytes
Transaction counter	positive integer VI = VarInt	1 - 9 bytes
transactions	the (non empty) list of transactions	Transaction counter-many transactions

This is mostly straightforward, but one detail to note is that transaction byte sizes are not explicitly included, and we have to at least minimally parse each transaction in order to figure out where individual transactions end.

Here's a concrete block query (from bitcoin testnet, lines split for readability):

>bitcoin-cli getblock 00000000373403049c5fff2cd653590e8cbe6f7ac639db270e7d1a7503d698df
{
    "hash" : "00000000373403049c5fff2cd653590e8cbe6f7ac639db270e7d1a7503d698df",
    "confirmations" : 262564,
    "size" : 207,
    "height" : 1000,
    "version" : 1,
    "merkleroot" : "3bf8c518a7a1187287516da67cb96733697b1d83eb937e68ae39bd4c08e563b7",
    "tx" : [
        "3bf8c518a7a1187287516da67cb96733697b1d83eb937e68ae39bd4c08e563b7"
    ],
    "time" : 1337966144,
    "nonce" : 1029134858,
    "bits" : "1d00ffff",
    "difficulty" : 1.00000000,
    "chainwork" : "000000000000000000000000000000000000000000000000000003e903e903e9",
    "previousblockhash" : "0000000036f7b90238ac6b6026be5e121ac3055f19fffd69f28310a76aa4a5bc",
    "nextblockhash" : "00000000923dac28eba985c5ae67e023361aa91c90e433aa992590d27629cead"
}
>bitcoin-cli getblock 00000000373403049c5fff2cd653590e8cbe6f7ac639db270e7d1a7503d698df false
01000000bca5a46aa71083f269fdff195f05c31a125ebe26606bac3802b9f73600000000
b763e5084cbd39ae687e93eb831d7b693367b97ca66d51877218a1a718c5f83b40bebf4f
ffff001d0a5a573d01010000000100000000000000000000000000000000000000000000
00000000000000000000ffffffff1f0440bebf4f0122172f503253482f49636549726f6e
2d51432d6d696e65722fffffffff0100f2052a01000000232103a5f981bf546b95152ed6
695bc50edd0b8db3afb48839b9d58714519e5bdd1f95ac00000000

It looks like bitcoind omits the magic number and blocksize entries, and if we look at the block header specification (which can be found here) then we can see that this all appears to make sense, and breaks down as follows:

version:
01000000
previous block hash:
bca5a46aa71083f269fdff195f05c31a125ebe26606bac3802b9f73600000000
merkle root:
b763e5084cbd39ae687e93eb831d7b693367b97ca66d51877218a1a718c5f83b
time:
40bebf4f
difficulty:
ffff001d
nonce:
0a5a573d
number of transactions:
01
transaction data (for single transaction):
01000000010000000000000000000000000000000000000000000000000000
000000000000ffffffff1f0440bebf4f0122172f503253482f49636549726f6e
2d51432d6d696e65722fffffffff0100f2052a01000000232103a5f981bf546b
95152ed6695bc50edd0b8db3afb48839b9d58714519e5bdd1f95ac00000000

Decoding variable integers

Most of this data is fixed size, and we don't care about a lot of the contents, so we can skip right up to the number of transactions entry.

This is stored in a variable length integer format, and won't always be a single byte.

The specification for bitcoin variable length integers can be found here, and we can write python code to decode this as follows:

def _intFromBytes_LittleByteOrder(data):
    multiplier = 1
    result = 0
    for i in range(len(data)):
        byteValue = struct.unpack('<B', data[i:i + 1])[0]
        result += byteValue * multiplier
        multiplier = multiplier << 8
    return result

def DecodeVarInt(data, pos):
    if pos >= len(data):
        raise RanOutOfData()
    result = _intFromBytes_LittleByteOrder(data[pos:pos + 1])
    pos += 1
    if result < 253:
        return pos, result
    byteSize = 2 ** (result - 252)
    if pos + byteSize > len(data):
        raise RanOutOfData()
    return pos + byteSize, _intFromBytes_LittleByteOrder(data[pos:pos + byteSize])

DecodeVarInt() accepts a buffer and a position in that buffer, and returns a tuple with new position after decoding the variable integer record and the variable integer value.

The _intFromBytes_LittleByteOrder() helper is a bit scruffy, but designed to work with both python 2 and 3. If you only have to support Python 3.2 you can replace this with a call to the built-in method int.from_bytes(data, byteorder='little').

Decoding transactions

After the number of transactions entry we need to parse the actual transaction data itself, at least to the extent of working out where each transaction ends.

We can find the transaction data specificiation here, and there's also some useful additional discussion about transaction format in that Ken Shirriff blog post.

Based on this information, our final block decode method looks like this:

def GetTransactionsInBlock(data):
    try:
        pos = 80 # skip block header
        pos, numberOfTransactions = DecodeVarInt(data, pos)
        result = []
        for i in range(numberOfTransactions):
            startPos = pos
            pos += 4 # skip version
            pos, numberOfInputs = DecodeVarInt(data, pos)
            for i in range(numberOfInputs):
                pos += 32 # txid
                pos += 4 # vout
                pos, scriptLen = DecodeVarInt(data, pos)
                pos += scriptLen
                pos += 4 # sequence
            pos, numberOfOutputs = DecodeVarInt(data, pos)
            for i in range(numberOfOutputs):
                pos += 8 # output amount
                pos, scriptLen = DecodeVarInt(data, pos)
                pos += scriptLen
            pos += 4 # lock time
            result.append(data[startPos:pos])
        if pos != len(data):
            raise Exception('invalid block data')
        return result
    except RanOutOfData:
        raise Exception('invalid block data')

Putting it together

Note that we'll need to convert from hex encoded string data (as returned from the getblock RPC call) to bytes. Putting this all together we end up with the following:

    blockData_InHex = self._rpcHost.call('getblock', blockHash, False)
    blockData = binascii.unhexlify(blockData_InHex.encode('ascii'))
    rawTransactions = GetTransactionsInBlock(blockData)
    for rawTransaction in rawTransactions:
        # .. do something with this raw transaction data

Conclusion

It's interesting, I think, to see how this stuff works, at the byte level.

Decoding bitcoin blocks and transactions isn't so hard in practice, and it can potentially make sense to get your hands a bit dirty and get stuck into these details, even in situations where you're using RPC commands and generally writing 'higher level' code.

It's nice to let bitcoind take care of a bunch of messy details relating to peer to peer network connectivity and blockchain validity, but if you're writing an application that needs to scan the blockchain, and you're getting transactions through individual gettransaction (or getrawtransaction) RPC calls, you should look at changing this right away!

Context

This approach to scanning the blockchain is used in the SwapBill embedded protocol and results in a particularly big speedup in this specific case. With SwapBill potentially relevant transactions can be trivially identified from raw transaction data, and most non-SwapBill transactions are then quickly discarded without any further decoding.

Comments (discussion closed)

rodomonte1109 January 2016 17:51

Do you actually go deeper into the quest? I need a byte approach without using the rpc that slow down a lot, in python this seems the best approach https://github.com/tenthirt... but need some actual parsing on the scripts, do you develope faster solutions?

Thomas Young21 January 2016 12:19

No, I didn't try direct access to the block files, but this should definitely be faster.

If you try this, I will be interested to hear about how you get on!

rodomonte1121 January 2016 18:34

At the end I use this adding opcodes from the bitcoin code https://github.com/jtauber/... now I need to add parsing for multisig and other not standard transaction, but for the moment I suspend the project moving onto another.