ethrex-prover performance

Latest benchmarks against rsp

  • ethrex commit: 42073248334a6f517c22bd7b8faf30787724d9da (#5224)
  • rsp commit: 2c5718029e7c0b24a34b011088fef221489fc714
| Block (mainnet) | Gas used | ethrex (SP1) | rsp | (ethrex - rsp) / rsp * 100% |
|---|---|---|---|---|
| 23769082 | 7,949,562 | 02m 23s | 01m 27s | 63.2% |
| 23769083 | 44,943,006 | 12m 24s | 07m 49s | 58.7% |
| 23769084 | 27,075,543 | 08m 40s | Failed | - |
| 23769085 | 22,213,854 | 06m 40s | Failed | - |
| 23769086 | 28,364,699 | 07m 36s | 04m 45s | 60.0% |
| 23769087 | 17,523,985 | 06m 02s | Failed | - |
| 23769088 | 17,527,759 | 04m 50s | 02m 59s | 61.4% |
| 23769089 | 23,854,094 | 08m 14s | 04m 44s | 73.9% |
| 23769090 | 24,238,029 | 08m 11s | 04m 40s | 75.4% |
| 23769091 | 24,421,216 | 07m 02s | Failed | - |
| 23769092 | 21,714,227 | 06m 35s | 04m 01s | 64.3% |

Benchmark server hardware:

For ethrex:

  • AMD EPYC 7713 64-Core Processor
  • 128 GB RAM
  • RTX 4090 24 GB

For rsp:

  • AMD EPYC 7F72 24-Core Processor
  • 64 GB RAM
  • RTX 4090 24 GB

note

rsp exited with a "block gas used mismatch" error on the failed cases.

How to reproduce for ethrex:

  1. Clone ethrex-replay.
    1. Optionally, edit Cargo.toml to point the ethrex libraries at the specific branch/commit you want to benchmark.
    2. Run `cargo update` if you do change it.
  2. Run `cargo r -r -F "sp1,gpu" -p ethrex-replay -- blocks --action prove --zkvm sp1 --from 23769082 --to 23769092 --rpc-url <RPC>`.
    1. For ethrex, an RPC endpoint that implements debug_executionWitness (like an ethrex or reth node) works best.

How to reproduce for rsp:

  1. Clone rsp.
  2. Make the following change so that we measure rsp's proving time the same way we measure it for ethrex (the duration of the prove() call); you can then grep stdout for `proving time`:

```diff
diff --git a/crates/executor/host/src/full_executor.rs b/crates/executor/host/src/full_executor.rs
index 99a0478..d42a1d2 100644
--- a/crates/executor/host/src/full_executor.rs
+++ b/crates/executor/host/src/full_executor.rs
@@ -123,6 +123,7 @@ pub trait BlockExecutor<C: ExecutorComponents> {
             .map_err(|err| eyre::eyre!("{err}"))??;
 
             let proving_duration = proving_start.elapsed();
+            println!("proving time: {}", proving_duration.as_secs());
             let proof_bytes = bincode::serialize(&proof.proof).unwrap();
 
             hooks
```

  3. Run `SP1_PROVER=cuda cargo r -r --manifest-path bin/host/Cargo.toml -- --block-number <BLOCK NUMBER> --rpc-url <RPC> --prove`.
    1. For rsp, an Alchemy RPC endpoint works best.

Optimizations

| Name | PR | tl;dr |
|---|---|---|
| jumpdest opt. | https://github.com/lambdaclass/ethrex/pull/4608 | Cuts 15% of total zkVM cycles |
| trie opt. 1 | https://github.com/lambdaclass/ethrex/pull/4648 | Halves trie hashing calls |
| trie opt. 2 | https://github.com/lambdaclass/ethrex/pull/4723 | Trie hashing zkVM cycles reduced to 1/4; faster than risc0-trie |
| trie opt. 3 | https://github.com/lambdaclass/ethrex/pull/4763 | Trie get and insert ops' cycles reduced to 1/14 |
| trie opt. 4 | https://github.com/lambdaclass/ethrex/pull/5224 | Serializes resolved tries and removes decoding; reduces total cycles by 22% |
| ecpairing precompile | https://github.com/lambdaclass/ethrex/pull/4809 | Reduces ecpairing cycles from 138k to 6k (10% of total proving cycles gone) |
| ecmul precompile | https://github.com/lambdaclass/ethrex/pull/5133 | Reduces another 10% of total cycles |

Trie operations are among the most expensive in our prover right now. We are using risc0-trie as a fast zkVM trie reference to optimize our own; see the ethrex-trie optimization for zkvm section below for a detailed exploration of our trie vs risc0’s.

Detailed proving times (SP1)

Benchmark server hardware:

  • AMD EPYC 7713 64-Core Processor
  • 128 GB RAM
  • RTX 4090 24 GB

Absolute times

| Block (Mainnet) | Gas | ethrex main (70fc63) | ethrex (jumpdest opt.) | ethrex (ecpairing precompile) | ethrex (trie opt1.) | ethrex (trie opt2.) | ethrex (trie opt3.) | ethrex (trie opt1+2+3) | ethrex (all opts, our trie) | ethrex (risc0_trie) | rsp main (9a7048) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 23426993 | 35.7M | 20m 40s | 20m 04s | 20m 10s | 20m 12s | 19m 31s | 17m 20s | 16m 24s | 14m 38s | 14m 32s | 08m 39s |
| 23426994 | 20.7M | 13m 24s | 12m 55s | 12m 32s | 13m 08s | 12m 48s | 11m 31s | 10m 53s | 09m 18s | 10m 04s | 05m 39s |
| 23426995 | 16.6M | 10m 14s | 09m 54s | 09m 56s | 09m 56s | 09m 52s | 08m 39s | 08m 06s | 07m 20s | 07m 19s | 04m 33s |
| 23426996 | 22.5M | 16m 58s | 16m 37s | 15m 48s | 16m 50s | 15m 42s | 14m 44s | 14m 08s | 12m 25s | 12m 59s | 06m 39s |

Relative improvement vs ethrex main ((opt - base)/base * 100)

| Block (Mainnet) | Gas | ethrex (jumpdest opt.) | ethrex (ecpairing precompile) | ethrex (trie opt1.) | ethrex (trie opt2.) | ethrex (trie opt3.) | ethrex (trie opt1+2+3) | ethrex (all opts, our trie) | ethrex (risc0_trie) |
|---|---|---|---|---|---|---|---|---|---|
| 23426993 | 35.7M | -2.9% | -2.4% | -2.3% | -5.6% | -16.1% | -20.6% | -29.2% | -29.7% |
| 23426994 | 20.7M | -3.6% | -6.5% | -2.0% | -4.5% | -14.1% | -18.8% | -30.6% | -24.9% |
| 23426995 | 16.6M | -3.3% | -2.9% | -2.9% | -3.6% | -15.5% | -20.8% | -28.3% | -28.5% |
| 23426996 | 22.5M | -2.1% | -6.9% | -0.8% | -7.5% | -13.2% | -16.7% | -26.8% | -23.5% |

Relative to RSP main ((opt - rsp) / rsp * 100)

| Block | Gas | ethrex main | ethrex (jumpdest opt.) | ethrex (ecpairing precompile) | ethrex (trie opt1.) | ethrex (trie opt2.) | ethrex (trie opt3.) | ethrex (trie opt1+2+3) | ethrex (all opts, our trie) | ethrex (risc0_trie) |
|---|---|---|---|---|---|---|---|---|---|---|
| 23426993 | 35.7M | 138.9% | 132.0% | 133.1% | 133.5% | 125.6% | 100.4% | 89.6% | 69.2% | 68.0% |
| 23426994 | 20.7M | 137.2% | 128.6% | 121.8% | 132.4% | 126.5% | 103.8% | 92.6% | 64.6% | 78.2% |
| 23426995 | 16.6M | 124.9% | 117.6% | 118.3% | 118.3% | 116.8% | 90.1% | 78.0% | 61.2% | 60.8% |
| 23426996 | 22.5M | 155.1% | 149.9% | 137.6% | 153.1% | 136.1% | 121.6% | 112.5% | 86.7% | 95.2% |

ethrex-trie optimization for zkvm

Overview

Using the risc0-trie we get the following performance gains in cycle count over our ethrex-trie:

  • rebuild_storage_trie(): 218,999 -> 64,435 (x3.4)
  • rebuild_state_trie(): 119,510 -> 37,426 (x3.2)
  • apply_account_updates(): 139,519 -> 16,702 (x8.4)
  • state_trie_root(): 63,482 -> 17,276 (x3.7)
  • validate_receipts_root(): 13,068 -> 2,013 (x6.5)

Our goal is to have our trie perform as close as possible to risc0’s.

Flamegraph overview

ethrex-trie (filtering for ethrex_trie functions): [flamegraph image]

risc0-trie (filtering for risc0_ethereum_trie functions): [flamegraph image]

Optimizations to our trie

1. Caching the initial node hashes

https://github.com/lambdaclass/ethrex/pull/4648

We were calculating the hashes of all the initial state trie (and storage trie) nodes, but we weren't caching them when building the tries, so there was double hashing. https://github.com/lambdaclass/ethrex/pull/4648 fixes this by introducing a function to manually set the cached hash of a node, and by caching whenever a trie is built with Trie::from_nodes.
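As a rough sketch of what this looks like (illustrative names and a placeholder hash function, not ethrex's actual API):

```rust
use std::sync::OnceLock;

type NodeHash = [u8; 32];

// Placeholder stand-in; ethrex really uses keccak over the node's RLP.
fn keccak(bytes: &[u8]) -> NodeHash {
    let mut out = [0u8; 32];
    for (i, b) in bytes.iter().enumerate() {
        out[i % 32] ^= *b; // NOT a real hash, just a deterministic stub
    }
    out
}

struct Node {
    rlp: Vec<u8>, // the node's RLP encoding
    cached_hash: OnceLock<NodeHash>,
}

impl Node {
    // Memoized hash: computed at most once per node.
    fn compute_hash(&self) -> NodeHash {
        *self.cached_hash.get_or_init(|| keccak(&self.rlp))
    }

    // The fix: when building a trie from witness nodes we already know
    // each node's hash, so seed the cache instead of hashing the node a
    // second time later.
    fn set_cached_hash(&self, hash: NodeHash) {
        let _ = self.cached_hash.set(hash);
    }
}

fn main() {
    let node = Node { rlp: vec![0xc0], cached_hash: OnceLock::new() };
    let known = keccak(&node.rlp); // hash already computed while building
    node.set_cached_hash(known);
    assert_eq!(node.compute_hash(), known); // served from cache, no re-hash
}
```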

Calls to compute_hash were reduced by half after those changes:

Before: [flamegraph image]

After: [flamegraph image]

Although the total proving time didn't change much:

[proving time comparison image]

(proved with SP1 in l2-gpu-3, RTX 4090)

2. memcpy calls on trie hashing

ethrex-trie: [flamegraph image]

risc0-trie: [flamegraph image]

Our Node::compute_hash is almost on par with risc0’s Node::memoize (which hashes and caches the hash), but we have a big memcpy call that costs almost twice as many cycles as the actual hashing.

Initially I thought this was overhead from OnceLock::initialize, since risc0 uses an Option to cache the hash, but apparently that’s not the case: I ran a small experiment comparing initialization cycles for both and got the following results:

```
Report cycle tracker: {"oncelock": 3830000, "option": 3670000}
```
Experiment code
lib

```rust
use std::sync::OnceLock;

// Hash cache behind a OnceLock, as in ethrex-trie.
pub struct OnceLockCache {
    pub inner: OnceLock<u128>,
}

impl OnceLockCache {
    pub fn init(&self, inner: u128) {
        // The cycle-tracker prints are recognized by SP1's executor, which
        // aggregates the cycles spent between the start/end markers.
        println!("cycle-tracker-report-start: oncelock");
        self.inner.get_or_init(|| compute_inner(inner));
        println!("cycle-tracker-report-end: oncelock");
    }
}

// Hash cache behind an Option, as in risc0-trie.
pub struct OptionCache {
    pub inner: Option<u128>,
}

impl OptionCache {
    pub fn init(&mut self, inner: u128) {
        println!("cycle-tracker-report-start: option");
        self.inner = Some(compute_inner(inner));
        println!("cycle-tracker-report-end: option");
    }
}

pub fn compute_inner(inner: u128) -> u128 {
    inner.pow(2)
}
```

guest program

```rust
#![no_main]
sp1_zkvm::entrypoint!(main);

use std::sync::OnceLock;

use fibonacci_lib::{OnceLockCache, OptionCache};

pub fn main() {
    let mut oncelocks = Vec::with_capacity(10000);
    let mut options = Vec::with_capacity(10000);

    // The range is u128 (inferred from init's parameter), so the values
    // deliberately exceed what fits in a u64.
    for i in (1 << 64)..((1 << 64) + 10000) {
        let oncelock_cache = OnceLockCache {
            inner: OnceLock::new(),
        };
        oncelock_cache.init(i);
        oncelocks.push(oncelock_cache);
    }

    for i in (1 << 64)..((1 << 64) + 10000) {
        let mut option_cache = OptionCache { inner: None };
        option_cache.init(i);
        options.push(option_cache);
    }
}
```

In the flamegraph it's not clear where that memcpy originates (any ideas?). The calls on top of it are encode_bytes() (from our RLP lib) and some sha3 calls.

Node hashing is composed of two operations: first the node is encoded into RLP, then we keccak hash the result.
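A minimal sketch of that pipeline, assuming the sha3 crate our trie uses; the RLP-encoding step is taken as an input here rather than reproduced:

```rust
use sha3::{Digest, Keccak256};

// Node hashing, step two: keccak over the node's RLP encoding.
// `rlp_encoded` stands in for the output of our RLP encoder.
fn hash_node(rlp_encoded: &[u8]) -> [u8; 32] {
    Keccak256::digest(rlp_encoded).into()
}
```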

RLP Encoding

Our RLP lib uses a temp buffer to write encoded payloads to, and in the end it finalizes the encoding by writing the payload prefix (which contains a length) to an output buffer, plus the actual payload. This means we are allocating two buffers, initializing them, and copying bytes to both of them for each node that’s being encoded and hashed.
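For illustration, here is a simplified RLP byte-string encoder showing the prefix-plus-payload shape (not our actual encoder):

```rust
// Simplified RLP byte-string encoding: a length-carrying prefix is written
// first, then the payload is copied after it. Our real encoder builds the
// payload in a temp buffer before this finalization step.
fn rlp_encode_bytes(payload: &[u8], out: &mut Vec<u8>) {
    match payload {
        // A single byte below 0x80 is its own encoding.
        [b] if *b < 0x80 => out.push(*b),
        // Short string: prefix 0x80 + len, then the payload bytes.
        _ if payload.len() <= 55 => {
            out.push(0x80 + payload.len() as u8);
            out.extend_from_slice(payload);
        }
        // Long string: 0xb7 + length-of-length, big-endian length, payload.
        _ => {
            let len = payload.len().to_be_bytes();
            let len = &len[len.iter().position(|&b| b != 0).unwrap()..];
            out.push(0xb7 + len.len() as u8);
            out.extend_from_slice(len);
            out.extend_from_slice(payload);
        }
    }
}

fn main() {
    // Sizing `out` up front avoids the reallocation memcpys that show up
    // in the flamegraph.
    let payload = b"hello";
    let mut out = Vec::with_capacity(1 + payload.len());
    rlp_encode_bytes(payload, &mut out);
    assert_eq!(out, [0x85, b'h', b'e', b'l', b'l', b'o']);
}
```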

First I tried preallocating the buffers (which are Vecs) by estimating a node's final RLP length.

This was done in the l2/opt_rlp_buffer branch; it reduced the memcpy length by almost 15%.

Then I tried bypassing our RLP encoder altogether by implementing a small, fast encoding for the Branch node only (the most common node type). This reduced memcpy further, by 30% over main; a sketch of the idea follows.
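A hypothetical sketch of such a fast path; the branch layout here (16 optional child hashes plus an empty value) is simplified relative to ethrex's actual Branch node:

```rust
// Hypothetical fast path for Branch nodes: an RLP list of 16 children plus
// a value. When every child is either empty (1 byte) or a 32-byte hash
// (33 bytes), the payload length is known before writing anything, so the
// node is encoded in a single pass with no temp buffer.
fn encode_branch(children: &[Option<[u8; 32]>; 16], out: &mut Vec<u8>) {
    let payload_len: usize = children
        .iter()
        .map(|c| if c.is_some() { 33 } else { 1 })
        .sum::<usize>()
        + 1; // the (empty) value slot

    out.reserve(3 + payload_len);
    // RLP list header.
    if payload_len <= 55 {
        out.push(0xc0 + payload_len as u8);
    } else if payload_len <= 255 {
        out.push(0xf8); // 0xf7 + 1 length byte
        out.push(payload_len as u8);
    } else {
        out.push(0xf9); // 0xf7 + 2 length bytes
        out.extend_from_slice(&(payload_len as u16).to_be_bytes());
    }
    for child in children {
        match child {
            Some(hash) => {
                out.push(0xa0); // 0x80 + 32: 32-byte string prefix
                out.extend_from_slice(hash);
            }
            None => out.push(0x80), // empty string
        }
    }
    out.push(0x80); // empty value
}

fn main() {
    let mut children: [Option<[u8; 32]>; 16] = [None; 16];
    children[0] = Some([0xaa; 32]);
    let mut out = Vec::new();
    encode_branch(&children, &mut out);
    assert_eq!(out[0], 0xc0 + 49); // 33 + 15 + 1 = 49-byte payload
}
```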

I didn't test implementing fast RLP for the other node types, as I was expecting a bigger reduction in memcpy from Branch alone.

Keccak

A difference between our trie and risc0's is that they use the tiny-keccak crate, while we use sha3. I tried replacing it in our trie, but the memcpy got bigger.
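For comparison, the tiny-keccak equivalent of the sha3 sketch shown earlier:

```rust
use tiny_keccak::{Hasher, Keccak};

// Same keccak-256 over the node's RLP encoding, via tiny-keccak's API.
fn hash_node(rlp_encoded: &[u8]) -> [u8; 32] {
    let mut hasher = Keccak::v256();
    hasher.update(rlp_encoded);
    let mut out = [0u8; 32];
    hasher.finalize(&mut out);
    out
}
```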

Regarding memory, I also noticed that when hashing we initialize a 32-byte array and write the hash into it afterwards. I wanted to test whether this mattered, so I experimented with creating an uninitialized buffer with MaybeUninit (unsafe Rust) and writing the hash to it. This had no impact on memcpy's length.
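The experiment looked roughly like this (a sketch, not the branch's exact code):

```rust
use std::mem::MaybeUninit;

// Write a 32-byte hash into an uninitialized buffer instead of a
// zero-initialized one, to check whether the zeroing shows up in memcpy.
fn copy_hash_uninit(hash: &[u8; 32]) -> [u8; 32] {
    let mut buf = MaybeUninit::<[u8; 32]>::uninit();
    unsafe {
        // Safety: all 32 bytes are initialized before assume_init.
        std::ptr::copy_nonoverlapping(hash.as_ptr(), buf.as_mut_ptr() as *mut u8, 32);
        buf.assume_init()
    }
}

fn main() {
    let h = [7u8; 32];
    assert_eq!(copy_hash_uninit(&h), h);
}
```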

Both changes are in the l2/opt_keccak branch.

Final

In the end the problem was a combination of slow RLP encoding and the fact that we did a preorder traversal while hashing the trie recursively. This was fixed in https://github.com/lambdaclass/ethrex/pull/4723.

3. Node cloning in NodeRef::get_node

Whenever we get, insert, or remove from a trie we call NodeRef::get_node, which clones the Node from memory and returns it. This means that for each operation over a trie we clone every node relevant to that operation.
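Schematically, with illustrative types rather than ethrex's actual NodeRef API, the fix is to return borrows instead of owned clones:

```rust
#[derive(Clone)]
struct Node {
    rlp: Vec<u8>,
}

// Before (schematically): every lookup cloned the node out of storage, so
// a single get/insert/remove copied every node along the path it walked.
fn get_node_cloned(nodes: &[Node], idx: usize) -> Node {
    nodes[idx].clone()
}

// After: borrow the node instead; trie walks no longer copy nodes.
fn get_node_ref(nodes: &[Node], idx: usize) -> &Node {
    &nodes[idx]
}

fn main() {
    let nodes = vec![Node { rlp: vec![0xc0] }];
    let owned = get_node_cloned(&nodes, 0); // allocates + copies
    let borrowed = get_node_ref(&nodes, 0); // no copy
    assert_eq!(owned.rlp, borrowed.rlp);
}
```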

This is fixed in https://github.com/lambdaclass/ethrex/pull/4763; see that PR's description for details.