GPU Accelerated Zero-Knowledge Proving

Sep 7, 2025

••• views

••• likes

Zero-knowledge proofs enables verifiable computation without opening up private transactions, creating a powerful building blocks for privacy-preserving transactions and scalable onchain verification. However, the computational intensity of proof gen has overtime been a major issue. A well-known tradeoff in ZK systems are the "resource-hungry prover": achieving a small, fast-to-verify proof requires high computation but often quite slow, blocking real-time use.

Modern GPUs drastically change this equation. By leveraging their parallel design, they turn what was a fundamental issue into a solvable engineering problem. ZK systems of today primarily use two constructions: zk-SNARKs and zk-STARKs, both of which rely on core operations that map perfectly to GPU acceleration.

What Actually Takes So Long?

Nearly all proving time goes to two operations: Multi-Scalar Multiplication (MSM) and Number Theoretic Transform (NTT).

What's Multi-Scalar Multiplication?

MSM computes the sum of scaled elliptic curve points: $S = \sum_{i=1}^{n} a_i \cdot P_i$ . This single operation consumes a whooping 60-80% of proving time in SNARKs. It personally requires a high number of random memory accesses to fetch elliptic curve points, making memory bandwidth the primary performance issue here. The Pippenger algorithm parallelizes this task by sorting scalars into buckets.

fn pippenger_msm<G: CurveGroup>(
    bases: &[G::Affine],
    scalars: &[G::ScalarField],
    window_size: usize,
) -> G {
    let num_buckets = 1 << window_size;
    let mut buckets = vec![G::zero(); num_buckets];
    
    for (base, scalar) in bases.iter().zip(scalars.iter()) {
        let bucket_index = extract_window(scalar, window_size);
        buckets[bucket_index] += base;
    }
    
    // bucket aggregation
    buckets.into_iter()
           .enumerate()
           .map(|(i, bucket)| bucket * G::ScalarField::from(i))
           .fold(G::zero(), |acc, term| acc + term)
}

So, what happens in the GPU implementation is it partitions this across thousands of threads, achieving over 700 million point operations per second on an RTX 4090.

What's Number Theoretic Transform?

NTT on the other hand mutliplies polynomials fast unlike MSM's random memory jumps, NTT follows a predictable pattern, exactly what GPUs are built for.

__global__ void ntt_radix2_kernel(
    uint64_t* coeffs,
    uint64_t* twiddles,
    uint32_t stage,
    uint32_t n
) {
    uint32_t tid = threadIdx.x + blockIdx.x * blockDim.x;
    uint32_t stride = 1 << stage;
    uint32_t m = n >> (stage + 1);
    
    if (tid >= m) return;
    
    uint32_t k = tid % stride;
    uint32_t j = tid / stride;
    uint32_t idx1 = j * (stride << 1) + k;
    uint32_t idx2 = idx1 + stride;
    
    uint64_t u = coeffs[idx1];
    uint64_t v = mod_mul(coeffs[idx2], twiddles[k], MODULUS);
    
    coeffs[idx1] = mod_add(u, v, MODULUS);
    coeffs[idx2] = mod_sub(u, v, MODULUS);
}

This sequential pattern allows performant NTT implementations to achieve 85-95% memory bandwidth.

Case Study: CPU vs GPU Proving Perf

Think about Alice and Bob, both operating zkEVM Sequencers for their respective Layer-2 chains. They process identical transaction batches but use different proving infra.

Scenario: zkEVM Block Proving

Both sequencers must prove a block containing 500 transactions with the following operations:

1,200 ERC-20 transfers
300 Uniswap V3 swaps
200 storage updates
100 contract deployments

This generates a circuit with approximately $2^{26}$ constraints, typical for a medium-sized zkEVM block.

Alice's CPU Approach

Alice runs her prover on a high-end CPU - AMD EPYC 7763, 64 cores:

impl CpuZkEvmProver {
    fn prove_block(&self, block: &Block) -> ProofResult {
        let start = Instant::now();
        
        // circuit compilation
        let circuit = self.compile_block_circuit(block); // 45s
        
        // msm operations
        let witness_commitment = self.cpu_msm(&circuit.witnesses); // 180s
        
        // ntt operations
        let constraint_proof = self.cpu_ntt(&circuit.constraints); // 90s
        
        // final proof
        let proof = self.assemble_proof(witness_commitment, constraint_proof);
        
        ProofResult {
            proof,
            proving_time: start.elapsed(), // ~315s total
            memory_usage: 45_000_000_000, // 45GB peak
        }
    }
    
    fn cpu_msm(&self, witnesses: &[Fr]) -> G1Projective {
        witnesses.iter()
                .zip(self.setup.iter())
                .map(|(scalar, base)| base.mul(scalar))
                .fold(G1Projective::identity(), |acc, point| acc + point)
    }
}

Alice's Results:

Proving time: 5.25 minutes per block
Memory usage: 45GB peak
Throughput: ~95 transactions/minute proven
Hardware cost: $8,000 server

Bob's GPU-Accelerated Approach

Bob uses an RTX 4090 with ICICLE acceleration:

use icicle_bls12_381::{curve::G1Projective, msm};
use icicle_core::msm::MSMConfig;
 
impl GpuZkEvmProver {
    fn prove_block(&self, block: &Block) -> ProofResult {
        let start = Instant::now();
        
        // circuit compilation
        let circuit = self.compile_block_circuit_gpu(block); // 12s
        
        // gpu msm - 16,384 cores parallel
        let witness_commitment = self.gpu_msm(&circuit.witnesses); // 8s
        
        // gpu ntt
        let constraint_proof = self.gpu_ntt(&circuit.constraints); // 3 seconds
        
        // final proof
        let proof = self.assemble_proof(witness_commitment, constraint_proof);
        
        ProofResult {
            proof,
            proving_time: start.elapsed(), // ~23s total
            memory_usage: 18_000_000_000, // 18GB peak
        }
    }
    
    fn gpu_msm(&self, witnesses: &[Fr]) -> Result<G1Projective> {
        let cfg = MSMConfig {
            precompute_factor: 8,
            c: 16,
            ..Default::default()
        };
        
        msm::msm(witnesses, &self.gpu_setup, &cfg)
    }
}

Bob's Results:

Proving time: 23 seconds per block
Memory usage: 18GB peak
Throughput: ~1,300 transactions/minute proven
Hardware cost: $1,800 (RTX 4090 + basic server)

Comparing the Results

Operation	Alice (CPU)	Bob (GPU)	Speedup
Circuit Compilation	45s	12s	3.8×
MSM Operations	180s	8s	22.5×
NTT Operations	90s	3s	30×
Total Block Proof	315s	23s	13.7×

Why This Changes Everything

Network Finality:

Alice's network: 5.25-minute finality for L1 confirmations
Bob's network: 23-second finality for L1 confirmations

Cost Per Transaction

// one every 12s
const BLOCKS_PER_DAY: u32 = 7200;
 
impl ProvingEconomics {
    fn daily_capacity(&self, proving_time_seconds: u32) -> u32 {
        let daily_seconds = 24 * 60 * 60;
        daily_seconds / proving_time_seconds
    }
}

Daily proving capacity (Alice vs Bob):

Alice can prove: 274 blocks/day that's 137,000 transactions
Bob can prove: 3,756 blocks/day that's 1,878,000 transactions

What Alice Needs to Compete

Alice would need 14 CPUs to match Bob's single GPU setup, leading to:

14× more expensive hardware cost - $112,000 vs$ 8,000
14× higher more power consumption - 14kW vs 1kW
Complex distributed coordination overhead

How Memory Gets Read

The performance difference starts from fundamentally how both servers access memory:

// alice's cpu msm - sequential, cache-bound
for (scalar, base) in witnesses.iter().zip(setup.iter()) {
    result += base * scalar; // cache miss every iteration
}
 
// bob's gpu msm - parallel, lightweight
__global__ void msm_kernel(scalar_t* scalars, point_t* bases, point_t* buckets) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int bucket_idx = extract_window(scalars[tid], WINDOW_SIZE);
    atomicAdd(&buckets[bucket_idx], bases[tid]); // coalesced memory access
}

This case study for e.g simply shows why GPU acceleration has become essential for production zkEVM deployments. Bob's infra can serve 13.7× more users while holding on to faster finality and lower operational costs.

SNARK vs. STARK Acceleration Patterns

SNARK Systems

Groth16 and PLONK are MSM-heavy. Groth16, for instance, spends about 80% of its proving time on elliptic curve type operations. PLONK on the other hand distributes the load more evenly between MSM ~60% and polynomial operations ~30%, requiring balanced optimization for both.

impl<E: Pairing> Groth16Prover<E> {
    fn create_proof(&self, circuit: &Circuit<E::ScalarField>) -> Result<Proof<E>> {
        let (a, b, c) = circuit.to_matrices();
        
        let g_a = self.params.alpha_g1.msm(&a);
        let g_b = self.params.beta_g2.msm(&b);  
        let g_c = self.params.gamma_g1.msm(&c);
        
        Ok(Proof { a: g_a, b: g_b, c: g_c })
    }
}

STARK Systems

STARKs are dominated by polynomial arithmetic NTTs and hashing e.g. Poseidon or Blake3. Based on how regular memory access patterns in STARKs is, it often leads to better GPU memory bandwidth utilization compared to the more random-access patterns in SNARK MSM operations.

impl<F: PrimeField> StarkProver<F> {
    fn prove(&self, trace: &ExecutionTrace<F>) -> StarkProof<F> {
        // polynomial interpolation via ntt
        let trace_poly = self.interpolate(trace);
        
        let composition_poly = self.evaluate_constraints(&trace_poly);
        
        let fri_proof = self.fri_commit(&composition_poly);
        
        StarkProof {
            trace_commitment: self.commit(&trace_poly),
            composition_commitment: self.commit(&composition_poly),
            fri_proof,
        }
    }
}

zkEVM Acceleration

zkEVMs present unique challenges due to the complexity of EVM instructions and large state representations. The proving process involves generating constraints for each EVM opcode and state transition.

Proving the EVM

impl ZkEvmProver {
    fn prove_block(&self, block: &Block) -> ZkEvmProof {
        let mut circuit = EvmCircuit::new();
        
        for tx in &block.transactions {
            for (pc, opcode) in tx.trace.iter().enumerate() {
                match opcode {
                    OpCode::ADD => {
                        let (a, b) = self.stack.pop2();
                        let result = a + b;
                        circuit.add_constraint(a + b - result);
                        self.stack.push(result);
                    },
                    OpCode::SSTORE => {
                        let (key, value) = self.stack.pop2();
                        circuit.add_storage_constraint(key, value);
                        self.update_state_root(key, value);
                    },
                    // ...
                }
            }
        }
        
        self.prove_circuit(circuit)
    }
}

Each opcode, from ADD to SSTORE, must be translated into a set of polynomial constraints.

Proving the State Tree

zkEVMs generally must always prove correct memory access patterns and storage updates. This typically involves Merkle tree operations for state root verification

fn constrain_merkle_update(
    circuit: &mut Circuit,
    old_root: Variable,
    new_root: Variable,
    key: Variable,
    old_value: Variable,
    new_value: Variable,
    path: &[Variable],
) {
    let mut current = old_value;
    let mut updated = new_value;
    
    for (i, &sibling) in path.iter().enumerate() {
        let bit = circuit.extract_bit(key, i);
        
        let hash_input = circuit.select(bit, (sibling, current), (current, sibling));
        current = circuit.poseidon_hash(hash_input.0, hash_input.1);
        
        let updated_input = circuit.select(bit, (sibling, updated), (updated, sibling));
        updated = circuit.poseidon_hash(updated_input.0, updated_input.1);
    }
    
    circuit.assert_equal(current, old_root);
    circuit.assert_equal(updated, new_root);
}

GPU acceleration generally becomes important for the extensive hashing operations required in state transition verifications

Conclusion

GPUs themselve didn't just speed up ZK proving, it has basically changed the way ZK proving is done by figuring out the "resource-hungry prover" problem, Now we have Layer 2s finalizing in seconds and private DeFi running in production that were previously not entirely possible. As circuit complexity increases with applications like zkEVMs, GPU speed up is no longer just about optimization, it is a fundamental requirement for production deployments.

Essential Links

ICICLE: An open-source GPU library by Ingonyama for accelerating ZK primitives. GitHub
Arkworks: A Rust-based ecosystem for designing and implementing ZK proof systems. Website
Halo2: A PLONK-based proving system developed by the Zcash Foundation, widely used in projects like the PSE zkEVM. GitHub
ZPrize: An industry competition focused on accelerating ZK performance through software and hardware. Website
"zk-SNARKs: A Primer" by Vitalik Buterin: An excellent high-level explanation of how zk-SNARKs are constructed. Article
"Anatomy of a STARK" by StarkWare: A deep dive into the mathematical components of zk-STARKs.Article
"Introducing ICICLE: An Open-Source GPU Library for Zero Knowledge Acceleration" by Ingonyama Article
"Blitzar, A novel opensource zero-knowledge proof for SQL operations" by Space and Time. Blitzar provides a faster and more robust framework for building GPU-accelerated ZK proofs Github

•••

Share this post:

Posts

X Website GitHub