
GPU Accelerated Zero-Knowledge Proving

Sep 7, 2025

Zero-knowledge proofs enable verifiable computation without revealing the underlying private data, creating a powerful building block for privacy-preserving transactions and scalable onchain verification. However, the computational intensity of proof generation has long been a major obstacle. A well-known tradeoff in ZK systems is the "resource-hungry prover": achieving a small, fast-to-verify proof demands heavy computation on the proving side, which is often too slow for real-time use.

Modern GPUs drastically change this equation. By leveraging their parallel design, they turn what looked like a fundamental limitation into a solvable engineering problem. Today's ZK systems primarily use two constructions, zk-SNARKs and zk-STARKs, both of which rely on core operations that map naturally to GPU acceleration.

What Actually Takes So Long?

Nearly all proving time goes to two operations: Multi-Scalar Multiplication (MSM) and Number Theoretic Transform (NTT).

What's Multi-Scalar Multiplication?

MSM computes the sum of scaled elliptic curve points: $S = \sum_{i=1}^{n} a_i \cdot P_i$. This single operation consumes a whopping 60-80% of proving time in SNARKs. It requires a large number of random memory accesses to fetch elliptic curve points, making memory bandwidth the primary bottleneck. The Pippenger algorithm reduces the work by sorting scalars into buckets, and those buckets can then be processed in parallel.

use ark_ec::CurveGroup;

// Single-window Pippenger sketch: a full MSM repeats this per scalar window
// and combines the windows with doublings. extract_window (defined elsewhere)
// selects the current window's bits from a scalar.
fn pippenger_msm<G: CurveGroup>(
    bases: &[G::Affine],
    scalars: &[G::ScalarField],
    window_size: usize,
) -> G {
    let num_buckets = 1 << window_size;
    let mut buckets = vec![G::zero(); num_buckets];
    
    // scatter: each point is added into the bucket selected by its scalar's window bits
    for (base, scalar) in bases.iter().zip(scalars.iter()) {
        let bucket_index = extract_window(scalar, window_size);
        buckets[bucket_index] += *base;
    }
    
    // bucket aggregation: the running-sum trick computes sum_i (i * bucket_i)
    // using only point additions, with no scalar multiplications
    let mut running = G::zero();
    let mut total = G::zero();
    for bucket in buckets.into_iter().skip(1).rev() {
        running += bucket;
        total += running;
    }
    total
}

The GPU implementation partitions this work across thousands of threads, achieving over 700 million point operations per second on an RTX 4090.
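
Conceptually, the bucket phase splits the points into chunks, with each worker accumulating its own private buckets that are merged at the end. Below is a minimal CPU sketch of that partitioning strategy, with rayon threads standing in for GPU thread blocks; it is not the actual GPU code, and extract_window is the same assumed helper as above.

use ark_ec::CurveGroup;
use rayon::prelude::*;

fn parallel_bucket_phase<G: CurveGroup>(
    bases: &[G::Affine],
    scalars: &[G::ScalarField],
    window_size: usize,
) -> Vec<G> {
    let num_buckets = 1 << window_size;
    bases
        .par_chunks(1024)
        .zip(scalars.par_chunks(1024))
        .map(|(base_chunk, scalar_chunk)| {
            // per-worker local buckets avoid contention on shared state
            let mut local = vec![G::zero(); num_buckets];
            for (base, scalar) in base_chunk.iter().zip(scalar_chunk.iter()) {
                local[extract_window(scalar, window_size)] += *base;
            }
            local
        })
        .reduce(
            || vec![G::zero(); num_buckets],
            |mut acc, local| {
                // merge local bucket sets bucket-by-bucket
                for (a, l) in acc.iter_mut().zip(local.into_iter()) {
                    *a += l;
                }
                acc
            },
        )
}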

What's Number Theoretic Transform?

NTT, on the other hand, is what makes polynomial multiplication fast. Unlike MSM's random memory jumps, NTT follows a predictable butterfly access pattern, which is exactly what GPUs are built for.

// One radix-2 butterfly stage; the host launches this kernel log2(n) times.
// mod_mul / mod_add / mod_sub are modular-arithmetic helpers defined elsewhere.
__global__ void ntt_radix2_kernel(
    uint64_t* coeffs,
    uint64_t* twiddles,   // twiddle factors for this stage
    uint32_t stage,
    uint32_t n
) {
    uint32_t tid = threadIdx.x + blockIdx.x * blockDim.x;
    uint32_t stride = 1 << stage;
    uint32_t num_butterflies = n >> 1;   // every stage performs n/2 butterflies
    
    if (tid >= num_butterflies) return;
    
    // map the thread id to one butterfly: group j, position k within the group
    uint32_t k = tid % stride;
    uint32_t j = tid / stride;
    uint32_t idx1 = j * (stride << 1) + k;
    uint32_t idx2 = idx1 + stride;
    
    uint64_t u = coeffs[idx1];
    uint64_t v = mod_mul(coeffs[idx2], twiddles[k], MODULUS);
    
    coeffs[idx1] = mod_add(u, v, MODULUS);
    coeffs[idx2] = mod_sub(u, v, MODULUS);
}

This regular access pattern lets well-tuned NTT implementations reach 85-95% of peak memory bandwidth.
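
To see how the per-stage kernel composes into a full transform, here is a minimal CPU reference sketch in Rust of the same radix-2 butterflies, looping over all log2(n) stages. It assumes the input is already in bit-reversed order and that twiddles[stage] holds that stage's roots of unity; the modular helpers use u128 widening for simplicity and stand in for the kernel's mod_mul/mod_add/mod_sub.

fn add_mod(a: u64, b: u64, m: u64) -> u64 { ((a as u128 + b as u128) % m as u128) as u64 }
fn sub_mod(a: u64, b: u64, m: u64) -> u64 { if a >= b { a - b } else { a + m - b } }
fn mul_mod(a: u64, b: u64, m: u64) -> u64 { ((a as u128 * b as u128) % m as u128) as u64 }

// reference NTT: one pass per stage, n/2 butterflies per stage
fn ntt_reference(coeffs: &mut [u64], twiddles: &[Vec<u64>], modulus: u64) {
    let n = coeffs.len();
    for stage in 0..n.trailing_zeros() as usize {
        let stride = 1 << stage;
        for j in 0..n / (2 * stride) {
            for k in 0..stride {
                let idx1 = j * (2 * stride) + k;
                let idx2 = idx1 + stride;
                let u = coeffs[idx1];
                let v = mul_mod(coeffs[idx2], twiddles[stage][k], modulus);
                coeffs[idx1] = add_mod(u, v, modulus);
                coeffs[idx2] = sub_mod(u, v, modulus);
            }
        }
    }
}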

Case Study: CPU vs GPU Proving Perf

Consider Alice and Bob, who both operate zkEVM sequencers for their respective Layer-2 chains. They process identical transaction batches but use different proving infrastructure.

Scenario: zkEVM Block Proving

Both sequencers must prove a block containing 500 transactions. This generates a circuit with approximately $2^{26}$ constraints, typical for a medium-sized zkEVM block.

Alice's CPU Approach

Alice runs her prover on a high-end CPU (AMD EPYC 7763, 64 cores):

impl CpuZkEvmProver {
    fn prove_block(&self, block: &Block) -> ProofResult {
        let start = Instant::now();
        
        // circuit compilation
        let circuit = self.compile_block_circuit(block); // 45s
        
        // msm operations
        let witness_commitment = self.cpu_msm(&circuit.witnesses); // 180s
        
        // ntt operations
        let constraint_proof = self.cpu_ntt(&circuit.constraints); // 90s
        
        // final proof
        let proof = self.assemble_proof(witness_commitment, constraint_proof);
        
        ProofResult {
            proof,
            proving_time: start.elapsed(), // ~315s total
            memory_usage: 45_000_000_000, // 45GB peak
        }
    }
    
    fn cpu_msm(&self, witnesses: &[Fr]) -> G1Projective {
        witnesses.iter()
                .zip(self.setup.iter())
                .map(|(scalar, base)| base.mul(scalar))
                .fold(G1Projective::identity(), |acc, point| acc + point)
    }
}

Alice's Results:

- Total proving time: ~315 seconds per block
- Peak memory usage: 45 GB

Bob's GPU-Accelerated Approach

Bob uses an RTX 4090 with ICICLE acceleration:

use icicle_bls12_381::{curve::G1Projective, msm};
use icicle_core::msm::MSMConfig;
 
impl GpuZkEvmProver {
    fn prove_block(&self, block: &Block) -> ProofResult {
        let start = Instant::now();
        
        // circuit compilation
        let circuit = self.compile_block_circuit_gpu(block); // 12s
        
        // gpu msm - 16,384 cores parallel
        let witness_commitment = self.gpu_msm(&circuit.witnesses).expect("GPU MSM failed"); // 8s
        
        // gpu ntt
        let constraint_proof = self.gpu_ntt(&circuit.constraints); // 3s
        
        // final proof
        let proof = self.assemble_proof(witness_commitment, constraint_proof);
        
        ProofResult {
            proof,
            proving_time: start.elapsed(), // ~23s total
            memory_usage: 18_000_000_000, // 18GB peak
        }
    }
    
    fn gpu_msm(&self, witnesses: &[Fr]) -> Result<G1Projective> {
        let cfg = MSMConfig {
            precompute_factor: 8,
            c: 16,
            ..Default::default()
        };
        
        msm::msm(witnesses, &self.gpu_setup, &cfg)
    }
}

Bob's Results:

- Total proving time: ~23 seconds per block
- Peak memory usage: 18 GB

Comparing the Results

| Operation | Alice (CPU) | Bob (GPU) | Speedup |
| --- | --- | --- | --- |
| Circuit Compilation | 45s | 12s | 3.8× |
| MSM Operations | 180s | 8s | 22.5× |
| NTT Operations | 90s | 3s | 30× |
| Total Block Proof | 315s | 23s | 13.7× |

Why This Changes Everything

Network Finality: Bob's ~23-second proofs keep pace with 12-second blocks, finalizing within roughly two block intervals, while Alice's ~315-second proofs leave her more than 26 blocks behind the chain tip.

Cost Per Transaction
// one every 12s
const BLOCKS_PER_DAY: u32 = 7200;
 
impl ProvingEconomics {
    fn daily_capacity(&self, proving_time_seconds: u32) -> u32 {
        let daily_seconds = 24 * 60 * 60;
        daily_seconds / proving_time_seconds
    }
}

Daily proving capacity for Alice versus Bob follows directly from this helper.
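
As a quick usage sketch (hypothetical: ProvingEconomics' fields aren't shown above, so it is treated here as a unit struct):

let econ = ProvingEconomics;
println!("Alice: {} block proofs/day", econ.daily_capacity(315)); // 86,400 / 315 ≈ 274
println!("Bob:   {} block proofs/day", econ.daily_capacity(23));  // 86,400 / 23  ≈ 3,756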

What Alice Needs to Compete

Alice would need roughly 14 CPU servers (315s / 23s ≈ 13.7) to match the throughput of Bob's single GPU, with correspondingly higher hardware and operational costs.

How Memory Gets Read

The performance difference ultimately comes down to how the two machines access memory:

// alice's cpu msm - sequential, cache-bound
for (scalar, base) in witnesses.iter().zip(setup.iter()) {
    result += base * scalar; // cache miss on nearly every point fetch
}
 
// bob's gpu msm - one thread per point (schematic: a production kernel replaces
// the point-level atomic with a sort-by-bucket pass plus segmented reduction)
__global__ void msm_kernel(scalar_t* scalars, point_t* bases, point_t* buckets) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int bucket_idx = extract_window(scalars[tid], WINDOW_SIZE);
    atomicAdd(&buckets[bucket_idx], bases[tid]); // reads of scalars/bases are coalesced
}

This case study shows why GPU acceleration has become essential for production zkEVM deployments: Bob's infrastructure can serve 13.7× more users while maintaining faster finality and lower operational costs.

SNARK vs. STARK Acceleration Patterns

SNARK Systems

Groth16 and PLONK are MSM-heavy. Groth16, for instance, spends about 80% of its proving time on elliptic curve operations. PLONK distributes the load more evenly, roughly 60% MSM and 30% polynomial operations, so it benefits from optimizing both.

impl<E: Pairing> Groth16Prover<E> {
    // schematic sketch: each proof element reduces to one large MSM over the proving key
    fn create_proof(&self, circuit: &Circuit<E::ScalarField>) -> Result<Proof<E>> {
        let (a, b, c) = circuit.to_matrices();
        
        let g_a = self.params.alpha_g1.msm(&a);
        let g_b = self.params.beta_g2.msm(&b);  
        let g_c = self.params.gamma_g1.msm(&c);
        
        Ok(Proof { a: g_a, b: g_b, c: g_c })
    }
}

STARK Systems

STARKs are dominated by polynomial arithmetic (NTTs) and hashing (e.g. Poseidon or Blake3). Because their memory access patterns are regular, STARKs often achieve better GPU memory bandwidth utilization than the more random-access patterns of SNARK MSM operations.

impl<F: PrimeField> StarkProver<F> {
    fn prove(&self, trace: &ExecutionTrace<F>) -> StarkProof<F> {
        // polynomial interpolation via ntt
        let trace_poly = self.interpolate(trace);
        
        // constraint evaluation over the trace polynomials
        let composition_poly = self.evaluate_constraints(&trace_poly);
        
        // FRI low-degree commitment
        let fri_proof = self.fri_commit(&composition_poly);
        
        StarkProof {
            trace_commitment: self.commit(&trace_poly),
            composition_commitment: self.commit(&composition_poly),
            fri_proof,
        }
    }
}

zkEVM Acceleration

zkEVMs present unique challenges due to the complexity of EVM instructions and large state representations. The proving process involves generating constraints for each EVM opcode and state transition.

Proving the EVM

impl ZkEvmProver {
    fn prove_block(&self, block: &Block) -> ZkEvmProof {
        let mut circuit = EvmCircuit::new();
        
        for tx in &block.transactions {
            for (pc, opcode) in tx.trace.iter().enumerate() {
                match opcode {
                    OpCode::ADD => {
                        let (a, b) = self.stack.pop2();
                        let result = a + b;
                        circuit.add_constraint(a + b - result);
                        self.stack.push(result);
                    },
                    OpCode::SSTORE => {
                        let (key, value) = self.stack.pop2();
                        circuit.add_storage_constraint(key, value);
                        self.update_state_root(key, value);
                    },
                    // ...
                }
            }
        }
        
        self.prove_circuit(circuit)
    }
}

Each opcode, from ADD to SSTORE, must be translated into a set of polynomial constraints.
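
As a rough illustration of that translation, here is a sketch of the constraints a single ADD might contribute. The helper names (assert_boolean, range_check_256, two_pow_256) are hypothetical, and real zkEVM circuits split 256-bit words into limbs rather than treating them as single field elements.

// rough sketch: constraints one ADD opcode might contribute
fn constrain_add(circuit: &mut Circuit, a: Variable, b: Variable, out: Variable, carry: Variable) {
    // EVM addition wraps at 2^256, so the witness carries an explicit overflow bit
    circuit.assert_boolean(carry);
    
    // a + b = out + carry * 2^256
    circuit.add_constraint(a + b - out - carry * Circuit::two_pow_256());
    
    // the result must itself be a well-formed 256-bit word
    circuit.range_check_256(out);
}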

Proving the State Tree

zkEVMs must also prove correct memory access patterns and storage updates. This typically involves Merkle tree operations for state root verification:

fn constrain_merkle_update(
    circuit: &mut Circuit,
    old_root: Variable,
    new_root: Variable,
    key: Variable,
    old_value: Variable,
    new_value: Variable,
    path: &[Variable],
) {
    let mut current = old_value;
    let mut updated = new_value;
    
    // walk the path from leaf to root, recomputing both the old and new roots
    // over the same sibling nodes
    for (i, &sibling) in path.iter().enumerate() {
        let bit = circuit.extract_bit(key, i);
        
        let hash_input = circuit.select(bit, (sibling, current), (current, sibling));
        current = circuit.poseidon_hash(hash_input.0, hash_input.1);
        
        let updated_input = circuit.select(bit, (sibling, updated), (updated, sibling));
        updated = circuit.poseidon_hash(updated_input.0, updated_input.1);
    }
    
    circuit.assert_equal(current, old_root);
    circuit.assert_equal(updated, new_root);
}

GPU acceleration becomes especially important for the extensive hashing these state-transition checks require.
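
Each touched storage slot contributes an independent Merkle path, so the hashes along different paths batch naturally. Below is a minimal CPU sketch of that batching, with rayon threads standing in for a GPU dispatch and the hash function (e.g. Poseidon) passed in rather than assumed.

use rayon::prelude::*;

struct MerklePath {
    leaf: [u8; 32],
    siblings: Vec<[u8; 32]>,
    directions: Vec<bool>, // true = current node is the right child at this level
}

// recompute one root per path; every path is independent, so the work parallelizes freely
fn recompute_roots<H>(paths: &[MerklePath], hash: H) -> Vec<[u8; 32]>
where
    H: Fn([u8; 32], [u8; 32]) -> [u8; 32] + Sync,
{
    paths
        .par_iter()
        .map(|path| {
            let mut node = path.leaf;
            for (&sibling, &is_right) in path.siblings.iter().zip(path.directions.iter()) {
                node = if is_right { hash(sibling, node) } else { hash(node, sibling) };
            }
            node
        })
        .collect()
}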

Conclusion

GPUs didn't just speed up ZK proving; they changed how it is done by tackling the "resource-hungry prover" problem head-on. Layer 2s now finalize in seconds and private DeFi runs in production, things that were previously out of reach. As circuit complexity grows with applications like zkEVMs, GPU acceleration is no longer just an optimization: it is a fundamental requirement for production deployments.
