Zero-knowledge proofs enables verifiable computation without opening up private transactions, creating a powerful building blocks for privacy-preserving transactions and scalable onchain verification. However, the computational intensity of proof gen has overtime been a major issue. A well-known tradeoff in ZK systems are the "resource-hungry prover": achieving a small, fast-to-verify proof requires high computation but often quite slow, blocking real-time use.
Modern GPUs drastically change this equation. By leveraging their parallel design, they turn what was a fundamental issue into a solvable engineering problem. ZK systems of today primarily use two constructions: zk-SNARKs
and zk-STARKs
, both of which rely on core operations that map perfectly to GPU acceleration.
Nearly all proving time goes to two operations: Multi-Scalar Multiplication (MSM) and Number Theoretic Transform (NTT).
MSM computes the sum of scaled elliptic curve points: . This single operation consumes a whooping 60-80%
of proving time in SNARKs
. It personally requires a high number of random memory accesses to fetch elliptic curve points, making memory bandwidth the primary performance issue here. The Pippenger algorithm parallelizes this task by sorting scalars into buckets.
fn pippenger_msm<G: CurveGroup>(
bases: &[G::Affine],
scalars: &[G::ScalarField],
window_size: usize,
) -> G {
let num_buckets = 1 << window_size;
let mut buckets = vec![G::zero(); num_buckets];
for (base, scalar) in bases.iter().zip(scalars.iter()) {
let bucket_index = extract_window(scalar, window_size);
buckets[bucket_index] += base;
}
// bucket aggregation
buckets.into_iter()
.enumerate()
.map(|(i, bucket)| bucket * G::ScalarField::from(i))
.fold(G::zero(), |acc, term| acc + term)
}
So, what happens in the GPU implementation is it partitions this across thousands of threads, achieving over 700 million
point operations per second on an RTX 4090
.
NTT on the other hand mutliplies polynomials fast unlike MSM's random memory jumps, NTT follows a predictable pattern, exactly what GPUs are built for.
__global__ void ntt_radix2_kernel(
uint64_t* coeffs,
uint64_t* twiddles,
uint32_t stage,
uint32_t n
) {
uint32_t tid = threadIdx.x + blockIdx.x * blockDim.x;
uint32_t stride = 1 << stage;
uint32_t m = n >> (stage + 1);
if (tid >= m) return;
uint32_t k = tid % stride;
uint32_t j = tid / stride;
uint32_t idx1 = j * (stride << 1) + k;
uint32_t idx2 = idx1 + stride;
uint64_t u = coeffs[idx1];
uint64_t v = mod_mul(coeffs[idx2], twiddles[k], MODULUS);
coeffs[idx1] = mod_add(u, v, MODULUS);
coeffs[idx2] = mod_sub(u, v, MODULUS);
}
This sequential pattern allows performant NTT implementations to achieve 85-95%
memory bandwidth.
Think about Alice and Bob, both operating zkEVM Sequencers
for their respective Layer-2
chains. They process identical transaction batches but use different proving infra.
Both sequencers must prove a block containing 500
transactions with the following operations:
This generates a circuit with approximately constraints, typical for a medium-sized zkEVM
block.
Alice runs her prover on a high-end CPU
- AMD EPYC 7763, 64 cores
:
impl CpuZkEvmProver {
fn prove_block(&self, block: &Block) -> ProofResult {
let start = Instant::now();
// circuit compilation
let circuit = self.compile_block_circuit(block); // 45s
// msm operations
let witness_commitment = self.cpu_msm(&circuit.witnesses); // 180s
// ntt operations
let constraint_proof = self.cpu_ntt(&circuit.constraints); // 90s
// final proof
let proof = self.assemble_proof(witness_commitment, constraint_proof);
ProofResult {
proof,
proving_time: start.elapsed(), // ~315s total
memory_usage: 45_000_000_000, // 45GB peak
}
}
fn cpu_msm(&self, witnesses: &[Fr]) -> G1Projective {
witnesses.iter()
.zip(self.setup.iter())
.map(|(scalar, base)| base.mul(scalar))
.fold(G1Projective::identity(), |acc, point| acc + point)
}
}
Alice's Results:
Bob uses an RTX 4090
with ICICLE
acceleration:
use icicle_bls12_381::{curve::G1Projective, msm};
use icicle_core::msm::MSMConfig;
impl GpuZkEvmProver {
fn prove_block(&self, block: &Block) -> ProofResult {
let start = Instant::now();
// circuit compilation
let circuit = self.compile_block_circuit_gpu(block); // 12s
// gpu msm - 16,384 cores parallel
let witness_commitment = self.gpu_msm(&circuit.witnesses); // 8s
// gpu ntt
let constraint_proof = self.gpu_ntt(&circuit.constraints); // 3 seconds
// final proof
let proof = self.assemble_proof(witness_commitment, constraint_proof);
ProofResult {
proof,
proving_time: start.elapsed(), // ~23s total
memory_usage: 18_000_000_000, // 18GB peak
}
}
fn gpu_msm(&self, witnesses: &[Fr]) -> Result<G1Projective> {
let cfg = MSMConfig {
precompute_factor: 8,
c: 16,
..Default::default()
};
msm::msm(witnesses, &self.gpu_setup, &cfg)
}
}
Bob's Results:
Operation | Alice (CPU) | Bob (GPU) | Speedup |
---|---|---|---|
Circuit Compilation | 45s | 12s | 3.8× |
MSM Operations | 180s | 8s | 22.5× |
NTT Operations | 90s | 3s | 30× |
Total Block Proof | 315s | 23s | 13.7× |
Network Finality:
// one every 12s
const BLOCKS_PER_DAY: u32 = 7200;
impl ProvingEconomics {
fn daily_capacity(&self, proving_time_seconds: u32) -> u32 {
let daily_seconds = 24 * 60 * 60;
daily_seconds / proving_time_seconds
}
}
Daily proving capacity (Alice vs Bob):
Alice would need 14 CPUs to match Bob's single GPU setup, leading to:
The performance difference starts from fundamentally how both servers access memory:
// alice's cpu msm - sequential, cache-bound
for (scalar, base) in witnesses.iter().zip(setup.iter()) {
result += base * scalar; // cache miss every iteration
}
// bob's gpu msm - parallel, lightweight
__global__ void msm_kernel(scalar_t* scalars, point_t* bases, point_t* buckets) {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int bucket_idx = extract_window(scalars[tid], WINDOW_SIZE);
atomicAdd(&buckets[bucket_idx], bases[tid]); // coalesced memory access
}
This case study for e.g simply shows why GPU acceleration has become essential for production zkEVM
deployments. Bob's infra can serve 13.7×
more users while holding on to faster finality and lower operational costs.
Groth16
and PLONK
are MSM-heavy. Groth16
, for instance, spends about 80%
of its proving time on elliptic curve type operations. PLONK
on the other hand distributes the load more evenly between MSM ~60%
and polynomial operations ~30%
, requiring balanced optimization for both.
impl<E: Pairing> Groth16Prover<E> {
fn create_proof(&self, circuit: &Circuit<E::ScalarField>) -> Result<Proof<E>> {
let (a, b, c) = circuit.to_matrices();
let g_a = self.params.alpha_g1.msm(&a);
let g_b = self.params.beta_g2.msm(&b);
let g_c = self.params.gamma_g1.msm(&c);
Ok(Proof { a: g_a, b: g_b, c: g_c })
}
}
STARKs
are dominated by polynomial arithmetic NTTs
and hashing e.g. Poseidon
or Blake3
. Based on how regular memory access patterns in STARKs
is, it often leads to better GPU memory bandwidth utilization compared to the more random-access patterns in SNARK
MSM operations.
impl<F: PrimeField> StarkProver<F> {
fn prove(&self, trace: &ExecutionTrace<F>) -> StarkProof<F> {
// polynomial interpolation via ntt
let trace_poly = self.interpolate(trace);
let composition_poly = self.evaluate_constraints(&trace_poly);
let fri_proof = self.fri_commit(&composition_poly);
StarkProof {
trace_commitment: self.commit(&trace_poly),
composition_commitment: self.commit(&composition_poly),
fri_proof,
}
}
}
zkEVMs present unique challenges due to the complexity of EVM
instructions and large state representations. The proving process involves generating constraints for each EVM
opcode and state transition.
impl ZkEvmProver {
fn prove_block(&self, block: &Block) -> ZkEvmProof {
let mut circuit = EvmCircuit::new();
for tx in &block.transactions {
for (pc, opcode) in tx.trace.iter().enumerate() {
match opcode {
OpCode::ADD => {
let (a, b) = self.stack.pop2();
let result = a + b;
circuit.add_constraint(a + b - result);
self.stack.push(result);
},
OpCode::SSTORE => {
let (key, value) = self.stack.pop2();
circuit.add_storage_constraint(key, value);
self.update_state_root(key, value);
},
// ...
}
}
}
self.prove_circuit(circuit)
}
}
Each opcode, from ADD to SSTORE, must be translated into a set of polynomial constraints.
zkEVMs generally must always prove correct memory access patterns and storage updates. This typically involves Merkle tree
operations for state root verification
fn constrain_merkle_update(
circuit: &mut Circuit,
old_root: Variable,
new_root: Variable,
key: Variable,
old_value: Variable,
new_value: Variable,
path: &[Variable],
) {
let mut current = old_value;
let mut updated = new_value;
for (i, &sibling) in path.iter().enumerate() {
let bit = circuit.extract_bit(key, i);
let hash_input = circuit.select(bit, (sibling, current), (current, sibling));
current = circuit.poseidon_hash(hash_input.0, hash_input.1);
let updated_input = circuit.select(bit, (sibling, updated), (updated, sibling));
updated = circuit.poseidon_hash(updated_input.0, updated_input.1);
}
circuit.assert_equal(current, old_root);
circuit.assert_equal(updated, new_root);
}
GPU acceleration generally becomes important for the extensive hashing operations required in state transition verifications
GPUs themselve didn't just speed up ZK proving, it has basically changed the way ZK proving is done by figuring out the "resource-hungry prover" problem, Now we have Layer 2s finalizing in seconds and private DeFi running in production that were previously not entirely possible. As circuit complexity increases with applications like zkEVMs
, GPU speed up is no longer just about optimization, it is a fundamental requirement for production deployments.