Field Arithmetic, FFT, and MSM
1. Why this chapter exists
Everything halo2 does, from polynomial commitment to verifier challenges, eventually bottoms out in one of three primitives:
- field arithmetic on (the Pallas / Vesta scalar field),
- the radix-2 number-theoretic transform (NTT) that we will simply call FFT,
- multi-scalar multiplication on the corresponding curve, .
If you change any of them and the constants do not match the formal spec, every test in the suite passes and the verifier still accepts forged proofs. That is the failure mode the maintainers worry about, and it is the failure mode CI cannot catch on its own. This chapter tells you which file is the law for each primitive.
2. Definitions
Definition (Pasta fields). Pallas and Vesta are a 2-cycle of
prime-order curves defined in
pasta_curves. Let and
be the two primes, with and .
The Pallas base field is the Vesta scalar field, and
vice versa. This 2-cycle is what makes recursive proof composition
practical. Both primes have divisible by
, so contains primitive -th roots of
unity; the FFT supports any for .
Definition (Multiplicative subgroup of order ). Let and let be a primitive -th root of unity. The set is the multiplicative subgroup of order . The PLONKish circuit lives on : row corresponds to the point .
Definition (FFT). Given coefficients of a polynomial , the FFT computes the evaluations in field operations. The radix-2 Cooley-Tukey FFT used here recurses on and combines with the butterfly where and are the even-index and odd-index polynomials.
Definition (MSM, Pippenger's algorithm). Given scalars and points , the multi-scalar multiplication problem is to compute . Pippenger's algorithm windows the scalars into -bit chunks, accumulates each chunk's contribution into "buckets", reduces the buckets, and combines across windows. The cost is roughly point operations plus for finalization, where is the bit length of the scalars.
3. The code
3.1 The module surface
//! This module provides common utilities, traits and structures for group,
//! field and polynomial arithmetic.
pub use ff::Field;
use group::{
ff::{BatchInvert, PrimeField},
Group as _, GroupOpsOwned, ScalarMulOwned,
};
use maybe_rayon::prelude::*;
pub use pasta_curves::arithmetic::*;
use crate::multicore::{self, TheBestReduce};
/// This represents an element of a group with basic operations that can be
/// performed. This allows an FFT implementation (for example) to operate
/// generically over either a field or elliptic curve group.
pub trait FftGroup<Scalar: Field>:
Copy + Send + Sync + 'static + GroupOpsOwned + ScalarMulOwned<Scalar>
{
}
impl<T, Scalar> FftGroup<Scalar> for T
where
Scalar: Field,
T: Copy + Send + Sync + 'static + GroupOpsOwned + ScalarMulOwned<Scalar>,
{
}
pub use pasta_curves::arithmetic::*; re-exports the curve traits
(CurveAffine, CurveExt, Group, Coordinates, etc.) so the
rest of the crate can refer to them without naming
pasta_curves directly.
3.2 Multi-scalar multiplication
The fast path uses Pippenger:
///
/// This will use multithreading if beneficial.
pub fn best_multiexp<C: CurveAffine>(coeffs: &[C::Scalar], bases: &[C]) -> C::Curve {
assert_eq!(coeffs.len(), bases.len());
let c = if bases.len() < 4 {
1
} else if bases.len() < 32 {
3
} else {
(f64::from(bases.len() as u32)).ln().ceil() as usize
};
let mut multi_buckets: Vec<Buckets<C>> = vec![Buckets::new(c); (256 / c) + 1];
let num_threads = multicore::current_num_threads();
if coeffs.len() > num_threads {
multi_buckets
.par_iter_mut()
.enumerate()
.rev()
.map(|(i, buckets)| {
let mut acc = buckets.sum(coeffs, bases, i);
(0..c * i).for_each(|_| acc = acc.double());
acc
})
.the_best_reduce(C::Curve::identity, |a, b| a + b)
.expect("multi_buckets always contains at least 1 bucket")
} else {
multi_buckets
.iter_mut()
.enumerate()
.rev()
.map(|(i, buckets)| buckets.sum(coeffs, bases, i))
.fold(C::Curve::identity(), |mut sum, bucket| {
// restore original evaluation point
(0..c).for_each(|_| sum = sum.double());
sum + bucket
})
}
}
A few invariants worth knowing:
- The window size is chosen as a function of . The code uses a heuristic of for , falling back to for tiny inputs.
coeffsandbasesmust have equal length; the function does not check this on the hot path. Length mismatch is a programmer bug.- For very small inputs the dispatch falls through to
small_multiexp, a naive double-and-add:
/// Performs a small multi-exponentiation operation.
/// Uses the double-and-add algorithm with doublings shared across points.
pub fn small_multiexp<C: CurveAffine>(coeffs: &[C::Scalar], bases: &[C]) -> C::Curve {
let coeffs: Vec<_> = coeffs.iter().map(|a| a.to_repr()).collect();
let mut acc = C::Curve::identity();
// for byte idx
for byte_idx in (0..32).rev() {
// for bit idx
for bit_idx in (0..8).rev() {
acc = acc.double();
// for each coeff
for coeff_idx in 0..coeffs.len() {
let byte = coeffs[coeff_idx].as_ref()[byte_idx];
if ((byte >> bit_idx) & 1) != 0 {
acc += bases[coeff_idx];
}
}
}
}
acc
}
The 0.3.2 changelog notes "performance has been improved" for
best_multiexp; PR
#796 is the source of
that improvement.
3.3 The FFT
/// $\omega^{-1}$ in place of $\omega$ and dividing each resulting field element
/// by $n$.
///
/// This will use multithreading if beneficial.
pub fn best_fft<Scalar: Field, G: FftGroup<Scalar>>(a: &mut [G], omega: Scalar, log_n: u32) {
fn bitreverse(mut n: usize, l: usize) -> usize {
let mut r = 0;
for _ in 0..l {
r = (r << 1) | (n & 1);
n >>= 1;
}
r
}
let threads = multicore::current_num_threads();
let log_threads = log2_floor(threads);
let n = a.len();
assert_eq!(n, 1 << log_n);
for k in 0..n {
let rk = bitreverse(k, log_n as usize);
if k < rk {
a.swap(rk, k);
}
}
// precompute twiddle factors
let twiddles: Vec<_> = (0..(n / 2))
.scan(Scalar::ONE, |w, _| {
let tw = *w;
*w *= ω
Some(tw)
})
.collect();
if log_n <= log_threads {
let mut chunk = 2_usize;
let mut twiddle_chunk = n / 2;
for _ in 0..log_n {
a.chunks_mut(chunk).for_each(|coeffs| {
let (left, right) = coeffs.split_at_mut(chunk / 2);
// case when twiddle factor is one
let (a, left) = left.split_at_mut(1);
let (b, right) = right.split_at_mut(1);
let t = b[0];
b[0] = a[0];
a[0] += &t;
b[0] -= &t;
left.iter_mut()
.zip(right.iter_mut())
.enumerate()
.for_each(|(i, (a, b))| {
let mut t = *b;
t *= &twiddles[(i + 1) * twiddle_chunk];
*b = *a;
*a += &t;
*b -= &t;
});
});
chunk *= 2;
twiddle_chunk /= 2;
}
} else {
recursive_butterfly_arithmetic(a, n, 1, &twiddles)
}
}
The dispatcher best_fft chooses between a serial recursive
implementation and a parallel one based on log_n and the number
of rayon threads. The recursion is in recursive_butterfly_arithmetic:
/// This perform recursive butterfly arithmetic
pub fn recursive_butterfly_arithmetic<Scalar: Field, G: FftGroup<Scalar>>(
a: &mut [G],
n: usize,
twiddle_chunk: usize,
twiddles: &[Scalar],
) {
if n == 2 {
let t = a[1];
a[1] = a[0];
a[0] += &t;
a[1] -= &t;
} else {
let (left, right) = a.split_at_mut(n / 2);
multicore::join(
|| recursive_butterfly_arithmetic(left, n / 2, twiddle_chunk * 2, twiddles),
|| recursive_butterfly_arithmetic(right, n / 2, twiddle_chunk * 2, twiddles),
);
// case when twiddle factor is one
let (a, left) = left.split_at_mut(1);
let (b, right) = right.split_at_mut(1);
let t = b[0];
b[0] = a[0];
a[0] += &t;
b[0] -= &t;
left.iter_mut()
.zip(right.iter_mut())
.enumerate()
.for_each(|(i, (a, b))| {
let mut t = *b;
t *= &twiddles[(i + 1) * twiddle_chunk];
*b = *a;
*a += &t;
*b -= &t;
});
}
}
The slice a is mutated in place; on entry it holds bit-reversed
coefficients, on exit it holds evaluations in standard order.
3.4 Parallelization helper
}
/// This simple utility function will parallelize an operation that is to be
/// performed over a mutable slice.
pub fn parallelize<T: Send, F: Fn(&mut [T], usize) + Send + Sync + Clone>(v: &mut [T], f: F) {
let n = v.len();
let num_threads = multicore::current_num_threads();
let mut chunk = n / num_threads;
if chunk < num_threads {
chunk = n;
}
multicore::scope(|scope| {
for (chunk_num, v) in v.chunks_mut(chunk).enumerate() {
let f = f.clone();
scope.spawn(move |_| {
let start = chunk_num * chunk;
f(v, start);
});
}
});
}
The closure receives a mutable subslice and the global index of its
first element. This is the workhorse for almost every parallel
loop in the crate. Always prefer it to chunks_mut plus
rayon::join because it respects the multicore feature flag.
3.5 Other small but load-bearing helpers
eval_polynomial: Horner's method for evaluating .compute_inner_product: , used in the IPA prover.kate_division: divide a polynomial by , used inside the multiopen prover.lagrange_interpolate: given pairs, recover the polynomial. Used in tests and the IPA setup.
4. Failure modes
- Wrong root of unity. If a caller passes the wrong
to
best_fft, the polynomial in/out conversion silently corrupts and every downstream commitment is wrong. TheEvaluationDomaintype (chapter 04) exists exactly to avoid this; never callbest_fftdirectly outsidepoly/domain.rsunless you absolutely know what you are doing. - Edge case in
best_multiexpfor . The function returns the curve identity. The change in #796 required adding the testtest_create_proofprecisely to lock that behaviour in. - Forgetting
--features multicore. Without it, every parallel helper degrades to serial. The CI default has multicore enabled; benchmarks without it can be 8 to 16x slower. - Bit-reversal mismatch. The recursive FFT expects bit-reversed
input; if you ever call it from a new code path, you must
bit-reverse the input first (see how
EvaluationDomaindoes it).
5. Spec pointers
- The Pasta cycle and the choice of primes are documented in the
pasta_curvesREADME and in Daira Hopwood's explainer. - The Cooley-Tukey FFT is described in any algorithms textbook; for the field-theoretic NTT view, see chapter 30 of CLRS.
- Pippenger's algorithm: Pippenger, "On the evaluation of powers and monomials", SIAM J. Comput. (1976). For a modern exposition in the SNARK setting, see Bernstein et al., eprint 2020/1145.
6. Exercises
-
Open
halo2_proofs/src/arithmetic.rsand identify the function that chooses the Pippenger window size. Write down the formula it uses as a function ofm. -
Find the test at the bottom of
halo2_proofs/src/arithmetic.rsthat checksbest_multiexpagainstsmall_multiexpand run it:cargo test --release -p halo2_proofs arithmetic -
Add an extra
#[test]tohalo2_proofs/src/arithmetic.rsverifying thatbest_multiexpon an empty input returns the curve identity. Run the test and confirm it passes. -
Run the FFT benchmark:
cargo bench -p halo2_proofs --bench fftObserve how the per-operation cost scales with
log_n; the slope should be roughly linear (cost is per element).
Answers in the code
- Exercise 1: the window size selection is at the top of
best_multiexpinhalo2_proofs/src/arithmetic.rs. - Exercise 2: the matching test is named
test_best_multiexp(or similar) inside the#[cfg(test)] modat the bottom of the same file.
7. Further reading
- The
pasta_curvessource for the curve implementations themselves. - The
groupandfftraits, which are the abstract interfaces halo2 codes against.