Field Arithmetic, FFT, and MSM

1. Why this chapter exists

Everything halo2 does, from polynomial commitment to verifier challenges, eventually bottoms out in one of three primitives:

field arithmetic on $\mathbb{F}_p$ (the Pallas / Vesta scalar field),
the radix-2 number-theoretic transform (NTT) that we will simply call FFT,
multi-scalar multiplication on the corresponding curve, $\sum_i k_i G_i$ .

If you change any of them and the constants do not match the formal spec, every test in the suite passes and the verifier still accepts forged proofs. That is the failure mode the maintainers worry about, and it is the failure mode CI cannot catch on its own. This chapter tells you which file is the law for each primitive.

2. Definitions

Definition (Pasta fields). Pallas and Vesta are a 2-cycle of prime-order curves defined in pasta_curves. Let $p$ and $q$ be the two primes, with $p \approx 2^{255}$ and $q \approx 2^{255}$ . The Pallas base field $\mathbb{F}_p$ is the Vesta scalar field, and vice versa. This 2-cycle is what makes recursive proof composition practical. Both primes have $|\mathbb{F}^{\times}|$ divisible by $2^{32}$ , so $\mathbb{F}$ contains primitive $2^{32}$ -th roots of unity; the FFT supports any $n = 2^k$ for $k \leq 32$ .

Definition (Multiplicative subgroup of order $n$ ). Let $n = 2^k$ and let $\omega \in \mathbb{F}_p^\times$ be a primitive $n$ -th root of unity. The set $H = \{1, \omega, \omega^2, \dots, \omega^{n-1}\}$ is the multiplicative subgroup of order $n$ . The PLONKish circuit lives on $H$ : row $i$ corresponds to the point $\omega^i$ .

Definition (FFT). Given coefficients $(a_0, \dots, a_{n-1})$ of a polynomial $a(X) = \sum_i a_i X^i$ , the FFT computes the evaluations $(a(\omega^0), \dots, a(\omega^{n-1}))$ in $O(n \log n)$ field operations. The radix-2 Cooley-Tukey FFT used here recurses on $n / 2$ and combines with the butterfly $\bigl(a(\omega^i), a(\omega^{i+n/2})\bigr) = \bigl(e(\omega^{2i}) + \omega^i o(\omega^{2i}), e(\omega^{2i}) - \omega^i o(\omega^{2i})\bigr)$ where $e$ and $o$ are the even-index and odd-index polynomials.

Definition (MSM, Pippenger's algorithm). Given scalars $k_1, \dots, k_m \in \mathbb{F}_p$ and points $G_1, \dots, G_m \in \mathbb{G}$ , the multi-scalar multiplication problem is to compute $\sum_i [k_i] G_i$ . Pippenger's algorithm windows the scalars into $w$ -bit chunks, accumulates each chunk's contribution into $2^w$ "buckets", reduces the buckets, and combines across windows. The cost is roughly $m / \log m$ point operations plus $O(\lambda)$ for finalization, where $\lambda$ is the bit length of the scalars.

3. The code

3.1 The module surface

halo2_proofs/src/arithmetic.rs
//! This module provides common utilities, traits and structures for group,
//! field and polynomial arithmetic.

pub use ff::Field;
use group::{
    ff::{BatchInvert, PrimeField},
    Group as _, GroupOpsOwned, ScalarMulOwned,
};
use maybe_rayon::prelude::*;
pub use pasta_curves::arithmetic::*;

use crate::multicore::{self, TheBestReduce};

/// This represents an element of a group with basic operations that can be
/// performed. This allows an FFT implementation (for example) to operate
/// generically over either a field or elliptic curve group.
pub trait FftGroup<Scalar: Field>:
    Copy + Send + Sync + 'static + GroupOpsOwned + ScalarMulOwned<Scalar>
{
}

impl<T, Scalar> FftGroup<Scalar> for T
where
    Scalar: Field,
    T: Copy + Send + Sync + 'static + GroupOpsOwned + ScalarMulOwned<Scalar>,
{
}

pub use pasta_curves::arithmetic::*; re-exports the curve traits (CurveAffine, CurveExt, Group, Coordinates, etc.) so the rest of the crate can refer to them without naming pasta_curves directly.

3.2 Multi-scalar multiplication

The fast path uses Pippenger:

halo2_proofs/src/arithmetic.rs
///
/// This will use multithreading if beneficial.
pub fn best_multiexp<C: CurveAffine>(coeffs: &[C::Scalar], bases: &[C]) -> C::Curve {
    assert_eq!(coeffs.len(), bases.len());

    let c = if bases.len() < 4 {
        1
    } else if bases.len() < 32 {
        3
    } else {
        (f64::from(bases.len() as u32)).ln().ceil() as usize
    };

    let mut multi_buckets: Vec<Buckets<C>> = vec![Buckets::new(c); (256 / c) + 1];
    let num_threads = multicore::current_num_threads();
    if coeffs.len() > num_threads {
        multi_buckets
            .par_iter_mut()
            .enumerate()
            .rev()
            .map(|(i, buckets)| {
                let mut acc = buckets.sum(coeffs, bases, i);
                (0..c * i).for_each(|_| acc = acc.double());
                acc
            })
            .the_best_reduce(C::Curve::identity, |a, b| a + b)
            .expect("multi_buckets always contains at least 1 bucket")
    } else {
        multi_buckets
            .iter_mut()
            .enumerate()
            .rev()
            .map(|(i, buckets)| buckets.sum(coeffs, bases, i))
            .fold(C::Curve::identity(), |mut sum, bucket| {
                // restore original evaluation point
                (0..c).for_each(|_| sum = sum.double());
                sum + bucket
            })
    }
}

A few invariants worth knowing:

The window size $w$ is chosen as a function of $m$ . The code uses a heuristic of $w \approx \log_2 m - 3$ for $m \geq 32$ , falling back to $w = 3$ for tiny inputs.
coeffs and bases must have equal length; the function does not check this on the hot path. Length mismatch is a programmer bug.
For very small inputs the dispatch falls through to small_multiexp, a naive double-and-add:

halo2_proofs/src/arithmetic.rs
/// Performs a small multi-exponentiation operation.
/// Uses the double-and-add algorithm with doublings shared across points.
pub fn small_multiexp<C: CurveAffine>(coeffs: &[C::Scalar], bases: &[C]) -> C::Curve {
    let coeffs: Vec<_> = coeffs.iter().map(|a| a.to_repr()).collect();
    let mut acc = C::Curve::identity();

    // for byte idx
    for byte_idx in (0..32).rev() {
        // for bit idx
        for bit_idx in (0..8).rev() {
            acc = acc.double();
            // for each coeff
            for coeff_idx in 0..coeffs.len() {
                let byte = coeffs[coeff_idx].as_ref()[byte_idx];
                if ((byte >> bit_idx) & 1) != 0 {
                    acc += bases[coeff_idx];
                }
            }
        }
    }

    acc
}

The 0.3.2 changelog notes "performance has been improved" for best_multiexp; PR #796 is the source of that improvement.

3.3 The FFT

halo2_proofs/src/arithmetic.rs
/// $\omega^{-1}$ in place of $\omega$ and dividing each resulting field element
/// by $n$.
///
/// This will use multithreading if beneficial.
pub fn best_fft<Scalar: Field, G: FftGroup<Scalar>>(a: &mut [G], omega: Scalar, log_n: u32) {
    fn bitreverse(mut n: usize, l: usize) -> usize {
        let mut r = 0;
        for _ in 0..l {
            r = (r << 1) | (n & 1);
            n >>= 1;
        }
        r
    }

    let threads = multicore::current_num_threads();
    let log_threads = log2_floor(threads);
    let n = a.len();
    assert_eq!(n, 1 << log_n);

    for k in 0..n {
        let rk = bitreverse(k, log_n as usize);
        if k < rk {
            a.swap(rk, k);
        }
    }

    // precompute twiddle factors
    let twiddles: Vec<_> = (0..(n / 2))
        .scan(Scalar::ONE, |w, _| {
            let tw = *w;
            *w *= &omega;
            Some(tw)
        })
        .collect();

    if log_n <= log_threads {
        let mut chunk = 2_usize;
        let mut twiddle_chunk = n / 2;
        for _ in 0..log_n {
            a.chunks_mut(chunk).for_each(|coeffs| {
                let (left, right) = coeffs.split_at_mut(chunk / 2);

                // case when twiddle factor is one
                let (a, left) = left.split_at_mut(1);
                let (b, right) = right.split_at_mut(1);
                let t = b[0];
                b[0] = a[0];
                a[0] += &t;
                b[0] -= &t;

                left.iter_mut()
                    .zip(right.iter_mut())
                    .enumerate()
                    .for_each(|(i, (a, b))| {
                        let mut t = *b;
                        t *= &twiddles[(i + 1) * twiddle_chunk];
                        *b = *a;
                        *a += &t;
                        *b -= &t;
                    });
            });
            chunk *= 2;
            twiddle_chunk /= 2;
        }
    } else {
        recursive_butterfly_arithmetic(a, n, 1, &twiddles)
    }
}

The dispatcher best_fft chooses between a serial recursive implementation and a parallel one based on log_n and the number of rayon threads. The recursion is in recursive_butterfly_arithmetic:

halo2_proofs/src/arithmetic.rs

/// This perform recursive butterfly arithmetic
pub fn recursive_butterfly_arithmetic<Scalar: Field, G: FftGroup<Scalar>>(
    a: &mut [G],
    n: usize,
    twiddle_chunk: usize,
    twiddles: &[Scalar],
) {
    if n == 2 {
        let t = a[1];
        a[1] = a[0];
        a[0] += &t;
        a[1] -= &t;
    } else {
        let (left, right) = a.split_at_mut(n / 2);
        multicore::join(
            || recursive_butterfly_arithmetic(left, n / 2, twiddle_chunk * 2, twiddles),
            || recursive_butterfly_arithmetic(right, n / 2, twiddle_chunk * 2, twiddles),
        );

        // case when twiddle factor is one
        let (a, left) = left.split_at_mut(1);
        let (b, right) = right.split_at_mut(1);
        let t = b[0];
        b[0] = a[0];
        a[0] += &t;
        b[0] -= &t;

        left.iter_mut()
            .zip(right.iter_mut())
            .enumerate()
            .for_each(|(i, (a, b))| {
                let mut t = *b;
                t *= &twiddles[(i + 1) * twiddle_chunk];
                *b = *a;
                *a += &t;
                *b -= &t;
            });
    }
}

The slice a is mutated in place; on entry it holds bit-reversed coefficients, on exit it holds evaluations in standard order.

3.4 Parallelization helper

halo2_proofs/src/arithmetic.rs
}

/// This simple utility function will parallelize an operation that is to be
/// performed over a mutable slice.
pub fn parallelize<T: Send, F: Fn(&mut [T], usize) + Send + Sync + Clone>(v: &mut [T], f: F) {
    let n = v.len();
    let num_threads = multicore::current_num_threads();
    let mut chunk = n / num_threads;
    if chunk < num_threads {
        chunk = n;
    }

    multicore::scope(|scope| {
        for (chunk_num, v) in v.chunks_mut(chunk).enumerate() {
            let f = f.clone();
            scope.spawn(move |_| {
                let start = chunk_num * chunk;
                f(v, start);
            });
        }
    });
}

The closure receives a mutable subslice and the global index of its first element. This is the workhorse for almost every parallel loop in the crate. Always prefer it to chunks_mut plus rayon::join because it respects the multicore feature flag.

3.5 Other small but load-bearing helpers

eval_polynomial: Horner's method for evaluating $\sum_i a_i x^i$ .
compute_inner_product: $\sum_i a_i b_i$ , used in the IPA prover.
kate_division: divide a polynomial by $X - b$ , used inside the multiopen prover.
lagrange_interpolate: given $(x_i, y_i)$ pairs, recover the polynomial. Used in tests and the IPA setup.

4. Failure modes

Wrong root of unity. If a caller passes the wrong $\omega$ to best_fft, the polynomial in/out conversion silently corrupts and every downstream commitment is wrong. The EvaluationDomain type (chapter 04) exists exactly to avoid this; never call best_fft directly outside poly/domain.rs unless you absolutely know what you are doing.
Edge case in best_multiexp for $m = 0$ . The function returns the curve identity. The change in #796 required adding the test test_create_proof precisely to lock that behaviour in.
Forgetting --features multicore. Without it, every parallel helper degrades to serial. The CI default has multicore enabled; benchmarks without it can be 8 to 16x slower.
Bit-reversal mismatch. The recursive FFT expects bit-reversed input; if you ever call it from a new code path, you must bit-reverse the input first (see how EvaluationDomain does it).

5. Spec pointers

The Pasta cycle and the choice of primes are documented in the pasta_curves README and in Daira Hopwood's explainer.
The Cooley-Tukey FFT is described in any algorithms textbook; for the field-theoretic NTT view, see chapter 30 of CLRS.
Pippenger's algorithm: Pippenger, "On the evaluation of powers and monomials", SIAM J. Comput. (1976). For a modern exposition in the SNARK setting, see Bernstein et al., eprint 2020/1145.

6. Exercises

Open halo2_proofs/src/arithmetic.rs and identify the function that chooses the Pippenger window size. Write down the formula it uses as a function of m.
Find the test at the bottom of halo2_proofs/src/arithmetic.rs that checks best_multiexp against small_multiexp and run it:
```
cargo test --release -p halo2_proofs arithmetic
```
Add an extra #[test] to halo2_proofs/src/arithmetic.rs verifying that best_multiexp on an empty input returns the curve identity. Run the test and confirm it passes.
Run the FFT benchmark:
```
cargo bench -p halo2_proofs --bench fft
```
Observe how the per-operation cost scales with log_n; the slope should be roughly linear (cost is $O(\log n)$ per element).

Answers in the code

Exercise 1: the window size selection is at the top of best_multiexp in halo2_proofs/src/arithmetic.rs.
Exercise 2: the matching test is named test_best_multiexp (or similar) inside the #[cfg(test)] mod at the bottom of the same file.

7. Further reading

The pasta_curves source for the curve implementations themselves.
The group and ff traits, which are the abstract interfaces halo2 codes against.

1. Why this chapter exists​

2. Definitions​

3. The code​

3.1 The module surface​

3.2 Multi-scalar multiplication​

3.3 The FFT​

3.4 Parallelization helper​

3.5 Other small but load-bearing helpers​

4. Failure modes​

5. Spec pointers​

6. Exercises​

Answers in the code​

7. Further reading​