


How can I optimize my Number Theoretic Transform (NTT) and modular arithmetic for faster computation, especially with very large numbers (e.g., over 12000 bits)?
Dec 16, 2024 am 03:13 AMModular arithmetics and NTT (finite field DFT) optimizations
Problem Statement
I wanted to use NTT for fast squaring (see Fast bignum square computation), but the result is slow even for really big numbers .. more than 12000 bits.
So my question is:
- Is there a way to optimize my NTT transform? I did not mean to speed it by parallelism (threads); this is low-level layer only.
- Is there a way to speed up my modular arithmetics?
This is my (already optimized) source code in C for NTT (it's complete and 100% working in C whitout any need for third-party libs and should also be thread-safe. Beware the source array is used as a temporary!!!, Also it cannot transform the array to itself).
Optimized Solution
- Using Precomputed Powers: Precompute and store the powers of W and iW (the primitive root of unity and its inverse) to avoid recalculating them during the NTT process. This can significantly reduce the number of multiplications and divisions, leading to faster computations.
- Unrolling Loops: Unroll the loops in the NTT algorithm to reduce the overhead associated with loop iterations. This can improve performance by reducing the number of branch instructions.
- Optimizing Modular Arithmetic: Use bitwise operations and assembly language to implement modular arithmetic operations (addition, subtraction, multiplication, and exponentiation) efficiently. This can eliminate unnecessary branching and conditional statements, resulting in faster execution.
Example Implementation
Here's an example of an optimized NTT implementation in C using precomputed powers and bitwise operations:
class NTT { public: NTT() { // Initialize constants p = 0xc0000001; W = modpow(2, 0x30000000 / n); iW = modpow(2, p - 1 - 0x30000000 / n); rN = modpow(n, p - 2); NN = n >> 1; // Precompute W and iW powers WW = new uint32_t[n]; iWW = new uint32_t[n]; WW[0] = 1; iWW[0] = 1; for (uint32_t i = 1; i < n; i++) { WW[i] = modmul(WW[i - 1], W); iWW[i] = modmul(iWW[i - 1], iW); } } void NTT(uint32_t *dst, uint32_t *src, uint32_t n) { if (n > 0) { // Reorder even, odd elements for (uint32_t i = 0, j = 0; i < NN; i++, j += 2) { dst[i] = src[j]; } for (j = 1; i < n; i++, j += 2) { dst[i] = src[j]; } // Recursive NTT NTT(src, dst, NN); // Even NTT(src + NN, dst + NN, NN); // Odd // Restore results for (uint32_t i = 0, j = NN; i < NN; i++, j++) { uint32_t a0 = src[i]; uint32_t a1 = modmul(src[j], WW[i]); dst[i] = modadd(a0, a1); dst[j] = modsub(a0, a1); } } } private: uint32_t p, n, NN, W, iW, rN; uint32_t *WW, *iWW; // Modular arithmetic operations inline uint32_t modadd(uint32_t a, uint32_t b) { uint32_t d = a + b; if (d >= p) d -= p; return d; } inline uint32_t modsub(uint32_t a, uint32_t b) { uint32_t d = a - b; if (d > a) d += p; return d; } inline uint32_t modmul(uint32_t a, uint32_t b) { uint32_t m = (uint64_t)a * b; return m - (p * (m / p)); } inline uint32_t modpow(uint32_t a, uint32_t b) { if (b == 0) return 1; uint32_t t = modpow(a, b / 2); t = modmul(t, t); if (b & 1) t = modmul(t, a); return t; } };
Additional Tips
- Use a higher-level language that supports bitwise operations and inline assembly, such as C .
- Use a profiler to identify the bottlenecks in your code and target them for optimization.
- Consider parallelizing the NTT algorithm using multiple threads or SIMD instructions.
The above is the detailed content of How can I optimize my Number Theoretic Transform (NTT) and modular arithmetic for faster computation, especially with very large numbers (e.g., over 12000 bits)?. For more information, please follow other related articles on the PHP Chinese website!

Hot AI Tools

Undress AI Tool
Undress images for free

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Clothoff.io
AI clothes remover

Video Face Swap
Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Hot Tools

Notepad++7.3.1
Easy-to-use and free code editor

SublimeText3 Chinese version
Chinese version, very easy to use

Zend Studio 13.0.1
Powerful PHP integrated development environment

Dreamweaver CS6
Visual web development tools

SublimeText3 Mac version
God-level code editing software (SublimeText3)

std::chrono is used in C to process time, including obtaining the current time, measuring execution time, operation time point and duration, and formatting analysis time. 1. Use std::chrono::system_clock::now() to obtain the current time, which can be converted into a readable string, but the system clock may not be monotonous; 2. Use std::chrono::steady_clock to measure the execution time to ensure monotony, and convert it into milliseconds, seconds and other units through duration_cast; 3. Time point (time_point) and duration (duration) can be interoperable, but attention should be paid to unit compatibility and clock epoch (epoch)

volatile tells the compiler that the value of the variable may change at any time, preventing the compiler from optimizing access. 1. Used for hardware registers, signal handlers, or shared variables between threads (but modern C recommends std::atomic). 2. Each access is directly read and write memory instead of cached to registers. 3. It does not provide atomicity or thread safety, and only ensures that the compiler does not optimize read and write. 4. Constantly, the two are sometimes used in combination to represent read-only but externally modifyable variables. 5. It cannot replace mutexes or atomic operations, and excessive use will affect performance.

There are mainly the following methods to obtain stack traces in C: 1. Use backtrace and backtrace_symbols functions on Linux platform. By including obtaining the call stack and printing symbol information, the -rdynamic parameter needs to be added when compiling; 2. Use CaptureStackBackTrace function on Windows platform, and you need to link DbgHelp.lib and rely on PDB file to parse the function name; 3. Use third-party libraries such as GoogleBreakpad or Boost.Stacktrace to cross-platform and simplify stack capture operations; 4. In exception handling, combine the above methods to automatically output stack information in catch blocks

In C, the POD (PlainOldData) type refers to a type with a simple structure and compatible with C language data processing. It needs to meet two conditions: it has ordinary copy semantics, which can be copied by memcpy; it has a standard layout and the memory structure is predictable. Specific requirements include: all non-static members are public, no user-defined constructors or destructors, no virtual functions or base classes, and all non-static members themselves are PODs. For example structPoint{intx;inty;} is POD. Its uses include binary I/O, C interoperability, performance optimization, etc. You can check whether the type is POD through std::is_pod, but it is recommended to use std::is_trivia after C 11.

To call Python code in C, you must first initialize the interpreter, and then you can achieve interaction by executing strings, files, or calling specific functions. 1. Initialize the interpreter with Py_Initialize() and close it with Py_Finalize(); 2. Execute string code or PyRun_SimpleFile with PyRun_SimpleFile; 3. Import modules through PyImport_ImportModule, get the function through PyObject_GetAttrString, construct parameters of Py_BuildValue, call the function and process return

FunctionhidinginC occurswhenaderivedclassdefinesafunctionwiththesamenameasabaseclassfunction,makingthebaseversioninaccessiblethroughthederivedclass.Thishappenswhenthebasefunctionisn’tvirtualorsignaturesdon’tmatchforoverriding,andnousingdeclarationis

In C, there are three main ways to pass functions as parameters: using function pointers, std::function and Lambda expressions, and template generics. 1. Function pointers are the most basic method, suitable for simple scenarios or C interface compatible, but poor readability; 2. Std::function combined with Lambda expressions is a recommended method in modern C, supporting a variety of callable objects and being type-safe; 3. Template generic methods are the most flexible, suitable for library code or general logic, but may increase the compilation time and code volume. Lambdas that capture the context must be passed through std::function or template and cannot be converted directly into function pointers.

AnullpointerinC isaspecialvalueindicatingthatapointerdoesnotpointtoanyvalidmemorylocation,anditisusedtosafelymanageandcheckpointersbeforedereferencing.1.BeforeC 11,0orNULLwasused,butnownullptrispreferredforclarityandtypesafety.2.Usingnullpointershe
