# prime::namespace

Function templates

This documentation covers version 0.6.

### Contents

- Introduction: What this is and what it does
  - Fast Start Guide
  - Demo Program
- Functions
  - bool prime::isprime (T n)
  - T prime::nextprime (T n)
  - T prime::prevprime (T n)
  - std::vector&lt;mpz_class&gt; prime::primefactors (mpz_class n)
  - std::vector&lt;mpz_class&gt; prime::primefactors_st (mpz_class n)
  - std::vector&lt;mpz_class&gt; prime::primefactors_mt (mpz_class n)
  - Data Types
  - Compilation
- The Algorithms in Detail
  - The isprime Standard Algorithm
  - The isprime 32bit Optimization
  - The Single-Threading Prime Factorization
  - The Multi-Threading Prime Factorization
- Theoretic Considerations and Measurements
  - Testing up to \(\sqrt{n}\)
  - The Sequence of Identified Composites
  - CPU Time Distribution on Primes and Composites
  - For 32bit clang++ is faster than g++
- Remarks
  - Value Judgements applied to the 32bit Optimization
  - No Multi-Threading for 32bit
  - Why Char and Floating Point Types are not allowed
  - The Problem of Implicit Typecasting
  - The Problem of Function Overloading
  - The Solution: constexpr and &lt;type_traits&gt;
- Licence Information – GNU GPLv3

# Introduction: What this is and what it does

Learning C or C++ at some point means writing a program that tests numbers for being a prime or a composite; it is meant as an exercise to practice loops and to think about the efficiency of an algorithm. These function templates originate in that exercise, but go far beyond the idea of a beginner's exercise. Programming these function templates became first of all a sporting and intellectual challenge to push speed to the limit. That's the fun of programming for primes!

The second objective is to keep the speed machinery behind the curtain while maintaining hardware independence and offering a safe and easy-to-use interface to the programmer using these templates. That is why they are templates that accept all reasonable data types. The functions are declared within their own namespace named `prime::`, as the title suggests. So if you want speedy calculation of primes with multi-threading and prefer an easy C++ interface, this one is for you.

Hands-on!

## Fast Start Guide

**1. Get libgmp-dev.** These function templates, for the majority of their use cases, depend on the GNU Multiple Precision Arithmetic Library. It is much easier to just install it than to figure out whether you need it. Get it at www.gmplib.org. On a Linux machine you may ask your favourite package manager to do you this little favour. On Debian GNU/Linux Buster it is done as root with:

```
apt install libgmp-dev
```

**2. Download.** Get the prime::namespace function templates from berndneumann.eu and unpack the tar.gz into your source code's directory. The tar.gz will unpack its files into a new directory called `prime`.

**3. Include.** Start editing the program that you want to use these function templates in. Just include the header file for all functions of the prime::namespace function templates:

```
#include "prime/prime.hpp"
```

**4. Compilation.** You cannot go wrong with:

```
g++ -std=c++17 -Wall -O3 -pthread prog.cpp -lgmp -lgmpxx
```

Or using clang++, which for some functions makes the faster machine code:

```
clang++ -std=c++17 -Wall -O3 -pthread prog.cpp -lgmp -lgmpxx
```

Note that although the compilers' help pages tell us to give options before the filename, the arguments for the linker should be given after the filename.

See: https://c-faq.com/lib/libsearch.html.

Enjoy Coding!

## Demo Program

This demo program will quickly introduce you to the essential functions provided by these function templates. Furthermore, it gives you a minimal how-to for the C++ class interface of libgmp. See the code below or download it here to enjoy your favourite editor's syntax highlighting.

```
#include <iostream>
#include <iomanip>   // output formatting
#include <gmp.h>     // gmplib
#include <gmpxx.h>   // gmplib c++ class interface
#include <cstdint>   // UINT32_MAX and UINT64_MAX
#include <vector>    // for the return type of primefactors()
#include "prime/prime.hpp"

int main()
{
    // Please output booleans as "true" or "false".
    std::cout << std::boolalpha;

    // You first need an mpz_class object. This kind of equals
    // arbitrary long signed int n = 0;
    mpz_class n("0", 10);

    // You can handle it like any other integer.
    n++;
    n += 10;
    std::cout << "n=" << n;
    // outputs: n=11

    // Test if 11 is a prime number.
    bool isprime = prime::isprime(n);
    std::cout << " isprime: " << isprime;
    std::cout << std::endl;
    // outputs: true

    // Let's get to the larger ones.
    n = UINT64_MAX;
    std::cout << "n=" << n << std::flush;
    // outputs: n=18446744073709551615

    // This number is obviously divisible
    // by 5 and not a prime.
    std::cout << " isprime: " << prime::isprime(n);
    std::cout << std::endl;
    // outputs: false

    // Set n to the next higher prime.
    n = prime::nextprime(n);
    std::cout << "n=" << n << std::endl;
    // This took a moment.
    // Keep in mind: Be careful what you ask for.
    // outputs: n=18446744073709551629

    // To demonstrate prime factorization
    // set n to a prime close to 2^32.
    n = prime::nextprime( mpz_class(UINT32_MAX) );
    std::cout << "n=" << n << std::endl;
    // outputs: n=4294967311

    // Multiply n by the following 15 primes.
    mpz_class factor("1", 10);
    factor = n;
    for (int i = 0; i < 15; i++)
    {
        factor = prime::nextprime(factor);
        n *= factor;
        std::cout << (i+1) << "x " << std::flush;
    }
    std::cout << std::endl;
    std::cout << "n=" << n << std::endl;
    // outputs:
    // n=13407816439831635077991
    //   3324909833311228507869985
    //   7066650453828383306478623
    //   4024933102397559136732543
    //   5497559187560497607050751
    //   2755268107080415248789052
    //   1572291

    // Let's find the primefactors of n now!
    // You need a vector for mpz_class objects
    // to store the results.
    std::vector<mpz_class> primefactors;

    // Fill the vector with the prime factors of n.
    primefactors = prime::primefactors(n);

    // Output the vector of prime factors.
    std::cout << "The primefactors of n are:" << std::endl;
    for (unsigned int i = 0; i < primefactors.size(); i++)
    {
        std::cout << std::setw(3) << i;
        std::cout << std::setw(12) << primefactors[i];
        std::cout << std::endl;
    }
    // Thanks for watching!
    return 0;
}
```

# Functions

## bool prime::isprime (T n)

Returns true if \(n\) is a prime and false if \(n\) is not a prime. The term *not a prime* includes all composites, negative numbers, zero and one.

## T prime::nextprime (T n)

Returns the next larger prime number found after \(n\). The data type of the return value is the same as the data type of \(n\). An array of `char` is not allowed. If you need to call `prime::nextprime` for a number stored in an array of `char`, then the best you can do is to create an object of `mpz_class` to make the call:

```
char c_str[] = "5456";
mpz_class mpz_obj(c_str, 10);
std::cout << prime::nextprime(mpz_obj) << std::endl;
```

This will output 5471 as an object of `mpz_class`. See the documentation of the gmplib to eventually convert the result back to a number stored in an array of `char`.

## T prime::prevprime (T n)

Returns the next smaller prime number found before \(n\). The data type of the return value is the same as that of \(n\). An array of `char` is not allowed. See `prime::nextprime()` for an example of converting `char*` to `mpz_class`. If \(n\) is smaller than or equal to 2 the function will return 0.

## std::vector<mpz_class> prime::primefactors (mpz_class n)

Takes, by now, only an `mpz_class` object as an argument and accordingly returns a vector of `mpz_class` objects that are the prime factors. This function is a wrapper: if the argument is smaller than 1e12 it calls the single-threading algorithm, if it is larger it calls the multi-threading one. The returned vector is empty (`.size()==0`) if called with any number smaller than two.

## std::vector<mpz_class> prime::primefactors_st (mpz_class n)

The single-threading algorithm takes, by now, only an `mpz_class` object as an argument and accordingly returns a vector of `mpz_class` objects that are the prime factors. Unless you are sure that you want single-threading, which is slower than multi-threading for larger numbers, it is better to call the wrapper function `primefactors()`. The returned vector is empty (`.size()==0`) if called with any number smaller than two.

## std::vector<mpz_class> prime::primefactors_mt (mpz_class n)

The multi-threading algorithm takes, by now, only an `mpz_class` object as an argument and accordingly returns a vector of `mpz_class` objects that are the prime factors. Unless you are sure that you want multi-threading, which is slower than single-threading for smaller numbers, it is better to call the wrapper function `primefactors()`. The returned vector is empty (`.size()==0`) if called with any number smaller than two.

## Data Types

The function templates can be used with any of the following data types:

- `short`, `int`, `long int` and `long long int`, both `signed` and `unsigned`
- `std::string` (needs gmplib)
- `mpz_class` (needs gmplib)
- `char[]` (needs gmplib) (only `prime::isprime()`)

The `prime::primefactors()` functions are, by now, only implemented for use with `mpz_class` objects.

An *intended* compilation error will occur if you call any function with a char or any floating point data type. See the remark Why Char and Floating Point Types are not allowed to learn why.

A few hints to get faster code for your implementation:

- Use `unsigned` types for integrals. Any call with a `signed` integral type just adds a check that the number is negative, to return false instantly, and then safely casts the data type to an `unsigned long long int` type if the call was done with a positive number.
- If you do not need numbers larger than those that fit inside `uint32_t`, use the 32 bit data type. The optimized algorithm for these smaller numbers is much faster, where *much* means 50 times and more. Choose clang++ as your compiler here, because it compiles this optimized code better than g++.
- Prefer `mpz_class` objects to an array of `char` or `std::string`. The latter are just converted to call the functions for the `mpz_class` object, and the result is converted back to the respective calling data type.

## Compilation

Note that although the compilers' help pages tell us to give options before the filename, the arguments for the linker should be given after the filename.

See: https://c-faq.com/lib/libsearch.html.

`-std=c++17`: needed to evaluate `type_traits` and `constexpr` without compiler warnings.

`-pthread`: needed for all data types above 32 bit and for `char`, `std::string` and `mpz_class`.

`-lgmp -lgmpxx`: needed for `char`, `std::string` and `mpz_class`.

`-O3`: never needed, but probably wanted for faster machine code.

Performance: clang++ produces more than two times faster code for the 32bit optimization: 37 minutes vs. 14.5 minutes. There is no measurable difference for the other algorithms/data types.

# The Algorithms in Detail

In this section the main algorithms are described. We start with the algorithm for `isprime()` that uses multi-threading; it is referred to as the standard algorithm. In fact two slightly different versions of it are implemented, because an `mpz_class` object is used when the function can be called with a negative number. But apart from the fact that big numbers have to be treated differently in the core of the CPU, the algorithm is the same as the standard algorithm. If you call it with an integral primitive that has a size of up to 32 bit, the algorithm described in section The 32bit Optimization is called. It does not use multi-threading: it is so fast that spinning up a thread takes more time than the algorithm needs to finish with a single thread.

Next are the algorithms for finding the prime factors of a given number. Here we have a simple single-threading algorithm with small, easy and obvious code. It works great, and as the largest prime factor of a number determines the time we need to calculate, this algorithm is unbeaten for any number whose largest prime factor is quite small. But although there are infinitely many large numbers that have only small prime factors (think of 2 to the power of any large number), the great majority of larger numbers have at least one larger prime factor, and the larger the number, the larger the largest prime factor. Thus it must make sense to start multi-threading at some point. The multi-threading algorithm that does the job for the majority of large numbers is a bit tricky and so far the most complex programming problem I had to solve for these prime templates.

## The Standard Algorithm

The Standard Algorithm consists of two stages. In the first stage we focus on a rapid identification of composites. It follows the basic idea explained in section The Sequence of Identified Composites and makes hardcoded tests by the prime numbers from 3 up to 31:

```
if (n % 3 == 0) {
    if (n == 3) { return true; } else { return false; }
}
if (n % 5 == 0) {
    if (n == 5) { return true; } else { return false; }
}
// and so forth till
if (n % 31 == 0) {
    if (n == 31) { return true; } else { return false; }
}
```

These few tests are done within nanoseconds and they will identify almost 70 percent of all odd numbers as a composite. To do these tests with prime numbers up to 31 and not less or more is an arbitrary decision. Important within this first stage is that a »hit« leads to an immediate return statement. We do not spend any time to start threads and do not have to wait for joining them again.

The preparation of the multi-threading part consists of defining a boolean variable:

```
bool isprime=true;
```

This variable fulfills two purposes: 1. It is our return value. We initialize it as `true` and the threads' job is trying to find a divisor. If a thread finds a divisor it is set to `false`, and if no divisor is found it just stays true. 2. In the case that one thread has found a divisor, setting this variable to `false` signals to the other threads that they should stop working, so we can join them and return to the caller. This organization is simpler and thereby faster than any solution with a scheduler could be.

The preparation of multi-threading further includes calculating the truncated square root as the upper limit of the numbers that we have to test with. For the truncated square root as the end of the test interval see section Testing up to \(\sqrt{n}\). Calculating this square root only once and saving it in the variable `max_test` gives a small performance improvement. We must calculate it before we start the threads, because we do not want a data race.

Then we create an array of threads according to the output of the function `std::thread::hardware_concurrency()` and we are ready to go. We loop through the array of threads and start them with a lambda function. To that lambda we pass all values by reference except the number of the thread. We have no scheduler, so each thread must calculate on its own which tests are its job. Therefore it needs to know which thread it is. So all threads can read `max_test` and `isprime`, and even write to the latter to signal for stopping work.

Let me add: if we pass the thread's number by reference there is a data race that in most cases does no harm, because the thread has calculated its starting point before the iterator is incremented. But if not, the work of one thread is done twice and the work of the other is omitted. If this omitted work is necessary to identify a number as a composite, the algorithm identifies the composite as a prime. Many conditions must be fulfilled to actually show the undefined behavior. It was not funny to debug this terrible mistake.

How do the threads calculate what their job is? In the first stage we tested up to 31. The next prime is 37. That's where the zeroth thread has to start testing: 37=37+2\*0. The next thread starts at 39=37+2\*1, the second next at 41=37+2\*2 and so on. All threads iterate by `2*std::thread::hardware_concurrency()`. We multiply by 2 because we only test odd numbers. This way of iteration is chosen because smaller numbers have a higher probability of being a divisor. The threads do this in a while loop. Its condition is that `isprime` is true and that their test iterator is smaller than or equal to `max_test`.
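As a rough illustration, the division of labour can be sketched in a standalone function. This is my own simplified reconstruction, not the template code itself: the name `isprime_mt` is invented, the order of the cheap checks is simplified, and I use `std::atomic<bool>` to keep the sketch free of data races:

```cpp
#include <atomic>
#include <cmath>
#include <cstdint>
#include <thread>
#include <vector>

// Thread i tests the odd candidates 37 + 2*i, 37 + 2*(i + T), ...
// so that all T threads together cover every odd candidate up to
// max_test. The shared flag is both the return value and the signal
// for the other threads to stop working.
bool isprime_mt(uint64_t n) {
    if (n <= 1) return false;
    if (n == 2) return true;
    if (n % 2 == 0) return false;
    // First stage: hardcoded tests by the primes from 3 to 31.
    for (uint64_t p : {3, 5, 7, 11, 13, 17, 19, 23, 29, 31}) {
        if (n % p == 0) return n == p;
    }
    // Truncated square root as the upper test limit (max_test),
    // calculated once, before the threads start.
    uint64_t max_test = static_cast<uint64_t>(std::sqrt(static_cast<double>(n)));
    while (max_test * max_test > n) --max_test; // guard against rounding up
    std::atomic<bool> isprime{true};
    unsigned T = std::thread::hardware_concurrency();
    if (T == 0) T = 2; // hardware_concurrency() may return 0
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < T; ++i) {
        // The thread number i is captured by value, everything else
        // by reference (see the data-race remark above).
        threads.emplace_back([&, i]() {
            for (uint64_t d = 37 + 2 * i; d <= max_test && isprime; d += 2 * T) {
                if (n % d == 0) isprime = false;
            }
        });
    }
    for (auto& t : threads) t.join();
    return isprime;
}
```

Note how a found divisor flips `isprime`, which also terminates the loops of all other threads via the loop condition.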

While the threads are running, the main instance checks if the number we test is smaller than one (the `mpz_class` version does this before it calculates the square root, because in this case the tested number can still be negative) or if it is divisible by 2. This is done quite late, after starting the threads, because it is assumed to be a stupid question to ask whether a negative number or an even number is a prime. For later improvements, this may be the place in the main instance to add advanced math to quickly find out that a number is a composite.

To finish the function we join the threads and test if the number equals 2. This can only be done after joining the threads. In the end, of course, we return to the caller the value that is stored in the boolean `isprime`.

## The 32bit Optimization

Calculating primes is a matter of speed. If you call a function with a data type that is not larger than 32 bit, a special algorithm will be used. On my machine, an Intel i5-7200U, it responds about 44 times faster than the standard algorithm if the number is a prime and about 100 times faster if the number is a composite, making it about 55 times faster than the standard algorithm in total. Notice that we do not talk about *percent faster* but about *times faster*, so this optimization is probably worth your consideration whenever possible.

To explain how it works let's start with a simple question: What is, for a given number \(n\), the bare minimum that has to be done to get the result that it is a composite or a prime?

This question leads to the idea of the optimization. For some remarks on the term *optimization* see section Value Judgements applied to the 32bit Optimization. To find out that a number is a composite we need to find that it is divisible by another number. Analogous to the first stage of the standard algorithm, see section The Standard Algorithm, we put testing for even numbers in a later position and start with testing by 3. Because we do not know if \(n\) is 3 we must test that, too. Thus we start:

```
if (n == 3) { return true; }
if (n % 3 == 0) { return false; }
```

We continue to test only by prime numbers. This way we quickly catch a lot of composites, as shown by the sequence explained in section The Sequence of Identified Composites. Compared with the standard algorithm we make fewer tests, because we skip testing by composites like 9, 15, 21, 25 and so on. Further, we do not need the addition operation to increase the testing variable.

Catching composites quickly is half of the job done. We still need to identify that we have made enough tests to conclude that \(n\) is a prime. That is the case as soon as the square of the last prime we tested with is larger than \(n\). On this occasion we also, of course only once, test for even numbers and for \(n\) being smaller than or equal to 1. In the following snippet this is arbitrarily done after testing by the first 24 odd prime numbers:

```
if (n == 97) { return true; }
if (n % 97 == 0) { return false; }
if (n <= 1) { return false; }
if (n == 2) { return true; }
if (n % 2 == 0) { return false; }
if (9409 > n) { return true; }
```

Notice that we even do not need to calculate the square root of \(n\), because we have already precalculated the square of the last prime we tested with.

After the last line of this snippet did not trigger a return of `true`, we know that \(n\) is larger than 9409. So we continue testing without asking for equality:

```
if (n % 101 == 0) { return false; }
// (...)
if (n % 227 == 0) { return false; }
if (51529 > n) { return true; }
```

With such pieces of code we continue testing: several tests by primes, here 24 are chosen, to find composites, and one test to ask whether we have made enough tests. The algorithm closes with:

```
if (n % 65521 == 0) { return false; }
return true;
```

Notice that we strongly control how often CPU time is spent on asking whether we have made enough tests to conclude that \(n\) is a prime. We so force the CPU to concentrate on its work and not to constantly ask if it is done. This can, in comparison with the standard algorithm, not be done with a while loop that executes as long as the test iterator is smaller than or equal to the square root of \(n\). This is the optimized algorithm that does only the bare minimum of calculations needed to get the desired result.

I have to mention a drawback of this algorithm. The resulting machine code is larger, because we use hardcoded numbers instead of calculations wherever we can. This algorithm has a longer compile time, too; we talk about 8 seconds on my Intel i5-7200U. Thus we cannot continue to do the *bare minimum* forever. More and more hardcoded primes would soon lead to such a large executable binary that it is slowed down by memory access. To do this for all numbers inside `uint32_t` we need the primes inside `uint16_t`, and those are only 6542 prime numbers. That's not too many, and we can afford these few kilobytes in exchange for an extraordinarily fast algorithm.

Although this algorithm reasonably claims to do nothing more than the bare minimum, it cannot be denied that there still may be room for improvement. But from this point on any improvement probably means to make things faster for some numbers but slower for others. Thus the claim of improvement must answer the question *better in which sense?* This cannot be done in a mathematical way or with a stop watch, because it implies some kind of value judgement. This issue is further addressed in section Value Judgements applied to the 32bit Optimization.

## The Single-Threading Prime Factorization

Single-threading prime factorization is easy, and that's why it works very fast up to numbers whose largest prime factor does not exceed a certain size. What this size is depends on the hardware used. The algorithm is called with an integer \(n\) as an argument and returns a vector of integers that are the prime factors of \(n\). It starts by trying to divide \(n\) by two; as long as this is possible it divides \(n\) by two and writes a two into the result vector. When this is no longer possible, the algorithm tries to divide \(n\) by all odd numbers in ascending order, starting at three. Each time a division is possible a prime factor is found, which is written into the result vector, and \(n\) is divided by it. The algorithm continues until it has tried to divide by the integer square root of the \(n\) that is left after the previous divisions.

Due to those divisions \(n\) now either equals 1 or not. See these two examples to see what that means: Let our original \(n\) be 3\*13\*13=507. We can divide by 3, leaving us 13\*13=169. This we can divide twice by 13, so after these divisions our \(n\) has become 1. And that's it: we found all prime factors and there is nothing more to do. This happens if the largest prime factor occurs two or more times as a prime factor of \(n\). For the other case, consider that we must find the prime factors of 2\*2\*43=172. We can divide by 2, leaving us 86, again by 2, leaving us 43. We try to divide by 3 and by 5, and it does not work. But the integer square root of 43 is 6. We tested up to the integer square root and thus we know that 43 is a prime number (see section Testing up to \(\sqrt{n}\)). We do not need to test by 9, 11 and so on. So we must add the 43 to the results and we are done.
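As a sketch of the described algorithm over a plain 64 bit integer instead of `mpz_class` (the real function works on `mpz_class`; this standalone version with an invented name is only for illustration):

```cpp
#include <cstdint>
#include <vector>

// Trial-division prime factorization, single-threaded: strip the
// factors of 2, then test odd divisors in ascending order up to
// the integer square root of what is left of n.
std::vector<uint64_t> primefactors_st(uint64_t n) {
    std::vector<uint64_t> factors;
    if (n < 2) return factors; // empty vector, as in the templates
    while (n % 2 == 0) { factors.push_back(2); n /= 2; }
    // d*d <= n is the "tested up to the integer square root of the
    // n that is left" condition, without computing a square root.
    for (uint64_t d = 3; d * d <= n; d += 2) {
        while (n % d == 0) { factors.push_back(d); n /= d; }
    }
    if (n > 1) factors.push_back(n); // the remaining n is prime
    return factors;
}
```

Running it on the two examples from the text, 507 yields {3, 13, 13} (the remaining \(n\) is 1) and 172 yields {2, 2, 43} (the remaining 43 is appended as the last prime factor).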

## The Multi-Threading Prime Factorization

An algorithm that does the same job with multi-threading is a bit more complex and tricky. It must solve two major problems which, by the beauty of simplicity, do not occur in single-threading. One must understand these problems to see how they are solved by this multi-threading algorithm for prime factorization to achieve efficiency in terms of speed.

The first problem is the data race for composites. If we find a divisor it must be a prime number and never a composite. In single-threading this is always the case, but if we search for a divisor with multiple threads things can go wrong. A thread can find a composite as a divisor, because the threads responsible for the actual prime divisors were a little late at start or blocked by other uses of the system. Thus a multi-threading algorithm must always make sure that it does not take a composite as a divisor. This equals the condition that the divisor of the fastest thread must never be larger than the square of the divisor of the slowest thread (that is, the divisor up to which all possible divisors have already been tested): if we have tested by 2, 3, 5 and 7, and applied the divisions to the originally given number, we are safe to test up to 49. This condition must be fulfilled at all times.

There are at least two fundamental approaches to solve this problem. The first is to implement additional code, and there are plenty of ways to do so. Some solutions are better, some are worse, and those putting additional code into the main working loop (test divisibility, add to divisor) are probably among the latter. Anyway, all these solutions slow down the whole algorithm too much, whereupon *too much* means that single-threading keeps being faster even for numbers with larger prime factors on my Intel i5-7200U with 2 cores and 4 threads. It's easy to take a machine with plenty of cores and threads, waste resources, but still claim to have efficient multi-threading because it is faster than single-threading. This is not how we do it. The solution pursued here is to follow the lead of the single-threading algorithm: to organize in such a way that this problem can never occur and thus does not need to be solved with additional code.

The second problem is that of dividing the given number \(n\). We must take into account that the number we are asked to find the prime factors for is huge. As such it needs more or fewer CPU cycles to be tested for divisibility by a certain number. If we find prime factors the division will make \(n\) smaller and possibly take fewer limbs (as the gmplib calls them), so thereafter we will need fewer CPU cycles to process it for the upcoming divisibility tests. Thus we cannot afford to skip the division and do it later, but must apply found prime factors to the given number while still working on divisibility tests.

Even more important than speed is this problem's relation to the first one: Let the given \(n\) have the prime factors 11 and some huge number. If we skip the division by 11, \(n\) keeps being divisible by all powers of 11: 121, 1331 and so on. Thus the condition that must be fulfilled at all times extends to: The largest divisor already tested is not larger than the square of the divisor up to which all possible divisors have already been tested and applied to \(n\).

The core point of this second problem is that all threads continuously read the number for testing and we sometimes must write to it. This looks like a simple issue that calls for the `shared_mutex`. This solution works, but as it must be implemented in the main working loop it slows us down *too much*. Both problems combined can easily lead to a mess of code (I am guilty of that) that actually runs slower than the single-threading algorithm. We solve this by giving each thread its own copy of \(n\), to which it applies the divisors found by itself and those of the other threads. This is done in such a way that no thread can ever test a divisor that violates the condition that must be fulfilled at all times.

Now that these issues are understood let's start with a vivid explanation: Imagine the threads sit like smurfs with glasses at a round table. While they all take a seat and get out their little note pads, the main smurf quickly tries to divide the given number by all possible divisors between 2 and 31397. If he finds divisors, he applies them. This makes sure that in the first round of work no smurf can find a composite as a divisor. After he has finished he reads the remaining number to the ones around the table, where each writes down his own copy of this number. The smurfs around the table then calculate in parallel, each on a certain set of possible divisors. If one finds a divisor, he writes it on his note pad and divides his copy of \(n\) accordingly. Meanwhile a sheet of paper, the *global result* paper, is circulating from one to the next. As soon as one finishes his work he takes that global paper from his left neighbour or waits for it. He then copies his new results from his note pad to the global paper, applies all new results found by the others to his copy of \(n\) and finally passes the global results to the one on his right. Thereafter he calculates his next start and end for divisibility testing, thereby checking whether work is finished so he could exit or, if it is not, he goes back to work.

This may sound like it could lead to a cumulative block: all or many could be waiting for the global piece of paper while one is still working on divisibility tests and the global result paper is lying unused to his left. This could happen, yes. But actually this does not happen in any disadvantageous way. Tests have shown that if such a block happens, it is advantageous for the total performance. That's why this algorithm runs up to 3x faster than the single-threading one on my CPU with 2 cores and 4 threads.

For a detailed insight it is best to take a look at the explanatory comments in the code. For the purpose of this documentation I want to point out how the two problems mentioned are actually solved:

While the threads are spinning up we have a bit of time in the main() function before the threads start executing their respective code. We use this time to start testing for prime factors in single-threading mode, and order the threads to wait until this is finished, to keep control. In main() we test for divisors from 2 up to 31397, and of course divide \(n\) where we can and add the divisors found to the results. Besides making use of the time that threads need to spin up, this solves the first problem of accidentally finding a composite as a divisor: The first thread starts testing at the next prime after 31397, which is 31469, and its first work ends at 31469+100000. When the first thread goes to work the second time, it has applied its results from the first run. Its work is then safe up to the square of (31469+100000), so it is always safe. The second thread starts at 31469+100000+2 and ends at 31469+100000+2+100000. So if there are \(t\) threads in total, the last of them ends its first work at 31469 + 100000\*t + (t-1)\*2. We know that this is smaller than the square of 31397 if:

\[\begin{aligned} 31469 + 100000t + (t-1)\cdot 2 &< 31397^2\\ 31469 + 100000t + 2t - 2 &< 31397^2\\ t &< \frac{31397^2-31469+2}{100002}\\ t &\le 9857 \end{aligned}\]

So the algorithm is safe for all systems with up to 9857 threads, because no thread starts work where it can find a composite as a divisor. If we consider that the largest CPUs today have 128 threads, we can for now and the nearest future promise safety, unless one works in a cluster of many such CPUs. For this case the algorithm can easily be adjusted by increasing the amount of work done in single-threading mode. The price is to be a bit slower on numbers with smaller prime factors. But if you intend to run this code on anything that has more than 9857 threads, you probably plan something that will take a few hours. For that case I would recommend some changes to the code anyway, to optimize it for use in a cluster.

The second problem is the division of \(n\). During the single-threading phase, while the threads are spinning up, all divisors found are applied to \(n\). After that main() allows the threads to start working. The threads then read a copy of \(n\) into their scope right before they enter their work loop. After doing their amount of work the threads evaluate and apply results in a well defined order. They apply their own and the others' results to their own copy of \(n\) in such a way that no thread can ever find a composite as a result, while no other thread needs to stop its work. Only the result vector is locked, and the unlock is passed from one thread to the next in a distinct order. This is achieved without a mutex but with an integer: a thread waits in a loop for this integer to equal its respective number, then works on the result vector and unlocks by adding one to the integer. Only the last thread does not add one but sets the integer to zero for the first thread. This distinct order, which is quite similar to the single-threading algorithm, makes sure that nothing can go wrong. Thus no additional code is needed to catch if something has gone wrong. This comes at the price of having as many copies of \(n\) as we have threads, and thereby not looking slim and smart anymore. But if \(n\) fits into the L1 cache multiple times this in fact does not make anything fat. Consider that an unbelievably huge number like 2 to the power of 4096 occupies 513 bytes.

Furthermore, the distinct-sequence multi-threading makes sure that the prime factors are written in the same order as the single-threading algorithm would produce, so no sorting is required to ensure this. In the end this algorithm, which uses a distinct sequence to keep control of what happens and to solve the two problems mentioned, is much faster than any other I tried. I measured it to be up to 3x faster than single-threading, and I can hardly imagine that much more is possible on a 2 core/4 thread machine. I therefore claim that this is an efficient algorithm.

# Theoretic Considerations and Measurements

## Testing up to \(\sqrt{n}\)

We have a given natural number \(n\) and want to test whether it is a prime number. Assume it is a composite, i.e. the product of two factors \(s\) and \(l\) with \(s \le l\) so that \(n=sl\). We then know that:

$$\begin{align*} s^2~&\le~sl\\ sl~&\le~l^2 \end{align*}$$Because of \(n=sl\) we can say:

$$\begin{align*} s^2~\le~&n~\le~l^2\\ s~\le~&\sqrt{n}~\le~l \end{align*}$$Thus we have the statement that the maximum value of the smaller factor \(s\) is \(\sqrt{n}\). It can never be larger than the square root of \(n\). Further, it is enough to test up to the truncated integer of the square root of \(n\), written as \(\lfloor \sqrt{n}\rfloor\), because the next integer \(\lfloor \sqrt{n}\rfloor +1\) is always larger than \(\sqrt{n}\):

$$\begin{align*} \lfloor \sqrt{n}\rfloor \quad\le\quad \sqrt{n}\quad <\quad \lfloor \sqrt{n} \rfloor + 1 \end{align*}$$For the 32 bit optimization, see The 32bit Optimization, we further need to argue that it is sufficient to test up to the largest prime number smaller than or equal to \(\lfloor \sqrt{n}\rfloor\). Let \(\bar{p}\) be the largest prime number in the interval \([0; \lfloor \sqrt{n}\rfloor]\). The question is whether we have to test with any number in \(]\bar{p}; \lfloor \sqrt{n}\rfloor]\). Bear in mind that this interval can be empty, which is, for example, the case if \(\bar{p}=\lfloor \sqrt{n}\rfloor\). If it is not empty, the question arises whether we need to test with those numbers. But we know that those numbers are composites, because \(\bar{p}\) was the largest prime. As they are composites, any number \(n\) that is divisible by such a composite must have been found to be divisible earlier by testing with this composite's smallest prime factor, which lies in the interval \([0; \sqrt[4]{n}]\).

## The Sequence of Identified Composites

Dealing with prime numbers practically means dealing with odd positive numbers greater than one. Let us call this set of numbers \(\mathbb{T}\) and let \(n \in \mathbb{T}\). We want to test if \(n\) is a prime or a composite. Therefore we first test if \(n\) is divisible by 3. If so, we only have to catch the case that \(n\) equals 3, and we already have a result: true if \(n\) equals 3 and false if not. That's the very obvious part.

Let's ask a simple question now: For how many of all numbers in \(\mathbb{T}\) do we now have a result? That is obviously \(\frac{1}{3}\) of all numbers in \(\mathbb{T}\). For the other \(\frac{2}{3}\) we must continue testing with the second odd prime number. After testing by 5 we ask again: For how many numbers in \(\mathbb{T}\) do we now have a result? Of course \(\frac{1}{5}\) of all numbers in \(\mathbb{T}\) are divisible by 5, but \(\frac{1}{3}\) of those were already divisible by 3. We see that it is not that simple anymore, but we can simplify the question to give part of the answer. This simplification will lead to an intuitive understanding of the sequence we calculate here. The simplified question is: For how many *more* numbers in \(\mathbb{T}\) (after we already tested by 3) do we get a result by testing with 5?

This is \(\frac{1}{5}\) minus those of this \(\frac{1}{5}\) which we already identified as composites. Those are (for now) just the ones divisible by 3; in other words we have to subtract \(\frac{1}{3}\) of this \(\frac{1}{5}\). So testing by 5 gets us a result for an additional

$$\begin{align*} \frac{1}{5}~-~\frac{1}{3}\;\frac{1}{5}\quad=\quad\frac{2}{15} \end{align*}$$of all numbers in \(\mathbb{T}\). The answer to the main question is easy now. We add this (little) \(\frac{2}{15}\) to the \(\frac{1}{3}\) that we already have a result for. After testing by 3 and by 5 we thus have a result for

$$\begin{align*} \frac{1}{3}~+~\frac{2}{15}\quad=\quad\frac{7}{15} \end{align*}$$Next we test by 7. Simplified question: How many more do we get by testing by 7? That is \(\frac{1}{7}\) minus \(\frac{1}{7}\) of those \(\frac{7}{15}\) we already have. We add this to what we already have and get

$$\begin{align*} \frac{7}{15}~+~\left(\frac{1}{7}-\frac{1}{7}~\frac{7}{15}\right)\quad = \quad \frac{57}{105} \end{align*}$$We see how it works, so let us make some definitions. We say that \(\mathbb{P}\) is the set of all prime numbers, where 2 is the zeroth prime \(p_0\), so that 3 is the first and the first odd prime number \(p_1\). We now have the sequence \(s\) whose elements \(s_i\) are the fraction of all numbers in \(\mathbb{T}\) that we have a result for after testing with the \(i\)th (odd) prime number. The value of \(s_1\) equals \(\frac{1}{3}\) and the following elements are:

$$\begin{align*} s_i \quad = \quad s_{i-1}~+~\frac{1}{p_i}~-~\frac{1}{p_i}~s_{i-1} \end{align*}$$Now we can write a program to calculate these values or just use a spreadsheet. The following table shows the first odd primes, how much testing by each of them adds to the identified composites, and the cumulative sum. The fractions calculated above are displayed here as percentages of all odd numbers:

prime | adds (%) | sum (%) |
---|---|---|
3 | 33.3 | 33.3 |
5 | 13.3 | 46.7 |
7 | 7.6 | 54.3 |
11 | 4.2 | 58.4 |
13 | 3.2 | 61.6 |
17 | 2.3 | 63.9 |
19 | 1.9 | 65.8 |
23 | 1.5 | 67.3 |
29 | 1.1 | 68.4 |
31 | 1.0 | 69.4 |
37 | 0.8 | 70.3 |

So testing an odd number by just ten primes catches in almost 70 % of all cases that this number is a composite. This is exploited by the first stage of the `prime::isprime()` algorithm as described in section The Standard Algorithm, in which these tests are implemented in the fastest way possible.

## CPU Time Distribution on Primes and Composites

Scanning a given interval of numbers for primes means that we will spend time on composites and time on primes. We know, first, that it is easier to find that a number is a composite than to prove that a number is a prime, and second, that there are many more composites than primes. Let's ask a question out of pure curiosity: How much of its time does our cpu spend on finding composites, and how much does it spend on proving primes?

I wrote a short program that runs through an interval, calls `prime::isprime()` for odd numbers, and takes the time with `<chrono>` right before and right after the calculation. The important lines of code are:

```
time_start = std::chrono::steady_clock::now();
result = prime::isprime(number);
time_end = std::chrono::steady_clock::now();
```

The times needed for composites and for primes are summed separately. We then divide the total time needed for primes by the total time needed for primes and composites, so the resulting ratio expresses the share of the total time used for proving primes. The algorithm used was the standard algorithm, see section The Standard Algorithm, for numbers of the `unsigned long long int` data type.

The result is that this ratio increases, as expected, like a square root function, because the work needed to prove primes grows. It starts at about 56 percent, measured over the first 20 million odd numbers, and climbs to 78 percent, measured over the last 20 million odd numbers inside `uint32_t`. At the upper end of `uint64_t` the ratio reaches 91 percent, measured over the last 10,000 odd numbers inside `uint64_t`.

We can conclude that composites are not the bigger part of the problem. We know that some composite numbers are uncomfortable, because they are only divisible by a number close to their square root. The worst cases are the squares of prime numbers, because they fail at the last test we make. These uncomfortable composites are there but they are very rare as the sequence explained in section The Sequence of Identified Composites suggests. This measurement confirms what was to be expected and gives us a practical idea of what the CPU does. This also gives us a hint for future improvements.

## For 32bit clang++ is faster than g++

The code for the 32bit Optimization is fast because it is simple and does just the bare minimum. Hence we would expect that the chosen compiler makes no difference, and it is reasonable to expect that both produce at least very similar if not identical assembly code. The opposite is the case, and it makes a huge difference in total computation speed! The following table shows the time needed (in seconds) and the binary size (in bytes) with various compiler settings to find the uppermost 1000000 primes in `uint32_t`.

compiler | -O0 | -O1 | -O2 | -O3 |
---|---|---|---|---|
`g++` | 30.3 s / 300008 B | 15.6 s / 164688 B | 15.6 s / 148304 B | 15.7 s / 148304 B |
`clang++` | 26.9 s / 217976 B | 6.2 s / 222496 B | 6.1 s / 222168 B | 6.1 s / 222168 B |

So clang++ is more than twice as fast here as g++. I had never expected such a result, and so far I have no idea what makes the difference.

# Remarks

## Optimization and Value Judgement

When we say that we have optimized something, we claim that we have made an improvement, that we have made something *better*. For an algorithm, *better* usually means *faster*, but it can also mean easier, shorter, or more platform independent, or it can ensure a more intuitive user interface. In the context of prime numbers, and with a sufficiently defined and fulfilled *easy to use* condition, it is all about speed. But *speed* is still ambiguous. We have to decide which is the *something* that we have made better by making it faster.

This question is only of analytical interest as long as the improvement of *something* can be achieved without significant disadvantages for something else. This is probably the case for improvements at an early stage of development. But at some point, improving something means making something else worse. The improvement can still be an improvement, but for this evaluation we have to weigh the gains of one against the losses of the other. Optimization implies a *value judgement* that cannot be clearly made by calculation and measurement, but must be argued and decided.

The same applies to the hardware that is used and the environment the code is running in. Regarding the effectiveness of multithreading, it can make a difference whether the code runs on a cpu with a few fast cores or on a cpu with many slower cores. Regarding code size and memory usage, it may make a difference whether the code is running on a dedicated machine or on a heavily used one. When comparing a first, buggy but (at last) running version of a program with a more advanced one, the latter will probably run better on every hardware and in all environments. But at some point, again, any further improvement means faster code for some hardware and environments while it is disadvantageous for others, and a value judgement is needed to weigh the advantages and disadvantages.

The decision based on such a value judgement is practically made by choosing a benchmark test that measures, and puts into numbers, whether there is an improvement or not. This quantification should not veil that this number is not a measurement as in physics but expresses a value judgement.

## Value Judgements applied to the 32bit Optimization

These general considerations about value judgements from the previous section apply in the context of the 32 bit optimization as described in The 32bit Optimization. The initially followed principle of optimization is to minimize the number of performed calculations. This is benchmarked by measuring the time needed to test all odd numbers inside `uint32_t` for primality. Fewer calculations need less time. Sounds reasonable, but there is a value judgement involved. It becomes more obvious if we contrast it with an alternative. To measure the efficiency of the 32bit Optimization we could say:

- Minimize the time needed to find all primes in `uint32_t`.
- Minimize the maximal time needed to respond to any number in `uint32_t`.

In the first case we value each nanosecond equally, no matter whether it is one of the few nanoseconds (in most cases below 100 ns) spent on an easy to find composite or one of those used to prove a prime at the upper end of `uint32_t`, which takes a bit less than 10000 ns on my Intel i5-7200U. In the second case we do not value every nanosecond equally but only those of the slower responses. – Neither of these two principles of optimization is better or worse, or more or less correct, than the other unless we imply a value judgement. This is the point of value judgements.

## No Multi-Threading for 32bit

Suppose the code is fully optimized in the sense that the number of computations is minimized, so that in a single-core environment we can no longer reduce the time needed. On any modern system with more than one core the question inevitably arises whether multithreading could be an improvement.

This, we can say, will (almost) never be the case. Measurements have shown that starting a thread takes on average 8000 ns, so it makes no sense to start a thread to speed up something that in total takes roughly 10000 ns.

## Why Char and Floating Point Types are not allowed

If you call `prime::isprime()` or a function related to it with any floating point type or with any char type, the compilation will *intentionally* fail. One may argue that char could be a valid data type to ask whether a value is prime. But this only works for numbers up to 255, so it is not a reasonable idea. While a `char` is actually an integer value, like the prime number 71, it usually represents a letter: the 71 represents the upper case letter 'G', and it is pointless to say that 'G' is a prime. Even worse is a char representing a '7' while the underlying integer is 55. This would lead to the confusing result that '7', with the underlying integer 55, would be reported as a composite, because 55 is divisible by 5.

But more important to me was a practical reason: I explicitly forbid this data type and force the compilation to fail, because I assume you wanted to call `prime::isprime()` with a pointer to an array of chars and forgot to write the indirection operator (*).

Floats are not allowed because dealing with primes means dealing with natural numbers. To ask *what is the next prime after 5.01?* could eventually make sense, while it makes just as much sense to exclude all floating point types here. In the end it is a question of preference and decision. I chose to exclude floats for simplicity.

It is of course still possible to do things with floating point data types if you are willing to deal with the demons you are calling. For small numbers a simple type cast will work. To be precise: for 32 bit floating point variables according to the usual IEEE 754 standard (Wikipedia: IEEE 754) this works up to \(2^{24} = 16777216\). From this number upwards a 32 bit float can only represent even integers. As all primes but 2 are odd, there will be no more primes representable. Game over for floats.

## The Problem of Implicit Typecasting

Let us have a single isprime function declared as:

```
bool isprime(unsigned long long int);
```

We implicitly assume here that any user will call it with positive numbers only. Everything is fine as long as the user calls this function with any unsigned integral type: an `unsigned int` is implicitly converted to `unsigned long long int` and causes no problems. If the user calls our function with a signed integral type, there is still no problem as long as its value is positive.

But if the function is called with a signed integral that holds a negative value, the drama starts. Let the user call our function with a `signed long long int` that is assigned the value -59. We expect an isprime function to quickly return the answer that -59 is not a prime number. But by implicit conversion to `unsigned long long int` the value changes to \(2^{64}-59\), i.e. `ULLONG_MAX - 58`, which happens to be the largest prime number representable in 64 bit unsigned integrals. So the function will calculate for a while and then state that -59 is a prime. We do not want that.

## The Problem of Function Overloading

If we look at the trouble that implicit type conversion can cause, we may feel like offering the compiler an alternative by function overloading. While the return type is always a `bool`, overloading works great as long as the argument is a `char[]`, a `std::string`, or an object of the `mpz_class` from the gmplib. But to solve the problem of negative integral types from the previous section we would need to overload like:

```
bool isprime(unsigned long long int);
bool isprime(signed long long int);
```

A program calling this overloaded function with `isprime(7)` or `isprime(-7)`, unfortunately, does not compile, because the overload is ambiguous. One can work around this ambiguity by casting -7 explicitly to an exact match:

```
isprime( static_cast<signed long long int>(-7) );
```

This compiles and calls the correct function but thereby we put the problem described in the previous section back to the caller. We didn't solve it.

## The Solution: `constexpr` and `<type_traits>`

In C++11 the keyword `constexpr` was introduced (see cppreference.com: constexpr), and `<type_traits>` was added to the standard library (see cppreference.com: type_traits). The traits allow us to test a type for its properties. C++17 then added `if constexpr`, which lets the compiler evaluate such a condition at compile time and discard the untaken branch; that is why we need the `-std=c++17` compiler option. Both together make the magic: 1. We can distinguish between any signed and unsigned integral type. This was the underlying core problem that could be solved neither with implicit conversion nor with overloading, as explained in the previous two sections. 2. The evaluation is done at compile time, and the code of a branch whose `constexpr` condition does not apply can be ignored.

On my system only g++ does this job, while clang++ compiles the whole code even if it is not needed. Make sure to call the compiler with the `-O3` option. I tried this out with a test program that `cout`s a little if the function is called with an unsigned integer type and `cout`s a lot if it is called with a signed one; the size of the resulting binaries gives the answer. For details about compiler advantages and disadvantages see section Compilation. This code snippet shows how the issue has been solved:

```
template<typename T>
bool isprime(const T n)
{
    if constexpr (std::is_integral<T>::value)
    {
        if constexpr (std::is_signed<T>::value)
        {
            if(n <= 1){return false;}
            return isprime( static_cast<unsigned long long int>(n) );
        }
        if constexpr (std::is_unsigned<T>::value)
        {
            // algorithm for positive n here
        }
    }
}
```

This way one can safely call `prime::isprime()` with any variable that has been declared as a signed integer type. Negative values get a fast answer, and positive values can safely be converted (and possibly promoted) to `unsigned long long int`. Now even a questionable call to the isprime function will not evoke a wrong answer.

# Licence Information – GNU GPLv3

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

See http://www.gnu.org/licenses/ for details.