The sake of this post is going to be steps I've personally learned to take in optimizing a function, so please feel free to fill me in on something I don't know, add further comments or ask questions. I'm still an aspiring programmer, and thought it would be cool to have a "full fledged" math library that I could tie into my game engine that's cross-platform in nature.  As such, I thought if I ever needed my own data type, it would be useful to be able to perform mathematical operations on it, which means I might need my own math library; of course I could try converting to a format supported for the hardware, as it'd typically be faster, but at this point I'm more concerned about learning what goes on behind the math library.  And for whatever reason, if you needed a higher precision, then you'd need a custom math library anyway.

So, recently I started looking up how to program a sin function; I didn't find much, but I did find a Wikipedia article that detailed a mathematical model for sin, which you can find here: Trigonometric functions - Series definitions.  You can view the function below:

Now I'm going to write this function in C++, and the main question we're looking at here is: How do I go about optimizing a function?(Not so much concerned about language semantics as basic fundamentals you should try to apply in just about any language) Here's the function quickly typed up, notice I chose to write it as an iterative function than as a recursive function.  A recursive function will always be slower than an iterative function, which are not the details of this post, but it has to do with the stack and storing return variables and unwrapping the stack, etc.

  1. Try to always write your performance function as an iterative function rather than a recursive one.
// size_t n, is the number of loops; the higher number of loops, the higher the precision
template <class T>
inline T MathUtil<T>::Sin(const T& val, const size_t n)
{
	T num = val;

	for(size_t i = 1; i <= n; ++i)
	{
		size_t odd = 2 * i + 1;

		// Factorial is written elsewhere
		num += powf(-1, i) * powf(val, odd) / Factorial(odd);
	}

	return num;
}

Which you could use like this:

double result = Sin<>(1.0,12);
// or:
result = Sin<double>(1,12);

So this is a very simple function, and in release mode runs close to what the real thing does.  Here's the results in debug: my sin: 3.57916ms std sin: 0.346908ms   If you look back to the image, you'll see (-1)^n.  This function merely alternates the addition and subtraction of our function, which can be simplified to this:

// size_t n, is the number of loops; the higher number of loops, the higher the precision
template <class T>
inline T MathUtil<T>::Sin(const T& val, const size_t n)
{
	T num = val;

	for(size_t i = 1; i <= n; ++i)
	{
		size_t odd = 2 * i + 1;

		// Factorial is written elsewhere
		if(i % 2 == 0)
		{
			num += powf(val, odd) / MathUtil::Factorial(odd);
		}
		else
		{
			num -= powf(val, odd) / MathUtil::Factorial(odd);
		}
	}

	return num;
}

This results in new times (remember, times will vary): my sin: 2.68451ms std sin: 0.348312ms Notice this does help our performance, though it's frankly a minimal change.  There are a few more things we can simplify out that are crucial, we're redoing the factorial calculations, as well as the power calculations every loop.  We can add these operations up through the loop.  Look back to the image, you'll notice as we through the loop, we'll start with 1!, then 3!, then 5!.  We're iterating through odd numbers, and we can compound the operations as 5! contains 3!: 5 * 4 * 3!.  3! contains 1!: 3 * 2 * 1!.  So we just need to have a variable to store the factor total, and the number we're currently on; multiply the stored total by the number - 1 and the number, as shown before.  Again, if we're in the loop at number 7, then we have 5! stored from the previous loop, multiply that by 6 then by 7.  In the next loop 7! x 8 then by 9.  Etc.  We can also do this with the odd exponent.  We now have the new function below.

// size_t n, is the number of loops; the higher number of loops, the higher the precision
template <class T>
inline T MathUtil<T>::Sin(const T& val, const size_t n)
{
	T num = val;

	size_t factNum = 3;
	size_t factTot = 1;

	T mult = val;

	for(size_t i = 1; i <= n; ++i, factNum += 2)
	{
		mult *= -val * val;

		factTot *= factNum - 1;
		factTot *= factNum;

		num += mult / static_cast<T>(factTot);
	}

	return num;
}

So we got rid of the factorial and power functions all together. So this is something you want to keep in mind when trying to make functions faster, try to minimize your function calls as much as possible, I chose "val * val" instead of "pow(val,2)" for this reason. You remember how we have to alternate adding and subtracting over the odd numbers? I do this by negating val and storing it in the variable mult which will flip sign every loop, doing just what I wanted; this is better than the first two solutions. Here's the results: my sin: 0.586781ms std sin: 0.351613ms

  1. Minimize function calls as much as possible, get rid of any that you can.  

And as we already started to do...

  1. Try to simplify the mathematical operations, using any tricks based off of patterns and mathematical properties you notice.

Anything else we can do? Look back at the image of our functions and refer to some of the variables we're now using. We're trying to simplify the exponential and factorial calculations using a factorial total and mult for the exponential totals. As we're using multiplication, we can concatenate the results from an earlier part of the equation and use it for a later one so we have to do less calculations.

// size_t n, is the number of loops; the higher number of loops, the higher the precision
template <class T>
T Sin(const T val, const size_t n)
{
	const T val2 = -val * val;

	T num = val;

	T mult = val;

	size_t factTot = 1;

	const size_t odd = 2 * n + 1;
	for(size_t f = 3; f <= odd; f += 2)
	{
		mult *= val2;

		factTot *= f - 1;
		factTot *= f;

		num += mult / static_cast<T>(factTot);
	}

	return num;
}

I'm still looking into other ways to improve this function, but this is all of what I have thus far.  Notice I moved "-val * val" to it's own variable so it's doing one less negation and multiplication every loop of the algorithm; it doesn't affect t he performance that much, but it's something that seems to me should be done.  I'm currently reading this paper on Optimizing C++, which I highly suggest you give it a read, as well as consider reading this site on Writing Efficient C.  You'll notice that in Optimizing C++ they mention that it's a performance hit to switch between integers and floats, however, the extra time in the algorithm to calculate factTot and f as the float calculations override that performance concern, so what I have above is the best way from what I've tried, by casting factTot to double, or whatever type you're using, and dividing it by mult. The following is the test code I used to generate my results:

Benchmark std = Benchmark();
for(int i = 0; i < 100000; ++i)
{
	std.Start();
	double result6 = sin(1.0);
	std.Stop();
}
Sleep(50);
for(int i = 0; i < 100000; ++i)
{
	std.Start();
	double result6 = sin(1.0);
	std.Stop();
}
Sleep(100);
for(int i = 0; i < 100000; ++i)
{
	std.Start();
	double result2 = sin(1.0);
	std.Stop();
}

sTrace(make_string() << "std    Sin:" << std.Get_AvgMicroSeconds() << "ms");
sTrace(make_string() << "std MinSin:" << std.Get_MinMicroSeconds() << "ms");
sTrace(make_string() << "std MaxSin:" << std.Get_MaxMicroSeconds() << "ms");

Benchmark mys = Benchmark();
for(int i = 0; i < 100000; ++i)
{
	mys.Start();
	double result6 = Sin<double>(1,6);
	mys.Stop();
}
Sleep(50);
for(int i = 0; i < 100000; ++i)
{
	mys.Start();
	double result6 = Sin<double>(1,6);
	mys.Stop();
}
Sleep(100);
for(int i = 0; i < 100000; ++i)
{
	mys.Start();
	double result6 = Sin<double>(1,6);
	mys.Stop();
}

sTrace(make_string() << "my    Sin:" << mys.Get_AvgMicroSeconds() << "ms");
sTrace(make_string() << "my MinSin:" << mys.Get_MinMicroSeconds() << "ms");
sTrace(make_string() << "my MaxSin:" << mys.Get_MaxMicroSeconds() << "ms");

What the results show you is the average run-time, the minimum run-time, and the maximum run-time during my benchmarks.  Here is my first set of results in debug x64:

std    Sin:0.15552ms
std MinSin:0ms
std MaxSin:61.4676ms

my    Sin:0.194272ms
my MinSin:0ms
my MaxSin:61.4675ms

Here is my second set of results in debug mode x64:

std    Sin:0.154726ms
std MinSin:0ms
std MaxSin:48.567ms

my    Sin:0.192029ms
my MinSin:0ms
my MaxSin:55.7761ms

Here is a result from a release build x64:

std    Sin:0.0923657ms
std MinSin:0ms
std MaxSin:48.9464ms

my    Sin:0.0920976ms
my MinSin:0ms
my MaxSin:24.2835ms

Here the next result from the same release build x64:

std    Sin:0.0919509ms
std MinSin:0ms
std MaxSin:4.55315ms

my    Sin:0.0920293ms
my MinSin:0ms
my MaxSin:10.624ms

Result from release build x32:

std    Sin:0.190925ms
std MinSin:0ms
std MaxSin:184.78ms

my    Sin:0.187989ms
my MinSin:0ms
my MaxSin:173.777ms

Result from the same release build x32:

std    Sin:0.183714ms
std MinSin:0ms
std MaxSin:25.4215ms

my    Sin:0.183407ms
my MinSin:0ms
my MaxSin:77.0233ms

Just as a disclaimer, for those of you who don't know, these kind of benchmarks aren't an end-all be-all of determining what is or isn't better.  Also, my benchmark util has it's own overhead so your own results may be different if you compare algorithms, though the results should be comparatively similar.  And, obviously these functions didn't take 0ms, it has to do with the accuracy of the timer, which uses the Windows function QueryPerformanceCounter and QueryPerformanceFrequency, which I've read are accurate only to 1ms and these measurements are smaller than that.  I'm running an Intel Core i7 920 and during these tests was clocked at just 2.8322GHz. It seems this algorithm runs close enough to the standard solution, however, keep in mind that depending on the hardware you're using that the standard function could be calculated using hardware and could be even faster; unless of course my build is already doing that, then I'd be impressed at how well this algorithm is running.  You can also use this algorithm with smaller precision Sin<double>(1,3) instead of Sin<double>(1,6).  In case you couldn't gather, the first number is the number we're taking the sin of; and the second number represents the amount of looping being done to increase the precision, so the higher the number the higher the precision.  Realize however, there is a point where these floating point calculations won't be accurate due to the limitations of double and float.  There should be a function to calculate the level of precision necessary for each data type for this algorithm, I just don't have it at my fingertips at the moment.  Frankly these numbers start to not mean too much as it's hard to get accurate results, but this was more of an academic goal than anything, so it's up to you if you'd find something like this useful.  Honestly, this endeavor was so I could learn how to program a sin and cos function so I can program a compile time version which has no run time costs; it has limited uses, but is still something that is cool and useful.  I'll detail this in a later post, but for now, here's the cos version of this algorithm, which builds off of even numbers instead of odd numbers.

// size_t n, is the number of loops; the higher number of loops, the higher the precision
template <class T>
T Cos(const T val, const size_t n)
{
	T cosResult = 1;

	size_t factorialTotal = 1;

	T multiplicationTotal = 1;

	const T val2 = -val * val;

	const size_t even = 2 * n;
	for(size_t f = 2; f <= even; f += 2)
	{
		multiplicationTotal *= val2;

		factorialTotal *= f - 1;
		factorialTotal *= f;

		cosResult += multiplicationTotal / static_cast<T>(factorialTotal);
	}

	return num;
}

This is used similarly to the sin function, Cos(1,6) or Cos<>(1.0,6).   I have done everything I sought out to do with this post, and I'll continue to experiment and see if I come up with anything else while doing more reading on C++ optimization, thinking if there are other algorithmic shortcuts I can make (Yes I realize there wasn't tooo much optimization I did here, it was merely doing what I needed to, to reduce the factorial calculations and other multiplication.  Also, again, these basic steps for optimization are general things to keep in mind. ).  I'll post updates when I find something, or if you have something you'd like to contribute please feel free, this is a good learning opportunity for all of us! (: [UPDATE - 11/9/2011] I forgot how large factorial results could get, and recently realized that as I'm compiling for Win32, size_t won't get me far for storing said results.  So, rather than just changing it to a long long or something, I'm changing it to be the same datatype as specified by the template, even if it means a little calculation cost time of not using integers.  Actually, if I used a custom datatype with a large precision, I would need to do this anyway, otherwise the function wouldn't be very precise. Anyway, this update comes as I somehow stumbled across this site:  super fast trig functions.  And a disclaimer for this thread, they're not really any faster than the built in solution unless you are willing to lose precision, which the specified functions don't specify as they loop until "INFINITY".  If you want super fast trig functions, you should look into something like Cordic functions, as mentioned in the Game Engine Gems 1, the important part being that you only need to use integers to calculate sin; though of course, it's an approximation, and there are other integer based methods besides this one as well if you do a little digging. So, here's the final sin function with the little tweaks:

template <class T>
T Sin(const T val, const T n)
{
	const T val2 = val * val;

	T num = val;

	T mult = val;

	T factTot = 1;

	const T odd = 2 * n + 1;
	for(T f = 3; f <= odd; f += 2)
	{
		mult *= -val2;

		factTot *= f * (f - 1);

		num += mult / factTot;
	}

	return num;
}

I realized from reading the mentioned thread on super fast trig functions, that I can cut a few things out of my function. I'm maintaining a variable for storing the exponents and the factorial results, which can both be cached in a single variable with the risk of the loss of a little precision, which isn't a big deal from what I've tested; actually, it still provides pretty accurate results.

template <class T>
T Sin(const T val, const T n)
{
	const T val2 = -val * val;

	T num = val;

	T expDivFact = val;

	const T odd = 2 * n + 1;
	for(T f = 3; f <= odd; f += 2)
	{
		expDivFact = expDivFact * val2 / (f * (f - 1));

		num += expDivFact;
	}

	return num;
}

So, the above Sin function is probably the fastest using the theory of Taylor series; though of course there might be other tweaks one could make.   If you want something faster look into this: CORDIC.

Archive