The point of this post is to go through the steps I've personally learned to take when optimizing a function, so please feel free to fill me in on something I don't know, add further comments, or ask questions.  I'm still an aspiring programmer, and I thought it would be cool to have a "full fledged" math library, cross-platform in nature, that I could tie into my game engine.  As such, I figured that if I ever needed my own data type, it would be useful to be able to perform mathematical operations on it, which means I might need my own math library; of course I could try converting to a format supported by the hardware, as that would typically be faster, but at this point I'm more concerned with learning what goes on behind the math library.  And if, for whatever reason, you needed higher precision, you'd need a custom math library anyway.

So, recently I started looking up how to program a sin function; I didn't find much, but I did find a Wikipedia article that detailed a mathematical model for sin, which you can find here: Trigonometric functions - Series definitions.  You can view the function below:
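For reference, the series in question is the Maclaurin series for sine:

sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ... = the sum over n >= 0 of (-1)^n * x^(2n+1) / (2n+1)!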

Now I'm going to write this function in C++, and the main question we're looking at here is: how do I go about optimizing a function? (I'm not so much concerned with language semantics as with basic fundamentals you can apply in just about any language.) Here's the function quickly typed up; notice I chose to write it as an iterative function rather than a recursive one.  A recursive version will generally be slower than an iterative one; the reasons aren't the point of this post, but they have to do with pushing stack frames, storing return values, unwinding the stack, and so on.

  1. Try to always write your performance function as an iterative function rather than a recursive one.
// size_t n, is the number of loops; the higher number of loops, the higher the precision
template <class T>
inline T MathUtil<T>::Sin(const T& val, const size_t n)
{
	T num = val;

	for(size_t i = 1; i <= n; ++i)
	{
		size_t odd = 2 * i + 1;

		// Factorial is written elsewhere
		num += powf(-1, i) * powf(val, odd) / Factorial(odd);
	}

	return num;
}

Which you could use like this:

// assuming Sin is the MathUtil member shown above:
double result = MathUtil<double>::Sin(1.0, 12);
// or:
result = MathUtil<double>::Sin(1, 12);

So this is a very simple function, and in release mode it runs close to the real thing.  Here are the results in debug: my sin: 3.57916ms, std sin: 0.346908ms.  If you look back at the series, you'll see the (-1)^n term.  It merely alternates between adding and subtracting each term of the series, which can be simplified to this:

// size_t n, is the number of loops; the higher number of loops, the higher the precision
template <class T>
inline T MathUtil<T>::Sin(const T& val, const size_t n)
{
	T num = val;

	for(size_t i = 1; i <= n; ++i)
	{
		size_t odd = 2 * i + 1;

		// Factorial is written elsewhere
		if(i % 2 == 0)
		{
			num += powf(val, odd) / MathUtil::Factorial(odd);
		}
		else
		{
			num -= powf(val, odd) / MathUtil::Factorial(odd);
		}
	}

	return num;
}

This results in new times (remember, times will vary): my sin: 2.68451ms, std sin: 0.348312ms.  Notice this does help our performance, though it's frankly a minimal change.  There are a few more things we can simplify out that are crucial: we're redoing the factorial calculations, as well as the power calculations, every loop.  We can accumulate these operations as the loop progresses.  Look back at the series and you'll notice that as we go through the loop, we start with 1!, then 3!, then 5!.  We're iterating through odd numbers, and we can compound the operations, since 5! contains 3! (5! = 5 * 4 * 3!) and 3! contains 1! (3! = 3 * 2 * 1!).  So we just need a variable to store the running factorial total and the number we're currently on; multiply the stored total by the number minus one and then by the number, as shown above.  Again, if the loop is at number 7, then we have 5! stored from the previous iteration; multiply that by 6 and then by 7.  In the next iteration, multiply 7! by 8 and then by 9, and so on.  We can do the same thing with the odd exponent.  We now have the new function below.

// size_t n, is the number of loops; the higher number of loops, the higher the precision
template <class T>
inline T MathUtil<T>::Sin(const T& val, const size_t n)
{
	T num = val;

	size_t factNum = 3;
	size_t factTot = 1;

	T mult = val;

	for(size_t i = 1; i <= n; ++i, factNum += 2)
	{
		mult *= -val * val;

		factTot *= factNum - 1;
		factTot *= factNum;

		num += mult / static_cast<T>(factTot);
	}

	return num;
}

So we got rid of the factorial and power functions altogether.  This is something you want to keep in mind when trying to make functions faster: minimize your function calls as much as possible.  I chose "val * val" instead of "pow(val,2)" for this reason.  Remember how we have to alternate adding and subtracting over the odd terms?  I do this by negating val and folding it into the variable mult, which flips sign every loop, doing just what I wanted; this is better than the first two solutions.  Here are the results: my sin: 0.586781ms, std sin: 0.351613ms

  1. Minimize function calls as much as possible, get rid of any that you can.  

And as we already started to do...

  1. Try to simplify the mathematical operations, using any tricks based off of patterns and mathematical properties you notice.

Anything else we can do? Look back at the series and at some of the variables we're now using. We're simplifying the exponential and factorial calculations using a running factorial total and mult for the exponential total. Since everything is multiplication, we can carry results from an earlier part of the equation into a later part, so we have fewer calculations to do.

// size_t n, is the number of loops; the higher number of loops, the higher the precision
template <class T>
T Sin(const T val, const size_t n)
{
	const T val2 = -val * val;

	T num = val;

	T mult = val;

	size_t factTot = 1;

	const size_t odd = 2 * n + 1;
	for(size_t f = 3; f <= odd; f += 2)
	{
		mult *= val2;

		factTot *= f - 1;
		factTot *= f;

		num += mult / static_cast<T>(factTot);
	}

	return num;
}

I'm still looking into other ways to improve this function, but this is all I have thus far.  Notice I moved "-val * val" into its own variable so we do one less negation and multiplication every loop of the algorithm; it doesn't affect the performance that much, but it's something that seems worth doing.  I'm currently reading this paper on Optimizing C++, which I highly suggest you give a read, and also consider reading this site on Writing Efficient C.  You'll notice that in Optimizing C++ they mention that switching between integers and floats is a performance hit; however, the extra time it would take to calculate factTot and f in floating point outweighs that conversion cost, so what I have above is the best approach from what I've tried: cast factTot to double (or whatever type you're using) and divide mult by it. The following is the test code I used to generate my results:

Benchmark std = Benchmark();
for(int i = 0; i < 100000; ++i)
{
	std.Start();
	double result6 = sin(1.0);
	std.Stop();
}
Sleep(50);
for(int i = 0; i < 100000; ++i)
{
	std.Start();
	double result6 = sin(1.0);
	std.Stop();
}
Sleep(100);
for(int i = 0; i < 100000; ++i)
{
	std.Start();
	double result2 = sin(1.0);
	std.Stop();
}

sTrace(make_string() << "std    Sin:" << std.Get_AvgMicroSeconds() << "ms");
sTrace(make_string() << "std MinSin:" << std.Get_MinMicroSeconds() << "ms");
sTrace(make_string() << "std MaxSin:" << std.Get_MaxMicroSeconds() << "ms");

Benchmark mys = Benchmark();
for(int i = 0; i < 100000; ++i)
{
	mys.Start();
	double result6 = Sin<double>(1,6);
	mys.Stop();
}
Sleep(50);
for(int i = 0; i < 100000; ++i)
{
	mys.Start();
	double result6 = Sin<double>(1,6);
	mys.Stop();
}
Sleep(100);
for(int i = 0; i < 100000; ++i)
{
	mys.Start();
	double result6 = Sin<double>(1,6);
	mys.Stop();
}

sTrace(make_string() << "my    Sin:" << mys.Get_AvgMicroSeconds() << "ms");
sTrace(make_string() << "my MinSin:" << mys.Get_MinMicroSeconds() << "ms");
sTrace(make_string() << "my MaxSin:" << mys.Get_MaxMicroSeconds() << "ms");

What the results show you is the average run-time, the minimum run-time, and the maximum run-time during my benchmarks.  Here is my first set of results in debug x64:

std    Sin:0.15552ms
std MinSin:0ms
std MaxSin:61.4676ms

my    Sin:0.194272ms
my MinSin:0ms
my MaxSin:61.4675ms

Here is my second set of results in debug mode x64:

std    Sin:0.154726ms
std MinSin:0ms
std MaxSin:48.567ms

my    Sin:0.192029ms
my MinSin:0ms
my MaxSin:55.7761ms

Here is a result from a release build x64:

std    Sin:0.0923657ms
std MinSin:0ms
std MaxSin:48.9464ms

my    Sin:0.0920976ms
my MinSin:0ms
my MaxSin:24.2835ms

Here the next result from the same release build x64:

std    Sin:0.0919509ms
std MinSin:0ms
std MaxSin:4.55315ms

my    Sin:0.0920293ms
my MinSin:0ms
my MaxSin:10.624ms

Result from release build x32:

std    Sin:0.190925ms
std MinSin:0ms
std MaxSin:184.78ms

my    Sin:0.187989ms
my MinSin:0ms
my MaxSin:173.777ms

Result from the same release build x32:

std    Sin:0.183714ms
std MinSin:0ms
std MaxSin:25.4215ms

my    Sin:0.183407ms
my MinSin:0ms
my MaxSin:77.0233ms

Just as a disclaimer, for those of you who don't know, these kinds of benchmarks aren't the end-all be-all of determining what is or isn't better.  Also, my benchmark util has its own overhead, so your own results may differ if you compare algorithms, though the results should be comparatively similar.  And obviously these functions didn't take 0ms; that has to do with the accuracy of the timer, which uses the Windows functions QueryPerformanceCounter and QueryPerformanceFrequency, which I've read are only accurate to about 1ms, and these measurements are smaller than that.  I'm running an Intel Core i7 920, clocked at 2.8322GHz during these tests.

It seems this algorithm runs close enough to the standard solution; however, keep in mind that depending on your hardware, the standard function could be calculated in hardware and could be even faster (unless of course my build is already doing that, in which case I'd be impressed at how well this algorithm is running).  You can also use this algorithm with smaller precision: Sin<double>(1,3) instead of Sin<double>(1,6).  In case you couldn't gather, the first number is the value we're taking the sin of, and the second number is the amount of looping being done, so the higher the number the higher the precision.  Realize, however, that there is a point where these floating point calculations won't get any more accurate due to the limitations of double and float.  There should be a function to calculate the level of precision necessary for each data type with this algorithm; I just don't have it at my fingertips at the moment.  Frankly, these numbers start to not mean too much as it's hard to get accurate results, but this was more of an academic goal than anything, so it's up to you whether you'd find something like this useful.

Honestly, this endeavor was so I could learn how to program sin and cos functions and then write a compile-time version that has no run-time cost; it has limited uses, but is still something that is cool and useful.  I'll detail that in a later post, but for now, here's the cos version of this algorithm, which builds off of even numbers instead of odd numbers.

// size_t n, is the number of loops; the higher number of loops, the higher the precision
template <class T>
T Cos(const T val, const size_t n)
{
	T cosResult = 1;

	size_t factorialTotal = 1;

	T multiplicationTotal = 1;

	const T val2 = -val * val;

	const size_t even = 2 * n;
	for(size_t f = 2; f <= even; f += 2)
	{
		multiplicationTotal *= val2;

		factorialTotal *= f - 1;
		factorialTotal *= f;

		cosResult += multiplicationTotal / static_cast<T>(factorialTotal);
	}

	return cosResult;
}

This is used similarly to the sin function: Cos(1,6) or Cos<>(1.0,6).   I have done everything I set out to do with this post, and I'll continue to experiment and see if I come up with anything else while doing more reading on C++ optimization, and thinking about whether there are other algorithmic shortcuts I can make.  (Yes, I realize there wasn't too much optimization done here; it was merely what I needed to do to reduce the factorial calculations and other multiplications.  Also, again, these basic steps for optimization are general things to keep in mind.)  I'll post updates when I find something, and if you have something you'd like to contribute please feel free, this is a good learning opportunity for all of us! (:

[UPDATE - 11/9/2011] I forgot how large factorial results can get, and recently realized that since I'm compiling for Win32, size_t won't get me far for storing them.  So, rather than just changing it to a long long or something, I'm changing it to the same datatype specified by the template, even if it costs a little calculation time by not using integers.  Actually, if I used a custom datatype with large precision, I would need to do this anyway, otherwise the function wouldn't be very precise. This update comes as I somehow stumbled across this site:  super fast trig functions.  And a disclaimer for that thread: the functions there aren't really any faster than the built-in solution unless you're willing to lose precision, since as posted they loop until "INFINITY".  If you want super fast trig functions, you should look into something like CORDIC, as mentioned in Game Engine Gems 1; the important part is that you only need integers to calculate sin, though of course it's an approximation, and there are other integer-based methods besides this one if you do a little digging. So, here's the final sin function with the little tweaks:

template <class T>
T Sin(const T val, const T n)
{
	const T val2 = val * val;

	T num = val;

	T mult = val;

	T factTot = 1;

	const T odd = 2 * n + 1;
	for(T f = 3; f <= odd; f += 2)
	{
		mult *= -val2;

		factTot *= f * (f - 1);

		num += mult / factTot;
	}

	return num;
}

I realized from reading the mentioned thread on super fast trig functions that I can cut a few more things out of my function. I was maintaining separate variables for the running exponent and the factorial total; both can be cached in a single variable at the risk of losing a little precision, which isn't a big deal from what I've tested; it still provides pretty accurate results.

template <class T>
T Sin(const T val, const T n)
{
	const T val2 = -val * val;

	T num = val;

	T expDivFact = val;

	const T odd = 2 * n + 1;
	for(T f = 3; f <= odd; f += 2)
	{
		expDivFact = expDivFact * val2 / (f * (f - 1));

		num += expDivFact;
	}

	return num;
}

So, the above Sin function is probably the fastest you'll get out of the Taylor series approach, though of course there might be other tweaks one could make.   If you want something faster, look into this: CORDIC.
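As a taste of that, here's a rough floating-point sketch of the CORDIC idea in rotation mode (a real implementation would use fixed-point integers and a precomputed arctan table; this also assumes the angle has already been range-reduced to roughly [-pi/2, pi/2]):

#include <cmath>
#include <cstddef>

double CordicSin(double angle, size_t iterations)
{
	double x = 1.0, y = 0.0, z = angle;
	double invGain = 1.0;	// accumulates 1/K, the inverse of the CORDIC gain
	double powOfTwo = 1.0;	// 2^-i

	for(size_t i = 0; i < iterations; ++i)
	{
		invGain /= std::sqrt(1.0 + powOfTwo * powOfTwo);

		// rotate the vector toward the remaining angle by +/- atan(2^-i)
		double sigma = (z < 0.0) ? -1.0 : 1.0;
		double xNew = x - sigma * y * powOfTwo;
		y = y + sigma * x * powOfTwo;
		x = xNew;
		z -= sigma * std::atan(powOfTwo);

		powOfTwo *= 0.5;
	}

	return y * invGain;	// x * invGain would give cos(angle)
}

In the fixed-point version each iteration boils down to shifts and adds, which is why it's attractive on hardware without an FPU.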

This class should have been posted long ago.  So, seeing how popular my C++ Vector2 class page has been, I've decided to finally post this for your needs.  This version is similar to the "improvements" made to the Vector2 class, supposedly helping encapsulation; frankly, that's up to you to decide.

 


Vector3.h

/*  __      __   ___     _____    ____
 *  \ \    / /  / _ \   |  __ \  |    |
 *   \ \/\/ /  / / \ \  | | / /  |  __|
 *    \_/\_/  /_/   \_\ |_| \_\  |_|
 *      Take it to the next Level
 *
 *  Copyright (c) 2009 Brian Ernst.
 */

#ifndef w_Vector3
#define w_Vector3

#include <cmath>

namespace _Warp
{
	typedef float	Scalar;
	typedef int		Bool;

	class Vector3
	{
	public:
		Scalar X;
		Scalar Y;
		Scalar Z;

		Vector3();
		Vector3(Scalar x, Scalar y, Scalar z);

		Vector3		operator+(const Vector3& vector) const;
		Vector3		operator-(const Vector3& vector) const;
		Vector3		operator-() const;
		Vector3		operator*(Scalar num) const;
		Vector3		operator/(Scalar num) const;

		Vector3&	operator+=(const Vector3& vector);
		Vector3&	operator-=(const Vector3& vector);
		Vector3&	operator*=(Scalar num);
		Vector3&	operator/=(Scalar num);

		Bool		operator==(const Vector3& vector) const;
		Bool		operator!=(const Vector3& vector) const;

		static const Vector3 Zero;
		static const Scalar	Epsilon;
	};

	inline Bool Vector3::operator==(const Vector3& vector) const
	{
		return X == vector.X && Y == vector.Y && Z == vector.Z;
	}

	inline Bool Vector3::operator!=(const Vector3& vector) const
	{
		return X != vector.X || Y != vector.Y || Z != vector.Z;
	}

	inline Vector3 Vector3::operator+(const Vector3& vector) const
	{
		return Vector3(X + vector.X, Y + vector.Y, Z + vector.Z);
	}

	inline Vector3 Vector3::operator-(const Vector3& vector) const
	{
		return Vector3(X - vector.X, Y - vector.Y, Z - vector.Z);
	}

	inline Vector3 Vector3::operator-() const
	{
		return Vector3(-X,-Y,-Z);
	}

	inline Vector3 Vector3::operator*(Scalar num) const
	{
		return Vector3(X * num, Y * num, Z * num);
	}

	inline Vector3 Vector3::operator/(Scalar num) const
	{
		return Vector3(X / num, Y / num, Z / num);
	}
}
#endif

Vector3.cpp:

#include "Vector3.h"

#include <limits>

namespace _Warp
{
	const Vector3 Vector3::Zero = Vector3(0,0,0);
	const Scalar Vector3::Epsilon = std::numeric_limits<Scalar>::epsilon();

	Vector3::Vector3()
	{
	}

	Vector3::Vector3(Scalar x, Scalar y, Scalar z)
		: X( x )
		, Y( y )
		, Z( z )
	{
	}

	Vector3& Vector3::operator+=(const Vector3& vector)
	{
		X += vector.X;
		Y += vector.Y;
		Z += vector.Z;
		return *this;
	}

	Vector3& Vector3::operator-=(const Vector3& vector)
	{
		X -= vector.X;
		Y -= vector.Y;
		Z -= vector.Z;
		return *this;
	}

	Vector3& Vector3::operator*=(Scalar num)
	{
		X *= num;
		Y *= num;
		Z *= num;
		return *this;
	}

	Vector3& Vector3::operator/=(Scalar num)
	{
		this->X /= num;
		this->Y /= num;
		this->Z /= num;
		return *this;
	}
}

Vector3Util.h

/*  __      __   ___     _____    ____
 *  \ \    / /  / _ \   |  __ \  |    |
 *   \ \/\/ /  / / \ \  | | / /  |  __|
 *    \_/\_/  /_/   \_\ |_| \_\  |_|
 *      Take it to the next Level
 *
 *  Copyright (c) 2009 Brian Ernst.
 */

#ifndef w_Vector3Util
#define w_Vector3Util

#include "Vector3.h"

// These two files are not detailed out in this blog post.
#include "Quaternion.h"
#include "TMatrixUtil.h"

namespace _Warp
{
	Scalar len(const Vector3& vect);
	Scalar len2(const Vector3& vect);

	void Clamp(Vector3& vect,Scalar length);
	void Normalize(Vector3& vect);
	void Normalize_s(Vector3& vect);
	void SetLength(Vector3& vect, Scalar length);
	void SetLength_s(Vector3& vect, Scalar length);

	Scalar	Dot(const Vector3& vec1, const Vector3& vec2);
	Scalar	GetAngle(Vector3 vec1, Vector3 vec2);

	Vector3	ToNormalized(const Vector3& vect);
	Vector3	ToNormalized_s(const Vector3& vect);
	Vector3	ToPolar(Scalar x, Scalar y, Scalar z);
	Vector3	ToCartesian(Scalar radius, Scalar angle, Scalar z);
	Vector3	Cross(const Vector3& vec1, const Vector3& vec2);
	Vector3	Rotate(const Vector3& vec1, Scalar angle, const Vector3& axis);

	Vector3 ToEuler(Vector3 axis, Scalar angle);

	inline Scalar len(const Vector3& vect)
	{
		return sqrt(vect.X * vect.X + vect.Y * vect.Y + vect.Z * vect.Z);
	}

	inline Scalar len2(const Vector3& vect)
	{
		return vect.X * vect.X + vect.Y * vect.Y + vect.Z * vect.Z;
	}

	inline void Normalize(Vector3& vect)
	{
		vect /= len(vect);
	}

	inline void SetLength(Vector3& vect, Scalar length)
	{
		vect *= length / len(vect);
	}

	inline Scalar	Dot(const Vector3& vec1, const Vector3& vec2)
	{
		return vec1.X * vec2.X + vec1.Y * vec2.Y + vec1.Z * vec2.Z;
	}

	inline Vector3 ToNormalized(const Vector3& vect)
	{
		return vect / len(vect);
	}

	// This uses a Quaternion combined with the Matrix Utility, neither of which are detailed out in this post.
	inline Vector3	Rotate(const Vector3& vec1, Scalar angle, const Vector3& axis)
	{
		return TransformCoord(Quaternion::FromAxis(axis.X,axis.Y,axis.Z,angle).Get_RotationMatrix(),vec1);
	}

	inline Vector3	ToPolar(Scalar x, Scalar y, Scalar z)
	{
		return Vector3(
			atan2(y,x),
			sqrt(x * x + y * y),
			z);
	}

	inline Vector3	ToCartesian(Scalar radius, Scalar angle, Scalar z)
	{
		return Vector3(
			radius * cos(angle),
			radius * sin(angle),
			z);
	}

	inline Vector3	Cross(const Vector3& vec1, const Vector3& vec2)
	{
		return Vector3(
			vec1.Y * vec2.Z - vec1.Z * vec2.Y,
			vec1.Z * vec2.X - vec1.X * vec2.Z,
			vec1.X * vec2.Y - vec1.Y * vec2.X);
	}
}

#endif

Vector3Util.cpp

#include "Vector3Util.h"

#include "../Constants.h"

namespace _Warp
{
	void Clamp(Vector3& vect,Scalar length)
	{
		Scalar vecLength = len2(vect);

		if(vecLength <= length * length)
		{
			return;
		}

		vect *= length / sqrt(vecLength);
	}

	void Normalize_s(Vector3& vect)
	{
		Scalar vecLength = len2(vect);

		if(vecLength == 0)
		{
			return;
		}

		vect /= sqrt(vecLength);
	}

	void SetLength_s(Vector3& vect, Scalar length)
	{
		Scalar vecLength = len2(vect);

		if(vecLength == 0)
		{
			return;
		}

		vect *= length / sqrt(vecLength);
	}

	Scalar	GetAngle(Vector3 vec1, Vector3 vec2)
	{
		if(vec1 == vec2)
		{
			return 0.0f;
		}

		Normalize_s(vec1);
		Normalize_s(vec2);

		Scalar dot = Dot(vec1, vec2) / (len(vec1) * len(vec2));

		dot = dot > 1.0f ? 1.0f : ( dot < -1.0f ? -1.0f : dot );

		return std::acos(dot);
	}

	Vector3 ToNormalized_s(const Vector3& vect)
	{
		Scalar vecLength = len2(vect);

		if(vecLength == 0)
		{
			return vect;
		}

		vecLength = sqrt(vecLength);

		return Vector3(vect.X / vecLength, vect.Y / vecLength, vect.Z / vecLength);
	}

	// Thanks to Martin Baker for this solution
	// http://www.euclideanspace.com/maths/geometry/rotations/conversions/angleToEuler/index.htm
	Vector3 ToEuler(Vector3 axis, Scalar angle)
	{
		Vector3 out = Vector3();

		Scalar s = sin(angle);
		Scalar c = cos(angle);
		Scalar t = (Scalar)1.0 - c;

		if ((axis.X * axis.Y * t + axis.Z * s) > (Scalar)0.998)// north pole singularity detected
		{
			out.Y = 2 * atan2(axis.X * sin(angle/2), cos(angle/2));
			out.Z = static_cast<Scalar>(W_PI_2);
			out.X = 0;
			return out;
		}
		if ((axis.X * axis.Y * t + axis.Z * s) < (Scalar)-0.998)// south pole singularity detected
		{
			out.Y = (Scalar)-2.0 * atan2(axis.X * sin(angle / (Scalar)2.0), cos(angle / (Scalar)2.0));
			out.Z = -static_cast<Scalar>(W_PI_2);
			out.X = 0;
			return out;
		}
		out.Y = atan2(axis.Y * s - axis.X * axis.Z * t , 1 - (axis.Y * axis.Y + axis.Z * axis.Z ) * t);
		out.Z = asin(axis.X * axis.Y * t + axis.Z * s) ;
		out.X = atan2(axis.X * s - axis.Y * axis.Z * t , 1 - (axis.X * axis.X + axis.Z * axis.Z) * t);
		return out;
	}
}

You'll notice in the Vector3Util I have some functions ending with "_s", like "Normalize_s"; this is a notation I'm using to mean, in this case, a safe normalize call (one that won't result in a divide-by-zero).  So, if you know the vector you're using can't possibly be zero, use the standard Normalize function; if not, feel free to use Normalize_s.  Please let me know if you have a comment on the notation!
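For example:

Vector3 heading = Vector3(0.0f, 3.0f, 4.0f);
Normalize(heading);		// fine: we know this vector can't be zero

Vector3 velocity = Vector3(0.0f, 0.0f, 0.0f);	// e.g. an object at rest
Normalize_s(velocity);	// safe: the zero vector is left untouched instead of dividing by zero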

Anyway, this Vector3 class and utility are code I use in my personal game development and in other school projects, and I've used them and my personal math libs in a ray tracer.  Concerning the util, I'm still tweaking it, making decisions on how to wrap it in a namespace (like its own _Math namespace, which was taken out for this post), and on whether I really need a separate util instead of having these as member functions, which would be more convenient with IntelliSense at hand. If you're wondering what I'm talking about, check out this article at Dr Dobbs on encapsulation.  I personally have to give it another read and decide whether I agree with it and really want to stick with that methodology.

 

Please leave your thoughts and suggestions!
[I'll be uploading the upgraded version of this Vector3 class sometime soon, I realize there are parts of this version that may be a little rough/unfinished, and after that I'll include the source files for you to download]

[Disclaimer : All benchmarks were done in debug mode on an i7 920 system with no compiler optimizations]

After browsing some C++ util functions on StackOverflow today, I became intrigued by one for trimming strings, since it used some std library tricks I'd never seen before, and decided to compare it against my own. After doing benchmark tests, I was excited to find my own implementation appeared to blow his out of the water. After 1,000,000 tests, his averaged about 27 microseconds, while mine at first averaged less than 5 microseconds. After tweaking mine a bit, my trim_trailingWhitespace function averaged close to 4 microseconds over 1,000,000 samples. Also of note: his function works directly on the original string, while mine returns a new copy, so it's even more impressive.

Here's another benchmark using 1,000,000 samples.
Using the following string: "          "
His: 37.88 microseconds
Mine: 4.03 microseconds

Benchmark 3 using the same # of samples as before.
Using the following string: "  The world ends with you!   "
His: 30.37 microseconds
Mine: 6.26 microseconds

The reason for this post, is because I was excited to find out my implementation is extremely fast, and I felt like sharing it with the world to see if it's possible to do even better! I'm still thinking of ways to improve my functions, but please let me know if you can best me in any way, I'm here to learn!


The only header you should need is the following, with the proper using statement too, of course:

#include <string>

using std::basic_string;

This is the trim function used by my trim_trailingWhitespace

template<typename CharType>
static basic_string<CharType> trim(const basic_string<CharType>& str, size_t index, size_t len)
{
	size_t length = str.size();
	if(index >= length)
	{
		return basic_string<CharType>();
	}

	length -= index;
	length = length > len ? len : length;

	return str.substr(index, length);
}

And this is my function to trim the whitespace at the end of the string:

template<typename CharType>
static basic_string<CharType> trim_trailingWhitespace(const basic_string<CharType>& str)
{
	size_t length = str.size();
	if(length == 0)
	{
		return str;
	}

	register size_t index = length - 1;

	while(' ' == str[index])
	{
		if(index != 0)
		{
			--index;
		}
		else
		{
			return basic_string<CharType>();
		}
	}

	return trim(str, 0, ++index);
}

You'll notice I'm using some funky template stuff here, and basic_string; this allows you to use whatever kind of string you want: single-byte characters (string), wide characters (wstring), or whatever. And the beauty of it is, if you pass in a proper string type like string or wstring, you don't need to specify the template parameters; the compiler deduces them for you, so you can just call the function by name without worrying about the template details!
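For example, a quick sketch:

std::string s = "The world ends with you!   ";
std::wstring ws = L"Wide strings work too   ";

std::string trimmed = trim_trailingWhitespace(s);		// CharType deduced as char
std::wstring wideTrimmed = trim_trailingWhitespace(ws);	// CharType deduced as wchar_t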

I haven't gotten a chance to rewrite my trim_leadingWhitespace function, so that will be coming soon; and yes, I will rename these once I decide upon a better name.


After doing some further research I stumbled across this implementation.  I realized my comparison above to the first algorithm wasn't entirely fair, as that implementation wasn't even competitive, even though it worked in-place.  So, here's the source for an actually efficient implementation of trimming trailing whitespace:

void rtrim(string& str, const locale& loc)
{
  string::size_type pos = str.size();
  while (pos > 0 && isspace(str[pos - 1], loc)) pos--;
  str.erase(pos);
}

By replacing isspace with ' ' == you can make this function much faster, though obviously that only handles the plain space character for our locale, and may differ across other languages. Another possible reason it isn't as fast as it could be is the str[pos - 1]: he's doing an extra subtraction per loop besides the pos--. His solution gets around the problem of underflowing an unsigned int (size_t) that mine originally had, but lucky for me my rewrite made mine even more efficient because of that trick he had to do.
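Here's roughly what that tweak looks like (a sketch; rtrim_fast is just my name for it, and it only handles the literal space character):

void rtrim_fast(string& str)
{
	string::size_type pos = str.size();

	// compare directly against ' ' instead of calling isspace with a locale
	while(pos > 0 && ' ' == str[pos - 1]) pos--;

	str.erase(pos);
}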

And here are the details on isspace. I think my check works fine for both string and wstring and the like; if not, please let me know.  [Edit] You could pass the locale's space/separator character in as a parameter to the function if that's an issue.

As the above solution works in place, I decided to tweak my original function to work in place on a string, and am excited to say that it beats out this obviously improved implementation.  Here's my new in-place implementation:

template<typename CharType>
static void trim_trailingWhitespace(basic_string<CharType>& str)
{
	size_t length = str.size();
	if(length == 0)
	{
		return;
	}

	register size_t index = length - 1;

	while(' ' == str[index])
	{
		if(index != 0)
		{
			--index;
		}
		else
		{
			str.clear();
			return;
		}
	}
	str.erase(++index);
}

So, my implementation provides shortcuts for an empty string and for the case where the entire string is spaces (where you see str.clear()). It also loops through the string differently from the other implementation, which lets mine be a bit quicker, probably because I only have to evaluate str[index] (besides the above-mentioned shortcuts).

Benchmark 1:
String: "      "
His: 2.97 microseconds
His + my tweak: 2.68 microseconds
Mine:2.54 microseconds

Benchmark 2:
String: "  The world ends with you!   "
His: 3.93 microseconds
His + my tweak: 3.68 microseconds
Mine: 3.65 microseconds

As noted at the top, the above tests were done in debug mode with no optimizations.  If you turn on optimizations, you'll notice the two algorithms here are neck and neck; it doesn't really matter which you use.  Heck, even the algorithm from StackOverflow that's nowhere near as efficient in debug suddenly becomes nearly comparable.  Though it's still nice to have code that runs fast in debug too, it does raise the question of whether this research was worth the time; as they say, don't optimize until you have to.  It's been a nice educational endeavor so far, so I don't really mind if it might be considered wasted time, as I've learned more than a thing or two.  Thoughts?

And here's the link to the StackOverflow page of useful C++ utilities: Most Useful C++ Utils.

So if you're somewhat familiar with C++, you know you can't easily implement a nice generic delegate that works for both static functions and member functions of arbitrary classes.  However, with some research and thought, you'll realize it's not as hard as you think!

First, I would like to share this article that inspired me and caused me to gain hope in that delegates are possible in a nice flexible way!  Here it is, on: Code Project.

Okay, let's first think about this: we want a delegate to hide away the C++ syntax so we can call any kind of function, pass in certain parameters, and get something back.  To make this flexible, it makes sense for the base delegate class to be a template, so you can define which types you want to use.  So here is the base class, which I'll call IDelegate, as in Interface Delegate.

// Interface for a Delegate class, for dispatching a lookup function call.
// R is the return type
// P is the parameter type
template <typename R,typename P>
class IDelegate
{
public:
	virtual ~IDelegate(){}

	virtual R Invoke(P) = 0;
};

You could also just as easily use an operator overload instead of Invoke:

virtual R operator()(P) = 0;

Now we just need to worry about creating a subclass so we can use these Delegates! So, we'll start with the Static Delegate since it'll be the simplest to create. To create a Static Delegate, we'll need a StaticDelegate class that accepts a static function pointer, and that's all it needs. So we have that below:

// This class is for attaching a Static Function Pointer to an IDelegate
// R is the return type
// P is the parameter type
template <typename R,typename P>
class StaticDelegate
	: public IDelegate<R,P>
{
	typedef R (*funcPtr)(P);

protected:
	funcPtr _funcPtr;

public:
	StaticDelegate(funcPtr functionPointer)
		: _funcPtr( functionPointer )
	{
	}

	R Invoke(P param)
	{
		return _funcPtr(param);
	}
};

For those of you who don't know, this is a Static Function Pointer:

typedef R (*funcPtr)(P);

And in case you didn't know, the parentheses around the name are required; that's the syntax for declaring a pointer to a function rather than a function that returns a pointer.

Okay, so what we have here is really simple to use:

void PrintInt(int num)
{
	std::cout << num << std::endl;
}
typedef StaticDelegate<void,int> SDelegate;

IDelegate<void,int>* delegate = new SDelegate(&PrintInt);
delegate->Invoke(5);	// prints 5

I do encourage you to use typedefs to make your life easier, this becomes more obvious why you might want to do that when you're working with Class Member Function Pointers. In an event system, to be as flexible as possible, I do suggest trying to stick with one type of delegate.

Okay, the above shows you what you need to do to get a Static Function working with an IDelegate, now we need to get something to encapsulate a Member Function Pointer. For lack of a better name, I'll be using Type Delegate, as you do have to specify a class type that the member function belongs to.

// This class is for attaching a Member Function Pointer to an IDelegate
// R is the return type
// T is the Class of this Member Function Pointer
// P is the parameter type
template <typename R,typename T,typename P>
class TypeDelegate
	: public IDelegate<R,P>
{
	typedef R (T::*funcPtr)(P);// this is a function pointer definition
protected:
	funcPtr _funcPtr;
	T& _objPtr;

public:
	TypeDelegate(T& objPtr, funcPtr functionPointer)
		: _objPtr( objPtr )
		, _funcPtr( functionPointer )
	{
	}

	R Invoke(P param)
	{
		return (_objPtr.*_funcPtr)(param);
	}
};

Okay, so to create a TypeDelegate you need to pass in an object of the class you specified, and a function pointer to the class. So you can try this:

class Object1
{
public:
	typedef TypeDelegate<void,Object1,Object_Event::ENUM> TDelegate;

	Object1(){}

	void CatchEvent(Object_Event::ENUM eventID)
	{
		//...
	}
};
//  Somewhere in your code...
Object1 obj1 = Object1();

IDelegate<void,Object_Event::ENUM>* delegate = new Object1::TDelegate(obj1,&Object1::CatchEvent);

//or inside of obj1:
delegate = new TDelegate(*this,&Object1::CatchEvent);

Again, please feel free to use typedefs, as it'll leave you with less code to type. I also suggest using this one system for all your delegate needs, whether you need parameters or not. I think this system is beautiful as it is because it's simple and flexible, it's not hard to tell what it does, and it didn't require a mess to get working like I thought it might. If you want to use this type of delegate with no params, you can define an empty class like this:

// use as a Parameter type instead of void inside of IDelegate, as you'll get a compiler error
class Void
{
public:
	Void(){}
};

Yes, as indicated in the comment, if you try to use "void" as the parameter type for the IDelegate subclasses, you will get a compiler error. So, as an alternative, use "Void" and pass Void() as the argument to the delegate; or just use a "void*" and set it to NULL (or a Void*, or whatever you want); just make it obvious to functions using the delegate that it's effectively a void type.

So, as I suggested, to keep the delegate system simple when you want multiple parameters, just use a struct that defines them. One reason I prefer this: if you didn't use a single struct, you'd have to rely on every function parameter being properly named to know what it does. With a struct, the variable names are whatever you define them to be, which helps for documentation purposes: when someone wants to find out what gets passed, they only need to look in one place, the struct definition.
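For example, a minimal sketch (CollisionInfo, PhysicsListener, and their members are hypothetical names, not from any real system):

// bundle the parameters for a collision callback into one struct
struct CollisionInfo
{
	int ObjectA;
	int ObjectB;
	float Impulse;
};

class PhysicsListener
{
public:
	void OnCollision(const CollisionInfo& info)
	{
		// react to the collision...
	}
};

typedef TypeDelegate<void,PhysicsListener,const CollisionInfo&> CollisionDelegate;

// somewhere in your code...
PhysicsListener listener;
CollisionDelegate onCollision(listener, &PhysicsListener::OnCollision);

CollisionInfo info = { 1, 2, 0.5f };
onCollision.Invoke(info);

If you later need another parameter, you just add a field to the struct and nothing about the delegate types has to change.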

 

[UPDATE:]
Source Files: C++ Delegates.

Looking to do multithreaded programming? To do that you're going to need a mutex to prevent threads from accessing the same data at the same time! There are different techniques for keeping threads from stomping on shared data, but they can't be guaranteed by user code alone; the guarantee has to be implemented on the hardware side and exposed through the OS or what have you. Luckily, we can do this using some code from the Windows header files, using what are called Critical Section Objects, which work between threads of the same process and are apparently a "slightly faster, more efficient mechanism for mutual-exclusion synchronization." Sweeet!

Mutex_win32.h:

#ifndef w_Mutex_win32
#define w_Mutex_win32

#include <windows.h>

#include "../Definitions.h"

namespace _Warp
{
	class Mutex_win32
	{
	protected:
		CRITICAL_SECTION _critSection;

	public:
		Mutex_win32()
		{
			InitializeCriticalSection (&_critSection);
		}

		~Mutex_win32()
		{
			DeleteCriticalSection (&_critSection);
		}

		void Lock()
		{
			EnterCriticalSection (&_critSection);
		}
		void Unlock()
		{
			LeaveCriticalSection (&_critSection);
		}

		Bool TryLock()
		{
			return (Bool)TryEnterCriticalSection(& _critSection);
		}
	};
}
#endif

Now, threading problems can be hard to track down and debug, especially if you forget to unlock a mutex. A Lock object is a way to help you clean up and unlock the mutex when you're done. A Lock is created on the stack in a local block of code, so when the program leaves that block the Lock is destroyed, its destructor is called, and in the process the mutex is automatically unlocked.

Lock.h:

#ifndef w_Lock
#define w_Lock

#include "../Data/Uncopyable.h"

#include "Mutex_win32.h"

namespace _Warp
{
	class Lock
		: private Uncopyable
	{
	protected:
		Mutex_win32* _pMutex;

	public:
		explicit Lock(Mutex_win32* pMutex);
		~Lock();
	};

	inline Lock::Lock(Mutex_win32* pMutex)
		: _pMutex( pMutex )
	{
		if(_pMutex != NULL)
		{
			_pMutex->Lock();
		}
	}

	inline Lock::~Lock()
	{
		if(_pMutex != NULL)
		{
			_pMutex->Unlock();
		}
	}
};
#endif

Here's an example of how to use a Lock and Mutex:

Mutex_win32* _pMessageListMutex = new Mutex_win32();	// a mutex would typically live in the header as a member variable
list<string> _messages = list<string>();				// the shared data the mutex protects

void function()
{
	Lock lock = Lock(_pMessageListMutex);
	/* do stuff to the list */
}// when execution reaches here, the Lock is destroyed and _pMessageListMutex is unlocked, so another thread can access the _messages list without worrying about this one touching it at the same time

In case you didn't notice, the Lock class inherits from an "Uncopyable" class that doesn't allow a Lock to be copied or constructed from another Lock.  This helps prevent errors; a lock shouldn't be copied anyway, and this class ensures it can't be.  The Uncopyable class was inspired by the Boost noncopyable class, though that wasn't the first place I ran across something very similar.

Uncopyable.h:

#ifndef w_Uncopyable
#define w_Uncopyable

namespace _Warp
{
	// Inherit from this if you want a class to be unable to be copy-constructed.
	class Uncopyable
	{
	private:
		Uncopyable(const Uncopyable&);
		Uncopyable& operator=(const Uncopyable&);

	protected:
		Uncopyable(){}
		virtual ~Uncopyable(){}
	};
};
#endif

For those of you who don't know, the virtual destructor is there so that if a class inherits from Uncopyable and you delete it through a base-class pointer, the child's destructor is still properly called before the parent's; it only matters if you store an object as a base type instead of its own type. Also, because the copy constructor and copy assignment operator are private (and never defined), you'll get a compiler error preventing you from doing anything you shouldn't be!
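For example (myMutex here is just a hypothetical Mutex_win32 instance):

Mutex_win32 myMutex;
{
	Lock guard(&myMutex);		// the critical section is entered here

	// ... touch the shared data ...

	//Lock copyOfGuard(guard);	// uncommenting this gives a compiler error: the copy constructor is private
}								// guard goes out of scope here and the critical section is left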

Both the Lock and Mutex classes are pretty simple, so there isn't really much variation to them; I don't remember exactly which sites I used to look up the Windows functions, otherwise I'd credit them for their help.  Either way, if you've had trouble finding C++ Mutex and Lock classes, I hope you find these useful for your purposes!

You can download the project files here: Mutex, Lock, and necessary class files.
