### Thread: efficient way to find only the closest objects

1. No Profile Picture
Contributing User
Devshed Newbie (0 - 499 posts)

Join Date
Mar 2002
Location
Oxford, UK
Posts
28
Rep Power
0

#### Fast Distance calculation

Hi,

I noticed that a lot of people have been advocating multipass solutions to this problem, i.e.:

1) Quick Reject in X
2) Quick Reject in Y
3) Distance calculation

I would, however, be rather surprised if it weren't faster to do a single-pass test based on the squared magnitude of the distance:

Code:
```
if (dx*dx + dy*dy > rad2)
{
    // ...
}
```
I'll do some timing results and see.
2. Contributing User (Oxford, UK)

#### Results.

Basically, a single magnitude squared test turns out to be about 20-25% faster than the two-layer test, on a P4 under VC++ 6.0.

Time for 10^8 distance rejection/acceptance calculations was:

1) Magnitude squared: 917084 / (1.19 * 10^6) s

2) Two-layer: 1286742 / (1.19 * 10^6) s

The code follows. arrayx[] and arrayy[] are initialised to random values in the range 0...1000, as are px and py.

Magnitude squared:
Code:
```
for (ctr = 0, count = 0; ctr < 1000; ctr++)
{
    dx = px - arrayx[ctr];
    dy = py - arrayy[ctr];
    if (dx*dx + dy*dy < 100)
    {
        count++;
    }
}
```
Two-layer + magnitude squared:
Code:
```
for (ctr = 0, count = 0; ctr < 1000; ctr++)
{
    dx = px - arrayx[ctr];
    dy = py - arrayy[ctr];
    if (dx > -10 && dx < 10 && dy > -10 && dy < 10)
    {
        if (dx*dx + dy*dy < 100) count++;
    }
}
```
Neither of these is safe if dx*dx + dy*dy > 0x7FFFFFFF. Options for making this safe include using 64-bit integers for the mag^2 calculations, or using double-precision numbers.
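A minimal sketch of the 64-bit option (function and parameter names are mine, not from the thread): widening dx and dy to int64_t before the multiply means the sum of squares cannot wrap past 0x7FFFFFFF.

```cpp
#include <cstdint>

// Overflow-safe squared-distance test: widen to 64 bits before multiplying,
// so dx*dx + dy*dy cannot overflow even for coordinates near INT32_MAX.
bool within_radius(int32_t dx, int32_t dy, int64_t rad2)
{
    int64_t dx64 = dx, dy64 = dy;
    return dx64 * dx64 + dy64 * dy64 < rad2;
}
```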

The timing results for the double-precision case are even more skewed in favor of magnitude squared only:

10^8 double precision distance tests:

1) Mag-squared = 1.5582 / 1.19 s

2) Two-layer = 4.2211 / 1.19 s
Last edited by The Ostrich; March 13th, 2002 at 09:32 PM.
3. Contributing User (Oxford, UK)
I figure that the 3x discrepancy in the timing results for the double-precision case is probably because either dx or dy is being spilled to / reloaded from memory more often.

Or it could just be that the more complex condition breaks the branch prediction.
4. Contributing User (Oxford, UK)

#### Humble Pie

Whoops.

If I replace that UGLY if statement with

Code:
```
// integer version
if (abs(dx) < 10 && abs(dy) < 10) ...
```
Code:
```
// double precision
if (fabs(dx) < 10 && fabs(dy) < 10) ...
```
Then the axis-aligned bounding box test is faster. Duh. After that change and some inner-loop optimisation, we get:

Integer:

1) Mag^2 test: 9.4 / 1.19 s

2) Two-level: 6.7 / 1.19 s

Double Prec

1) Mag^2 test: 1.52 / 1.19 s

2) Two-level: 9.2 / 1.19 s
Last edited by The Ostrich; March 13th, 2002 at 09:36 PM.
5. Junior Member (joined Dec 2001)
so... what you're saying is... the two loops are faster if you're dealing with integers, and one loop is faster if dealing with doubles (like me)?
6. Contributing User (Oxford, UK)
No. One loop which does two tests is faster in both cases, provided that you use the abs() and fabs() macros / functions. What I mean is:

Code:
```
N = ...
N2 = N*N;

for (ctr = 0; ctr < thingy; ctr++)
{
    dx = ...
    dy = ...
    if (fabs(dx) < N && fabs(dy) < N)
    {
        if (dx*dx + dy*dy < N2) ...
    }
}
```
I believe that this form, and this form only, is the fastest one.

Important notes:

1) There is only one loop
2) fabs(n) is used
3) I never take a square root.
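Filled out into a self-contained routine (the array and parameter names are mine), the form above looks like:

```cpp
#include <cmath>

// Count the points within distance N of (px, py): one loop, a fabs() box
// rejection first, then the squared-magnitude test -- no square root anywhere.
int count_near(const double* xs, const double* ys, int n,
               double px, double py, double N)
{
    const double N2 = N * N;
    int count = 0;
    for (int ctr = 0; ctr < n; ++ctr)
    {
        double dx = px - xs[ctr];
        double dy = py - ys[ctr];
        if (fabs(dx) < N && fabs(dy) < N)
        {
            if (dx * dx + dy * dy < N2) ++count;
        }
    }
    return count;
}
```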
Last edited by The Ostrich; March 14th, 2002 at 08:18 PM.
7. epl (Contributing User, Dublin)
I've no idea of the speed gains / tradeoffs here, but you could also complement the quick rejection with a quick acceptance before resorting to the distance formula. For this two-dimensional example, 50% of the unknown area (that not rejected by testing the axes individually) can be quickly accepted.

Quick reject if distance > 20 on either axis
Quick accept if distance < 20/(2^0.5) on both axes

(I think - off the top of my head)

Again don't know if this speeds things up but I'd guess so...
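A sketch of the reject / accept / full-test split epl describes (the function is mine): a point whose offsets are both below r/sqrt(2) lies inside the square inscribed in the circle, so it can be accepted without any multiplies.

```cpp
#include <cmath>

// Three-stage point-in-circle test for a circle of radius r:
// outside the bounding box        -> quick reject,
// inside the inscribed square     -> quick accept (half-side r/sqrt(2)),
// otherwise                       -> full squared-distance test.
bool within(double dx, double dy, double r)
{
    double ax = fabs(dx), ay = fabs(dy);
    if (ax > r || ay > r) return false;
    if (ax < r / sqrt(2.0) && ay < r / sqrt(2.0)) return true;
    return dx * dx + dy * dy <= r * r;
}
```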
8. Junior Member (joined Mar 2002)
> I believe that this form, and this form only, is the fastest one.

Wrong, at least if you're writing C++ code.

This code should be a little faster:
Code:
```
const float maxDist = 20;
const float maxDist2 = maxDist * maxDist;

for (pointsDatatype* cur = points; cur != points.end(); ++cur)
{
    float distx = testPoint.x - cur->x;
    if (fabs(distx) > maxDist) continue;
    float disty = testPoint.y - cur->y;
    if (fabs(disty) > maxDist
        || (distx*distx + disty*disty) > maxDist2) continue;
    // if we have reached this far, the point is near
    // do something now ...
}
```
1. There is no need to calculate disty when the point can already be rejected by calculating distx
2. Use pointers instead of an integer and the [] operator
3. Use ++cur instead of cur++

This code is still not the fastest; with loop unrolling it should be at least 20% or so faster.

Do you need to do this calculation for every point? If so, there are better algorithms to do this test.
Last edited by martinus; March 26th, 2002 at 06:14 PM.
9. Canta como rafaé (Barcelona)
Well, if we are talking about fast code, this is far from the fastest code.

Code:
```
const float maxDist = 20;
const float maxDist2 = maxDist * maxDist;

for (pointsDatatype* cur = points; cur != points.end(); ++cur)
{
    float distx = testPoint.x - cur->x;
    if (fabs(distx) > maxDist) continue;
    float disty = testPoint.y - cur->y;
    if (fabs(disty) > maxDist
        || (distx*distx + disty*disty) > maxDist2) continue;
    // if we have reached this far, the point is near
    // do something now ...
}
```
I don't have time to write new code, but I see that

1 - You are using "const float" instead of "#define". Do you think this is an improvement? Every time you do if (... > maxDist), the program has to go to the data area, fetch the value, and then compare two registers, instead of simply comparing against a constant value.

2 - You are calling the function "points.end()" on every iteration of the "for" loop. Isn't it more efficient to keep that value in a register variable and avoid the overhead of a function call?

3 - For the first rejections, I wouldn't use float values. It does not matter if the rejection checks are not as precise as with floats; it's worthwhile to use integers.
Code:
`Example (integer version in parentheses): 23.7 (23) - 3.2 (3) = 20.5 (20) => rejected (not rejected)`
But the number of mistakes would be very small, and they would be caught by the real distance test.

4 - With integers, you can always test the sign bit yourself:
Code:
`if (res & 0x80000000) res = -res;`
and avoid another function call.
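For what it's worth, the sign-bit idea can be taken a step further to remove the branch entirely (a standard trick, not from the thread):

```cpp
#include <cstdint>

// Branchless integer abs: the arithmetic right shift smears the sign bit
// into a mask of all-ones (negative) or all-zeros (non-negative); XOR then
// subtract negates exactly when the mask is set. Undefined for INT32_MIN.
int32_t branchless_abs(int32_t res)
{
    int32_t mask = res >> 31;   // -1 if negative, 0 otherwise
    return (res ^ mask) - mask;
}
```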

5 - Loop unrolling only works when the size of the array is finite.

6 - If you use C++, all of those testPoint.x accesses are more costly than iterating over an array in C.

7 - And the last question... why is it more efficient to use ++cur instead of cur++?

The only improvement in your code is not calculating disty when you can reject by distx alone.

Where are the days when, if we wanted speed, we went directly to assembler? But if you don't know assembler, you can always try -O3 in gcc and hope for a bit of luck.
10. Junior Member (joined Mar 2002)
1. const float: with every modern compiler this should be as fast as a #define. One should not use #defines in C++ code; inlined code is much cleaner and just as fast.

2. There is no function-call overhead: this method is inline and does not need any calculation.

3. Have a look at this table: Relative Performance of Common Operations (PIII-733). The * and + operations are as fast as for integers, so you should stick to floats. Converting from float to int and back would also cost time.

4. fabs() should also be inlined; it maps to a single assembler instruction which clears the sign bit.

5. I have never seen an infinite array; probably I do not have enough RAM.
If you mean that unrolling only works for fixed-size arrays, have a look at this code:
Code:
```
int i;
for (i = 0; i < max % 4; ++i)
    doSomething;

for (; i < max; i += 4) {
    doSomething;
    doSomething;
    doSomething;
    doSomething;
}
```
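Made concrete by summing an array (names are mine), the remainder-first pattern looks like this: the max % 4 leftover elements are handled first, so the unrolled loop needs no bounds fix-up at the end.

```cpp
// Unrolled array sum: a short prologue consumes the max % 4 leftover
// elements, then the main loop processes exactly four elements per pass.
int sum_unrolled(const int* a, int max)
{
    int total = 0;
    int i = 0;
    for (; i < max % 4; ++i)
        total += a[i];
    for (; i < max; i += 4) {
        total += a[i];
        total += a[i + 1];
        total += a[i + 2];
        total += a[i + 3];
    }
    return total;
}
```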
6. It should be faster to use temporary variables, but the same applies to code in C; there is no difference in speed between the languages. Believe me, a loop which iterates with a pointer (or an iterator in C++) is always faster than iterating with an integer and using [].

7. The prefix ++ is always faster than the postfix ++, because the postfix operator has to save its value temporarily, increment, and return the saved value. The prefix ++ does not need a temporary object.
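The cost difference only shows up for class types. For a hypothetical counter with overloaded increment operators (my example, written in the canonical C++ form), the postfix version must copy the object before incrementing:

```cpp
// Canonical operator++ pair: prefix returns a reference to the incremented
// object; postfix must copy the old state, increment, and return the copy.
struct Counter {
    int value;
    Counter& operator++() { ++value; return *this; }  // prefix: no copy
    Counter operator++(int)                           // postfix: copies first
    {
        Counter old = *this;
        ++value;
        return old;
    }
};
```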

martin
11. Canta como rafaé (Barcelona)
Hi

So, as I understand it, points 1, 2, 4, 5 and 7 rely on compiler optimizations. Is it so difficult to make them by hand, in case the compiler can't handle those optimizations? So don't say your code is faster; say your compiler's code is faster.

On point number 3 you might be right, for an Xbox (PIII). But before I believe those statistics, check them against Intel's official clock-tick counts for every instruction.

In point 6 I meant that having an array 'int * a' and accessing it as '*act' is faster than an array of objects accessed as 'act -> x'.

Point 7 is nonsense. If you are not assigning i++ to any variable, there is no assignment, nor any backup of the previous value.
12. Junior Member (joined Mar 2002)
I do not understand your first point. Of course all code relies on compiler optimisations. But compilers are dumb, and the above code makes it easier for them to generate fast code.

Yes, point 7 is wrong: "i++ can be slower than or as fast as ++i, so one should prefer ++i" would be correct.
Whenever you use i++ and assume it is as fast as ++i, you are relying heavily on your compiler.

BTW: only trust benchmarks; never assume anything.
13. Canta como rafaé (Barcelona)
If you always rely on benchmarks, you will fall into ignorance.

The increment discussion: Let's make some tests

Code:
```
int main (void) {

    register int i = 0;
    register int j;

    /* i ++ */
    i ++;

    /* ++ i */
    ++ i;

    /* j = i ++ */
    j = i ++;

    /* j = ++ i */
    j = ++ i;

    return 0;
}
```
Now let's compile with

Code:
`gcc testcode.c -S -o testcode.asm -O0 -Wall`
Note that I've disabled all compiler optimizations with -O0.
The compiler's assembler output is:

Code:
```
main:
    pushl   %ebp
    movl    %esp, %ebp
    subl    $4, %esp

    /* register int i = 0; */
    movl    $0, %eax

    /* i ++ */
    incl    %eax

    /* ++ i */
    incl    %eax

    /* j = i ++ */
    movl    %eax, -4(%ebp)  /* Saves the value of i in j */
    incl    %eax            /* Increments register i */

    /* j = ++ i */
    incl    %eax            /* Increments the value of i */
    movl    %eax, -4(%ebp)  /* Saves the value of i in j */

    /* return 0 */
    movl    $0, %eax
    leave
    ret
```
j is stored in -4(%ebp)

Now, please, tell me the difference between preincrement and postincrement.

Benchmarks are good tools for analysing the overall timing of an application, but relying on them to assert that something is the fastest way is a silly thing. A program is a program plus its environment, and that is what benchmarks measure.

If you suspected that I used optimizations, this is the output with the first level of optimization:

Code:
```
main:
    pushl   %ebp
    movl    %esp, %ebp
    movl    $0, %eax
    popl    %ebp
    ret
```
This is a clever compiler, huh?
Last edited by Thrasher; March 28th, 2002 at 03:36 AM.
14. Junior Member (joined Mar 2002)
This discussion is getting boring. Have you ever overloaded the ++ operator? Ever used the STL extensively? Do you always know in advance what a good compiler generates? g++, Visual C++, Visual Age, Intel, and even KAI produce very different output.
(My last posting here)
15. Contributing User (Oxford, UK)
Martinus:

Before you assert that code is faster, it is an absolute necessity to get timing results from it. I am not saying that my code was unequivocally faster, merely that I tried various combinations of form, and the one I put up was the one I found to have the fastest timing results.

Having said this, I would be very surprised if your code were faster, the reason being that the more if statements you have, the more likely you are to get branch-prediction misses.

Thrasher: Nice tip on abs - but the above comment applies. I would try it out, and time both, but I can't be arsed.