#16
  1. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2002
    Location
    Oxford, UK
    Posts
    28
    Rep Power
    0

    Fast Distance calculation


    Hi,

    I noticed that a lot of people have been advocating multipass solutions to this problem, i.e.:

    1) Quick Reject in X
    2) Quick Reject in Y
    3) Distance calculation

    I would, however, be rather surprised if it wasn't faster to do a single-pass test based on the squared magnitude of the distance:

    Code:
    if(dx*dx + dy*dy > rad2)
    {
       //...
    }
    I'll do some timing results and see.
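
    A minimal sketch of one way to time this, assuming random points in 0...1000 and a test radius of 10 (the repeat count, the clock()-based timing and all names here are placeholders, not figures from this thread):

    Code:
    #include <cstdio>
    #include <cstdlib>
    #include <ctime>

    int main()
    {
        const int NUM_POINTS = 1000;    /* placeholder sizes */
        const int REPEATS    = 100000;  /* 10^8 tests in total */
        const int RAD2       = 10 * 10; /* radius 10, squared */

        static int arrayx[NUM_POINTS], arrayy[NUM_POINTS];
        for (int i = 0; i < NUM_POINTS; i++)
        {
            arrayx[i] = rand() % 1001;  /* random values in 0...1000 */
            arrayy[i] = rand() % 1001;
        }
        int px = rand() % 1001, py = rand() % 1001;

        int count = 0;
        clock_t start = clock();
        for (int r = 0; r < REPEATS; r++)
            for (int ctr = 0; ctr < NUM_POINTS; ctr++)
            {
                int dx = px - arrayx[ctr];
                int dy = py - arrayy[ctr];
                if (dx*dx + dy*dy < RAD2) count++;
            }
        clock_t stop = clock();

        printf("count=%d, time=%.3f s\n", count,
               (double)(stop - start) / CLOCKS_PER_SEC);
        return 0;
    }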
  2. #17
  3. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2002
    Location
    Oxford, UK
    Posts
    28
    Rep Power
    0

    Results.


    Basically, a single magnitude squared test turns out to be about 20-25% faster than the two-layer test, on a P4 under VC++ 6.0.

    Time for 10^8 distance rejection/acceptance calculations was:

    1) Magnitude squared: 917084 / (1.19 * 10^6) s

    2) Two-layer: 1286742 / (1.19 * 10^6) s

    The code follows. arrayx[] and arrayy[] are initialised to random values in the range 0...1000, as are px and py.

    Magnitude squared:
    Code:
    for(ctr = 0, count=0; ctr < 1000; ctr++)
    {
   dx = px-arrayx[ctr];
       dy = py-arrayy[ctr];
       if(dx*dx + dy*dy < 100)
       {
          count++;
       }
    }
    Two-layer + magnitude squared:
    Code:
    for(ctr = 0, count = 0; ctr < 1000; ctr++)
    {
       dx = px-arrayx[ctr];
       dy = py-arrayy[ctr];
       if(dx > -10 && dx < 10 && dy > -10 && dy < 10)
       {
          if(dx*dx+dy*dy < 100) count++;
       }
    }
    Neither of these is safe if dx*dx + dy*dy can exceed 0x7FFFFFFF. Options for making it safe include using 64-bit integers for the mag^2 calculation, or switching to double-precision numbers.
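
    A sketch of the 64-bit option, assuming a compiler with a 64-bit integer type (VC++ 6.0 spells it __int64, gcc long long):

    Code:
    for(ctr = 0, count = 0; ctr < 1000; ctr++)
    {
       /* widen before multiplying so dx*dx + dy*dy cannot overflow 32 bits */
       long long dx = px - arrayx[ctr];
       long long dy = py - arrayy[ctr];
       if(dx*dx + dy*dy < 100)
       {
          count++;
       }
    }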

    The timing results for the double-precision case are even more skewed in favor of magnitude squared only:

    10^8 double precision distance tests:

    1) Mag-squared = 1.5582 / 1.19 s

    2) Two-layer = 4.2211 / 1.19 s
    Last edited by The Ostrich; March 13th, 2002 at 09:32 PM.
  4. #18
  5. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2002
    Location
    Oxford, UK
    Posts
    28
    Rep Power
    0
    I figure that the 3x discrepancy in the timing results for the double-precision case is probably because either dx or dy is being spilled to / reloaded from memory more often.

    Or it could just be that the more complex condition breaks the branch prediction.
  6. #19
  7. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2002
    Location
    Oxford, UK
    Posts
    28
    Rep Power
    0

    Humble Pie


    Whoops.

    If I replace that UGLY if statement with

    Code:
    //integer version
    if(abs(dx) < 10 && abs(dy) < 10) ...
    Code:
    //double precision
    if(fabs(dx) < 10 && fabs(dy) < 10) ...
    Then the axis-aligned bounding box is faster. Duh. After that change and some inner-loop optimisation, we get:

    Integer:

    1) Mag^2 test: 9.4 / 1.19 s

    2) Two-level: 6.7 / 1.19 s

    Double Prec

    1) Mag^2 test: 1.52 / 1.19 s

    2) Two-level: 9.2 / 1.19 s
    Last edited by The Ostrich; March 13th, 2002 at 09:36 PM.
  8. #20
  9. No Profile Picture
    Junior Member
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2001
    Posts
    28
    Rep Power
    0
    so... what you're saying is... the two loops are faster if you're dealing with integers, and one loop is faster if dealing with doubles (like me)?
  10. #21
  11. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2002
    Location
    Oxford, UK
    Posts
    28
    Rep Power
    0
    No. One loop which does two tests is faster in both cases, provided that you use the abs() and fabs() macros / functions. What I mean is:

    Code:
    N = ...
    N2 = N*N;
    
    for(ctr = 0; ctr < thingy; ctr++)
    {
        dx = ...
        dy = ...
        if(fabs(dx) < N && fabs(dy) < N)
        {
            if(dx*dx + dy*dy < N2) ...
        }
    }
    I believe that this form, and this form only, is the fastest one.

    Important notes:

    1) There is only one loop
    2) fabs(n) is used
    3) I never take a square root.
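
    Filled out with concrete values (N = 10 is an arbitrary radius; the array names and the count of 1000 follow the earlier posts), the form reads:

    Code:
    #include <cmath>   /* for fabs() */

    int countNear(const double *arrayx, const double *arrayy,
                  double px, double py)
    {
        const double N  = 10.0;
        const double N2 = N * N;   /* compare against the square; no sqrt */
        int count = 0;

        for(int ctr = 0; ctr < 1000; ctr++)
        {
            double dx = px - arrayx[ctr];
            double dy = py - arrayy[ctr];
            if(fabs(dx) < N && fabs(dy) < N)       /* cheap box reject */
            {
                if(dx*dx + dy*dy < N2) count++;    /* exact squared test */
            }
        }
        return count;
    }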
    Last edited by The Ostrich; March 14th, 2002 at 08:18 PM.
  12. #22
  13. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2001
    Location
    Dublin
    Posts
    413
    Rep Power
    14
    I've no idea of the speed gains / tradeoffs here, but you could also complement the quick rejection with a quick acceptance before resorting to the distance formula. For this two-dimensional example, 50% of the unknown area (the area not rejected by testing the axes individually) can be quickly accepted.

    Quick reject if the distance is > 20 on either axis
    Quick accept if the distance is < 20/√2 (≈ 14.14) on both axes

    (I think - off the top of my head)

    Again, I don't know if this speeds things up, but I'd guess so; a sketch of the idea follows.
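
    A minimal sketch of the three-way test (the radius of 20 is from the post above; the function shape and names are illustrative):

    Code:
    #include <cmath>

    /* Quick reject, then quick accept (inscribed square), then exact test. */
    bool isNear(double dx, double dy)
    {
        const double R     = 20.0;
        const double R2    = R * R;
        const double INNER = R / sqrt(2.0);  /* half-side of inscribed square */

        if (fabs(dx) > R || fabs(dy) > R)         return false;  /* reject */
        if (fabs(dx) < INNER && fabs(dy) < INNER) return true;   /* accept */
        return dx*dx + dy*dy < R2;             /* exact test for the rest */
    }
    Any point with both |dx| and |dy| below R/√2 satisfies dx² + dy² < R²/2 + R²/2 = R², which is why the inner square can be accepted without the multiplications.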
  14. #23
  15. No Profile Picture
    Junior Member
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2002
    Posts
    6
    Rep Power
    0
    I believe that this form, and this form only, is the fastest one.
    Wrong, at least when you write C++ code.

    This code should be a little faster:
    Code:
    const float maxDist=20;
    const float maxDist2 = maxDist*maxDist;
    
    for (pointsDatatype* cur = points.begin(); cur != points.end(); ++cur)  // begin()/end() assumed to return raw pointers
    {
        float distx = testPoint.x - cur->x;
        if (fabs(distx) > maxDist) continue;
        float disty = testPoint.y - cur->y;
        if (fabs(disty) > maxDist
        || (distx*distx + disty*disty) > maxDist2) continue;
        // if we have reached this far, the point is near
        // do something now ...
    }
    1. There is no need to calculate disty when the point can already be rejected by calculating distx
    2. Use pointers instead of an integer and the [] operator
    3. Use ++cur instead of cur++

    This code is also not the fastest; you can do loop unrolling, which should be at least another 20% or so faster.

    Do you need to do this calculation for every point? If so, there are some better algorithms to do this test (a sketch of one follows).
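
    One family of "better algorithms" for repeated queries is spatial partitioning. A minimal sketch of a uniform grid (all names, the cell-size requirement and the 3x3 scan are illustrative assumptions, not code from this thread):

    Code:
    #include <vector>
    #include <cstddef>

    struct Point { float x, y; };

    /* Bucket the points once; each query then scans only the cells near the
       test point.  Requires cellSize >= maxDist so that a 3x3 block of cells
       covers the whole query disc. */
    struct Grid
    {
        float cellSize;
        int   cols, rows;
        std::vector< std::vector<Point> > cells;

        Grid(float worldW, float worldH, float cell)
            : cellSize(cell),
              cols(int(worldW / cell) + 1),
              rows(int(worldH / cell) + 1),
              cells(cols * rows) {}

        void insert(const Point& p)
        {
            cells[int(p.y / cellSize) * cols + int(p.x / cellSize)].push_back(p);
        }

        int countNear(float px, float py, float maxDist) const
        {
            const float maxDist2 = maxDist * maxDist;
            const int cc = int(px / cellSize), cr = int(py / cellSize);
            int count = 0;
            for (int r = cr - 1; r <= cr + 1; ++r)
                for (int c = cc - 1; c <= cc + 1; ++c)
                {
                    if (r < 0 || r >= rows || c < 0 || c >= cols) continue;
                    const std::vector<Point>& cell = cells[r * cols + c];
                    for (std::size_t i = 0; i < cell.size(); ++i)
                    {
                        float dx = px - cell[i].x, dy = py - cell[i].y;
                        if (dx*dx + dy*dy < maxDist2) count++;
                    }
                }
            return count;
        }
    };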
    Last edited by martinus; March 26th, 2002 at 06:14 PM.
  16. #24
  17. No Profile Picture
    Canta como rafaé
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2001
    Location
    Barcelona
    Posts
    74
    Rep Power
    14
    Well, if we are talking about fast code, this is far from the fastest code.

    Code:
    const float maxDist=20;
    const float maxDist2 = maxDist*maxDist;
    
    for (pointsDatatype* cur = points.begin(); cur != points.end(); ++cur)  // begin()/end() assumed to return raw pointers
    {
        float distx = testPoint.x - cur->x;
        if (fabs(distx) > maxDist) continue;
        float disty = testPoint.y - cur->y;
        if (fabs(disty) > maxDist
        || (distx*distx + disty*disty) > maxDist2) continue;
        // if we have reached this far, the point is near
        // do something now ...
    }
    I don't have time to write new code, but I see that

    1 - You are using "const float" instead of "#define". Do you think this is an improvement? Every time you do if (... > maxDist), the program has to go to the data area, fetch the value, and then compare two registers, instead of simply comparing against a constant value.

    2 - You are calling the function "points.end()" on every iteration of the "for" loop. Isn't it more efficient to keep that value in a register variable, to avoid the delay of a function call?

    3 - For the first rejections, I wouldn't use float values. It doesn't matter if the rejection checks are not as precise as with floats; it's worth using integers.
    Code:
    Example (integer version in parentheses): 23.7 (23) - 3.2 (3) = 20.5 (20) => rejected (not rejected)
    But the number of such mistakes would be very small, and they would be caught by the real distance test anyway.

    4 - With the use of integers, you can always replace the abs() call with a sign-bit test:
    Code:
    if (res & 0x80000000) res = -res; /* negate directly instead of calling abs() */
    and avoid another function call (see the sketch after this list).

    5 - Loop unrolling only works when the size of the array is finite.

    6 - If you use C++, all of those testPoint.x accesses are more costly than iterating over an array in C.

    7 - And the last question: why is it more efficient to use ++cur instead of cur++?

    The only real improvement in your code is not calculating disty when the point can already be rejected by distx.

    Where are the days when, if we wanted speed, we went straight to assembler? But if you don't know assembler, you can always try -O3 in gcc and hope for a bit of luck.
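
    Taking point 4 one step further, the remaining branch can also be removed. A sketch of the classic branchless abs, assuming that >> on a negative int is an arithmetic shift (true for gcc and VC++ on x86):

    Code:
    int mask = res >> 31;        /* 0 if res >= 0, all ones if res < 0 */
    res = (res ^ mask) - mask;   /* flip the bits and add 1 only when negative */
    When res is negative, mask is all ones, so res ^ mask is ~res and subtracting -1 adds 1, giving -res; when res is non-negative, nothing changes.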
  18. #25
  19. No Profile Picture
    Junior Member
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2002
    Posts
    6
    Rep Power
    0
    1. const float: with every modern compiler this should be as fast as a #define. One should not use #defines in C++ code; a const the compiler inlines is much cleaner and just as fast.

    2. There is no function-call overhead; this method is inlined and does not need any calculation.

    3. Have a look at this table: Relative Performance of Common Operations (PIII-733).
    Float * and + operations are as fast as the integer ones, so you should stick to floats. Converting from float to int and back would also cost time.

    4. fabs() should also be inlined; it compiles to a single assembler instruction that just clears the sign bit.

    5. I have never seen an infinite array; probably I do not have enough RAM.
    If you mean that unrolling only works for fixed-size arrays, have a look at this code:
    Code:
    int i;

    /* handle the max % 4 leftover iterations first ... */
    for (i = 0; i < max % 4; ++i)
        doSomething();

    /* ... then the remaining count is a multiple of 4 */
    for (; i < max; i += 4) {
        doSomething();
        doSomething();
        doSomething();
        doSomething();
    }
    6. It could be faster to use temporary variables, but the same applies to C code; there is no difference in speed. Believe me, a loop which iterates with a pointer (or an iterator in C++) is always faster than incrementing an integer and using [].

    7. The prefix ++ is always faster than the postfix ++, because the postfix operator has to save its value in a temporary, increment itself, and return the temporary value. The prefix ++ does not need a temporary object.
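
    To illustrate point 7 for class types, here are the canonical increment overloads for a toy iterator (an illustrative sketch, not code from this thread):

    Code:
    /* Why ++it can be cheaper than it++ when the iterator is a class type. */
    struct Iter
    {
        int* p;

        Iter& operator++()        /* prefix: increment and return self */
        {
            ++p;
            return *this;
        }

        Iter operator++(int)      /* postfix: must copy the old value */
        {
            Iter old = *this;     /* the extra temporary object */
            ++p;
            return old;           /* returned by value */
        }
    };
    For built-in types with the result unused, an optimising compiler removes the difference, as the assembler listing further down the thread shows.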

    martin
  20. #26
  21. No Profile Picture
    Canta como rafaé
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2001
    Location
    Barcelona
    Posts
    74
    Rep Power
    14
    Hi

    So, as I understand it, points 1, 2, 4, 5 and 7 rely on compiler optimizations. Is it so difficult to do them by hand, in case the compiler can't manage those optimizations? So don't say your code is faster; say your compiler's code is faster.

    In point number 3 you might be right, for an Xbox (PIII). But before believing those statistics, check them against the official Intel clock-tick counts for every operation.

    In point 6 I wanted to say that having an array 'int *a' and accessing it as '*act' is faster than having an array of objects and accessing the values as 'act->x'.

    Point 7 is nonsense. If you are not assigning i++ to any variable, there's no assignment and no backing up of the previous value.
  22. #27
  23. No Profile Picture
    Junior Member
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2002
    Posts
    6
    Rep Power
    0
    I do not understand your first point. Of course all code relies on compiler optimisations. But compilers are dumb, and the above code makes it easier for them to generate fast code.

    Yes, point 7 is wrong as stated: "i++ can be slower than or as fast as ++i, so one should prefer ++i" is the correct version.
    Whenever you use i++ and assume it is as fast as ++i, you are relying heavily on your compiler.

    BTW: Only trust benchmarks, never assume anything
  24. #28
  25. No Profile Picture
    Canta como rafaé
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2001
    Location
    Barcelona
    Posts
    74
    Rep Power
    14
    If you always rely on benchmarks, you will fall into ignorance.

    The increment discussion: let's run some tests.

    Code:
    int main (void) {
    
            register int i = 0;
            register int j;
    
            /* i ++ */
            i ++;
    
            /* ++ i */
            ++ i;
    
            /* j = i++ */
            j = i ++;
    
            /* j = ++ i */
            j = ++ i;
    
            return 0;
    }
    Now let's compile with

    Code:
    gcc testcode.c -S -o testcode.asm -O0 -Wall
    Note that I've disabled any compiler optimizations with -O0.
    The compiler's assembler output is:

    Code:
    main:
            pushl   %ebp
            movl    %esp, %ebp
            subl    $4, %esp
            
            /* register int i = 0; */
            movl    $0, %eax
    
            /* i ++ */
            incl    %eax
    
            /* ++ i */
            incl    %eax
    
            /* j = i ++ */
            movl    %eax, -4(%ebp)  /* Saves the value of i in j */
            incl    %eax            /* Increments register i */
    
        /* j = ++ i */
            incl    %eax            /* Increments the value of i */
            movl    %eax, -4(%ebp)  /* Saves the value of i in j */
    
            /* return 0 */
            movl    $0, %eax
            leave
            ret
    j is stored in -4(%ebp)

    Now, please, tell me the difference between preincrement and postincrement.

    Benchmarks are good tools for analyzing the overall timing of an application, but relying on them to prove that something is the fastest way is silly. A program is the program plus its environment, and that's what benchmarks measure.

    If you suspected that I used optimizations, this is the output with the first level of optimization:

    Code:
    main:
            pushl   %ebp
            movl    %esp, %ebp
            movl    $0, %eax
            popl    %ebp
            ret
    This is a clever compiler, huh?
    Last edited by Thrasher; March 28th, 2002 at 03:36 AM.
  26. #29
  27. No Profile Picture
    Junior Member
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2002
    Posts
    6
    Rep Power
    0
    This discussion is getting boring. Have you ever overloaded the ++ operator? Ever used the STL extensively? Do you always know in advance what a good compiler will generate? g++, Visual C++, Visual Age, Intel or even KAI produce very different output.
    (My last posting here)
  28. #30
  29. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2002
    Location
    Oxford, UK
    Posts
    28
    Rep Power
    0
    Martinus:

    Before you assert that code is faster, you absolutely must get timing results from it. I am not saying that my code was unequivocally faster, merely that I tried various combinations of form, and the one I put up was the one I found to have the fastest timings.

    Having said this, I would be very surprised if your code were faster, the reason being that the more if statements you have, the more likely you are to suffer branch mispredictions.

    Thrasher: Nice tip on abs - but the above comment applies. I would try it out, and time both, but I can't be arsed.
