As an aside, Thrasher is quite correct in asserting that point 7 is rubbish. As far as I can tell, on most modern processors, i++, ++i, i+j and i-j will each execute in 1 cycle, provided that the operands are in registers. Are you basing this assertion on the performance of an interpreted language? I might believe it if the results were for Forth or something similar. (I am, of course, talking about 32-bit integers on 32-bit processors.)
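
If you want to check this for yourself, something like the following would do. It is only a sketch: the function names are mine and have nothing to do with the code under discussion, and it proves nothing about cycle counts on its own. The idea is that the two loops differ only in i++ versus ++i, so you can compile with something like `gcc -O2 -S` and compare the assembly emitted for each.

```c
#include <stddef.h>

/* Hypothetical helpers for illustration, not from the original code. */

/* Sum an array using post-increment on the loop counter. */
long sum_post(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* The same loop using pre-increment on the loop counter. */
long sum_pre(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}
```

Since the value of the increment expression is thrown away in both loops, I would expect an optimising compiler to treat them identically, but that is an expectation to verify on your own compiler, not a guarantee.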

I would also suggest that the omission of dy will result in a slowdown. I only say this because I tried it, and the slowdown was of the order of 20%. To see why, you would no doubt have to examine the assembler produced by the compiler, which is another thing that I can't be arsed to do.

I suggest that, using the same data structures I used, you try writing a version that runs faster, and do a time comparison. I would be surprised if you did manage to come up with anything faster though. Look upon it as a challenge.
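
For the time comparison, a rough harness along these lines should be enough. This is a sketch under my own assumptions: `time_calls`, `bench_fn` and `bump_counter` are made-up names, and `bump_counter` is just a placeholder for whichever routine you are timing. It uses the standard `clock()` from `<time.h>`, which measures CPU time and is fairly coarse, so use plenty of repetitions.

```c
#include <time.h>

/* Hypothetical harness: the names below are illustrative only. */
typedef void (*bench_fn)(void *arg);

/* Call fn(arg) `reps` times and return the CPU time used, in seconds. */
double time_calls(bench_fn fn, void *arg, long reps)
{
    clock_t start = clock();
    for (long i = 0; i < reps; i++)
        fn(arg);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Placeholder workload standing in for one of the routines being compared.
   The volatile access stops the compiler from optimising the work away. */
void bump_counter(void *arg)
{
    volatile long *counter = arg;
    (*counter)++;
}
```

Wrap each version in a function with the `bench_fn` signature, run both through `time_calls` with the same data and the same number of repetitions, and compare the two figures. Run it several times, since a single measurement can easily be thrown off by whatever else the machine is doing.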

As a further aside, I have found, through extensive tests, that putting conditional continues in an inner loop nearly always results in a slowdown, with the added drawback of messing up otherwise neat code. I only say this because I used to do exactly that.
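
To show the sort of thing I mean, here is a made-up example (not the code from this thread, and the function names are hypothetical): two ways of summing the positive entries of an array, one with a conditional continue in the loop and one without.

```c
#include <stddef.h>

/* Hypothetical example for illustration only. */

/* Version with a conditional continue in the inner loop. */
long sum_positive_continue(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] <= 0)
            continue;   /* skip non-positive entries */
        s += a[i];
    }
    return s;
}

/* Same result without the early exit: non-positive entries add zero. */
long sum_positive_plain(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += (a[i] > 0) ? a[i] : 0;
    return s;
}
```

Both return the same answer. Which one is quicker will depend on the compiler and on how predictable the data is, so time both on your own setup before believing me or anyone else.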