March 31st, 2013, 12:27 PM

Help Implementing NewtonRaphson Division
I've been working on functions which can perform arithmetic on arbitrarily large unsigned integers for sometime now. I've got add, subtract, multiply, and division with remainder all working but the division function is too slow. I would like to implement NewtonRaphson division but am having a hard time finding example code that would show me how to do it with arbitrarily large unsigned integers. The algorithm involves using the reciprocal of the denominator but these are not floating point numbers that I'm working with. Wikipedia showed an example of how to divide by three with the NewtonRaphson algorithm using integers but I am unable to figure out how to apply it to all denominators. The way I'm representing my large numbers is basically with an array of 32bit unsigned integers with bits starting at the end of the array and going to the left and a pointer keeps up with where the number begins in the array. So, 2^32, which is too large for a 32bit unsigned integer to represent by one, would be represented in the array as
startPtr>[00000000000000000000000000000001][00000000000000000000000000000000]
in binary. I don't know if that's bigendian or littleendian. Anyhow, if anyone out there can help me figure this out I'd appreciate it.
ps) I know that there are libraries out there like GMP which I could use for this stuff but I really want to learn how to do it myself.
March 31st, 2013, 02:19 PM

What you're looking for, is called "fixed point." We oldtimers used to apply this technique quite a lot.
Here's an example, in decimal. Suppose I want to divide 987654321 by 3. At the same 9digit precision, I represent the reciprocal of 3 as 333333333. There is an implicit decimal point (or in the usual case with computers, binary point) to the left of the "9 digit frame" chosen for this computation.
The product (dividend X reciprocal) is 329218106670781893, with the implicit point 9 digits from the right (the same rules used when multiplying decimals with pencilandpaper), so this is interpreted as 329218106.670781893
For simplicity, you could take 329218106 as the quotient, or apply rounding and use 329218107, which happens to be the exact quotient in this case. (Rounding is simple in binary, it's necessary only to test the most significant bit after the binary point.)
You can use this technique with NewtonRaphson to get your reciprocal. Note that unless you are quite careful, the quotients obtained with this technique may differ from those with your divisionwithremainder.
__________________________
Please post back here, when you've done this. I'd like to see how the division speeds compare (for various sized inputs) between the two techniques!
March 31st, 2013, 04:12 PM

So........
Alright, I see what you're saying. But what if my divisor is an arbitrarily large number? How would I figure out what the sequence of digits would be in the reciprocal so I can apply your fixed point technique?
March 31st, 2013, 05:57 PM

Well, of course, you don't need to figure out the sequence of digits  the application of the NewtonRaphson iteration, is to compute the reciprocal!
But NewtonRaphson does need an "initial guess." Because the iteration converges so rapidly, this guess doesn't need to be very good. For example, if the divisor is n bits long, place a 1bit that many n1 to the right of the binary point.
Here's an example, with all numbers in binary, and the problem size assumed to be 16 bits:
divisor = 0000000000010011 (19 decimal)
The divisor is 5 bits long, so the guess is
r0 = 0000100000000000
With the binary point at the left, this is 1/32. It's a pretty rough approximation to 1/19, but is guaranteed to be within a factor of 2. If instead of a 1 you place the pattern 1011 the maximum error is a factor of 1.5.
The iteration converges quadratically, so each time you have twice as many "good bits" as before. If the dividend is m bits long, and the initial guess is done by the crude technique above, the required number of iterations is the greatest integer not less than log2(m). For example, if m=10,000 bits, then the log is about 13.3, so 14 iterations will suffice.
The guess can be made better, to save some iterations. If you think about it for a while, the scaling (placement of binary point position) for the divisor can readily be chosen so that reciprocal will be between 1/2 and 1. In this interval, the formula 48/17  D*32/17 is accurate to just over 4 bits ... and requires no division (D is the divisor divided by the next greater factor of 2, so it is between 1/2 and 1). Using this initial guess, the number of iterations needed is exactly two less than given by the formula in the preceding paragraph.
March 31st, 2013, 06:03 PM

As a footnote, since this is the security forum...
While it's interesting, fun and educational to create a big number library, it is seriously dangerous to use that library for any cryptographic whose security is of consequence.
Even if you have tested the library as heavily as you can imagine, it may contain bugs that show up only under extremely specific circumstances. And any arithmetic error in the derivation of (for example) an RSA or DH key, is likely to result in a situation where any attacker can easily read the secrets, forge signatures, etc.
March 31st, 2013, 06:23 PM

Thank you
Thanks for your help. I'm going to try to implement what you've described. I'll post the results of the improvements when I'm done. I should tell you however that I'm using a shift/add algorithm for the multiplication and plan to speed that up using FFT methods as soon as I can figure out how it works. The division will really speed up after I've got that done. If I need help with that can I come back here and ask you?
I'm not really planning on distributing my bignum package or cryptography function I write once its implemented, except maybe to a couple of friends (and I'll warn them not to rely on it). This is pretty much for my personal learning benefit. But, if what you say is true, couldn't the same argument be applied to cryptographic software out there that makes use of packages like GMP? I mean, I know those guys are smart but they are human and capable of mistakes too, right?
March 31st, 2013, 06:55 PM

Probably you already know, that the choice of optimal algorithm depends on number size. Some libraries embody multiple algorithms, automatically applying that which is most efficient for the given arguments. For example (if I recall correctly), FFT multiplication is only efficient for very large multiplicands. The crossover, which will depend on implementation and the underlying CPU architecture, is at least 10,000 decimal digits.
Similarly, Karatsuba multiplication (an optimization of conventional multiplication), best for integers below the FFT threshold, is actually slower for smaller integers, with the crossover being in the realm of perhaps 1000 decimal digits (again, very much dependent on implementation).
For the integer sizes presently used in public key cryptography (not more than about 1000 decimal digits), simple multiplication is likely to be as fast as can be done on a standard computer.
_______________________
Have you implemented modular exponentiation yet? This is the "fundamental computation" of most publickey cryptography operations. It can be sped up quite a lot, by various methods of analyzing the exponent, and using Montgomery multiplication.
March 31st, 2013, 07:05 PM

Vulnerability of Math Libraries
Yes, even the widely used libraries out there, like GMP and those used in openSSL, could contain subtle flaws.
Such libraries have two advantages: they have been tested more extensively than will be feasible for any homegrown library; and their source code has been inspected by numerous knowledgeable and experienced software developers.
But even so, their reliability in every possible case cannot be guaranteed.
Bruce Schneier, the wellknown "security guru," has suggested incorporating TWO libraries, and performing all computations in duplicate. Since it seems unlikely that two independent implementations would have identical flaws, a defect is likely to be revealed by divergent results.
Schneier also recommends that all publickey cryptography computations be checked by a "reverse computation." Basically, this is like checking pencilandpaper subtraction by adding the difference to the subtrahend, which should match the original number. The form of the check computation will depend on the cryptographic computation, but in general this is not very complicated, and can increase confidence in the software's security.
March 31st, 2013, 11:23 PM

Hi again. Actually, I have implemented a modular exponentiation function. It was with that function I realized my divide function is too slow, since each time through the loop in the function I reduce the the number obtained from the multiplication modulus whatever. I was using my divide function to do that since it returns both quotient and remainder. I plan now to use NewtonRaphson for obtaining the quotient of a division and Barrett's algorithm (my namesake, hehe, my last name is Barrett) for modulus operations.
It sounds like from what you're saying that for multiplication alone what I'm using is alright. I don't think I'll be working with numbers larger than 1000 decimal digits. I would like to learn about how to use Montgomery multiplication though for modular exponentiation.
I actually have a book by Bruce Schneier on cryptography. I haven't read the entire thing yet though. I find this stuff really interesting. Anyhow, it may be a week or two before I can get the division function updated to using NewtonRaphson. I have a good deal of work to do with school. I'm a grad student majoring in Computer Science at the University of Alabama in Huntsville. I'm not sure if they offer a course in cryptography. When I get the function working I'll measure the results as compared to the old way and let you know how much it sped things up. Thank you for all your help, and if you could tell me about how Montgomery multiplication works and how to use it with modular exponentiation I'd be oh so appreciative. Again, thank you so much.
April 1st, 2013, 01:04 AM

Ah, I see the logic of the path you've gone down.
No doubt, with an optimized divide, a nonMontgomery modular exponentiation will run faster. Because you are interested in the remainder, it is essential that you take into account the potential that the NewtonRaphson quotient could be off by one.
But as an alternative to speeding up the divide operations, it's possible to get rid of almost all of them!
For example, imagine a 1024bit modular exponentiation, done using the standard squareandmultiply algorithm in which each bit of the exponent is tested in succession. There will be about 1000 squarings, and on average about 500 multiplies, with a total of 1500 modular reductions (divisions).
The "trick" of Montgomery multiplication, is that it allows you to choose the modulus to be whatever natural number you like. So on a digital computer, the modulus may be set (for example) to 2^(32*n). In this way, the modular reduction after multiplication consists of throwing away the more significant half of the product ... you just use half of the data words!
Also, the squarings can be optimized to take advantage of the symmetry of partial products when a number is multiplied by itself.
So now we have about 1500 multiplies, most of them optimized, with no divide during the iteration. It needs one divide at the beginning (to convert to Montgomery representation) and one at the end, to convert back.
_____________________________________________
But it gets even faster: there is a simple technique to optimize exponent processing. Suppose (for example) we divide the exponent into a sequence of 4bit groups. The power represented by this group must be in the range from 0 to 15. So before starting the squareandmultiply iteration, we precompute an array of these 16 powers of the base. (Note: the 0th power must also be computed; it of course corresponds to 1, but we will need it in Montgomery representation, so its value will decidedly NOT be 1.) So this adds one more division (total is now 3 divides), and about 16 multiplies.
In this example, a 1024bit modular exponentiation now requires 1024 optimized squares, 256 multiplies from the precomputed array, and 16 multiplies for precomputation. So the total computation is:
272 standard 1024bit multiplications
1024 optimized 1024bit squarings
3 divisions
_____________________________________________
When I learned about this stuff, one of my starting points was noticing that in Python (which supports arbitrary precision arithmetic), modular exponentiations were MUCH faster than my homegrown software. So I wondered, "how do they DO that?"
Somewhere along the way, I found the first page of Peter Montgomery's original paper, which is vastly clearer and simpler than anything else I saw online Those few paragraphs were enough for me to get the idea.
Looking at the Python integer math source code, gave me the concept of processing several exponent bits at a time. I don't remember, where I learned about optimized squaring.
I went down my road, because of the need to verify a digital signature on a very slow 16bit microprocessor.
Last edited by mah$us; April 1st, 2013 at 06:18 AM.
April 11th, 2013, 11:08 AM

Think its working
Hi there. Well, I decided to take a little bit of a break from school work today and I think I've got NewtonRaphson division working. You're right about it being off by one occasionally. Can I always be guaranteed that when it is off by one the quotient obtained is one more or one less than it should be? Or will that vary? I'll do some speed tests and compare it to my old division function pretty soon here and post the results. The division algorithm I was using previously can be found on the wikipedia webpage for division algorithms. Its the one titled "Integer division (unsigned) with remainder." I would post a link but since I'm a new user on this forum Dev Shed won't let me. Anyhow, again thank you so much for your help.
April 11th, 2013, 12:06 PM

One other thing...
Hey, I was wondering if you could give me a simple example of dividing the exponent up into groups of bits for modular exponentiation. I see how the precomputed array can give you the first 2^n powers of the base, where n is the number of bits in the group, but how do you precompute the powers above that number for the remaining groups of bits. I'm just not completely understanding how this is supposed to work. Also, another question. I've been coming up with random numbers in a way I'm not so sure is that random. If you give my random number function an upper bound on the number of bits you want in your random number what it'll do is start at the most significant bit and if rand() % 3 returns 1 it'll turn that bit on and move to the next significant bit and do the same thing and so on until its processed all the bits. So, each bit has a 1 in 3 chance of being turned on. Can you suggest a better way to generate random numbers? Thanks in advance.
April 11th, 2013, 05:03 PM

@vindy:
I'll respond about the exponents later, because I want to take a little time to make a clear example.
Concerning random numbers  please don't be offended, but the approach you described isn't even on the right planet, let alone the right neighborhood! For example, rand() % 3 == 1 evaluates True 1/4th of the time, not 1/3rd. rand() % 2 (or equivalently in C code, rand() & 1) should give equal numbers of 0 and 1 bits.
But much more deeply and significantly, in almost all environments, rand() is not random at all, and must never be used for cryptographic purposes!
Obtaining "random" numbers for cryptography is almost a science unto itself. It's a big topic! It's a topic that is very often misunderstood by people who are new to the field. And the work you've been doing on arithmetic computation is completely unrelated, so it's understandable that you are entering unfamiliar territory.
I recommend that you do some reading on the subject  one or two web searches should get you several articles that explain some of the problems and challenges.
For practical purposes, this is something programmers don't need to invent. Various platforms provide ways to get random bits that are supposed to be unpredictable enough for cryptography. Whether they are strong enough for the most securitycritical applications is at least debatable  but as far as I know, in general they are strong. For example, the Windows cryptography API has (as I recall) a function for randomized data generation. In must Unixtype systems, special character devices /dev/random and /dev/urandom give 8 bits of random data with each read. Note that these are NOT highly predictable sequences like those of library rand() functions.
If you insist on doing this yourself, as with your arbitrary precision library, expect to spend a similar amount of time and effort as you have already put in on the library, before getting a really worthwhile crypto random source.
April 11th, 2013, 06:09 PM

Hi again
Well, I had a feeling you might come back saying something like that. I strongly suspected that rand() was not appropriate for random number generation for cryptography purposes. It seems like what the real issue is, is coming up with a random number generator that is extremely difficult to predict or decipher. I mean, when you start arguing about what is truly random you start getting into philosophy. Anyhow, I want to thank you again for all your help.
I think what I'm going to do to speed test my two different implementations of division is divide a (large by 32bit standards, but very small by others) number by a small one digit number like 3 or something in a loop 1000, or maybe even 10,000 times with each implementation. That should be good enough to get an idea for how one compares to the other. I would do it with larger numbers but my old division function is so slow I might be here for over an hour waiting for it to finish. I'll post the results when I do.
April 11th, 2013, 06:57 PM

Speed tests
Hello again. I just did a little speed test with the two division functions. I divided (2^68  1) by 3 1000 times with each. The old division function took 12.156 seconds to complete. NewtonRaphson division took 4.563 seconds to complete. That's a pretty good improvement. There are some things I'm doing in the new division function which I think can be changed to speed it up a little bit more. But I'm happy with the speed difference.
I'm probably going to keep coming back here for a while and asking you questions if you don't mind. And again, thanks so much for all the help.