July 22nd, 2013, 05:02 AM
Data encrypted using pgcrypto having slowness when decrypt
I have followed the standard instruction using pgcrypto lib to enable the column encryption using public/private keys.
However I am having problem in decrypting the column data as it is very slow to run the decryption function using pgp_pub_decrypt when the data is very big (above 100k). Is there a way to fine tune the setting of PostgreSQL to have better performance when decrypting the data.
Kindly advise. Thank you very much.
July 22nd, 2013, 07:22 AM
Depends a lot on what you mean by "very slow", what your system setup is (what hardware do you have?) and, of course, the size and nature of the queries you are running over that data.
There's no magic setting that's going to make public key encryption snappy if you're sorting over unindexed plaintext values, for example. Most databases that claim to be secure hold the data unencrypted in temporary tables in memory (totally unsecure, depending on the nature of the threat you're defending against) but keep the disk data (mostly) encrypted. Some index over plaintext (again, unsecure) and answer from the indexes as much as possible. But all of these methods trade away the benefits of encryption to gain speed. So it really depends on why the data is encrypted and what you're trying to do.
It is often the case that you don't need to actually read the data, and dealing with the encrypted data directly is acceptable. Its also common to find that a public key signature over a hashed value or a symmetric key encrypted with a public one suits the need just fine -- and in these cases significant performance improvements can result (symmetric ciphers are way faster than public key systems, so the trick is to encrypt the symmetric key, and just decrypt that part with the slow public key, and zip through the payloads you need). Without knowing what kind of problem you're trying to solve its difficult to engage in anything other than vague talk about encrypted values in a database.
July 22nd, 2013, 10:00 AM
Thanks a lot for your advise.
Basically I am using the Mondrian with PostgreSQL table, and there is security needs to encrypt the measure columns (e.g.: unit & value).
Before implement the encrypted columns (via Public/Private keys), it is very fast to get the total value/unit using sum function:
select sum(unit) from my_table;
However after implemented the encrypted columns, it is very slow (more than 10 minutes with 100k rows) to get the total value/unit using sum function, "to_number" function is required as sum will not work with text data type:
select sum(to_number(pgp_pub_decrypt(unit, pg_read_binary_file('keys/secret.key')),'9999999999990.00'))from my_table;
July 22nd, 2013, 11:01 PM
Whoa yeah, that would be slow. What is happening is Postgres is looping over each record, decrypting, casting to a number, dumping the decrypted value (to prevent holing it in memory), adding the casted number to the accumulated sum, and then moving to the next row.
That's a lot of processing for an aggregate function. 10 minutes is a bit extreme, but the procedure may not be running in parallel (they could be made to do so, but I don't know that they are), but the problem can be attacked another way.
You can compute the running sum and store that in a historical sums table instead of using sum() every time you want to know how much of something there is*.
CREATE TABLE thingy
(code VARCHAR(20) PRIMARY KEY,
amount INTEGER DEFAULT 0 NOT NULL);
CREATE TABLE location
(code VARCHAR(20) PRIMARY KEY);
CREATE TABLE inventory
(location VARCHAR(20) REFERENCES location NOT NULL,
thingy VARCHAR(20) REFERENCES thingy NOT NULL,
amount INTEGER DEFAULT 0 NOT NULL,
PRIMARY KEY (location, thingy));
CREATE OR REPLACE FUNCTION sum_thingy()
RETURNS TRIGGER AS
IF TG_OP = 'INSERT' THEN
SET amount = amount + NEW.amount
WHERE code = NEW.thingy;
ELSIF TG_OP = 'UPDATE' THEN
SET amount = amount + (NEW.amount - OLD.amount)
WHERE code = NEW.thingy;
ELSIF TG_OP = 'DELETE' THEN
SET amount = amount - OLD.amount
WHERE code = OLD.thingy;
$$ LANGUAGE plpgsql;
CREATE TRIGGER update_thingy_sum
BEFORE INSERT OR UPDATE ON inventory
FOR EACH ROW EXECUTE PROCEDURE sum_thingy();
CREATE TRIGGER delete_thingy_sum
AFTER DELETE ON inventory
FOR EACH ROW EXECUTE PROCEDURE sum_thingy();
I've tacked the amount of thingies to the thingy table directly. This is not quite the usual way. Normally there would be a separate table of "thingy_counts" that tracks historical changes to the sum along with whatever other data is necessary for auditing, but I'm keeping the example simple**.
Anyway, the point here is that you can get thingy totals without actually running a sum() over them, and this saves you the ten minute query. It doesn't matter if you add encryption into this or not now, because you only have to decrypt a single value to find the current sum for any item in the thingy list.
The only case now where you need to do the ten minute operation is when verifying/rectifying the sums (just to make sure nobody did an UPDATE directly to the thingy table, getting it off track -- of course, Postgres has awesome permissions features that can operate per table, column and even row, so you can create permission rules to minimize this).
So checks like this
(SELECT thingy, SUM(amount) AS amount FROM inventory GROUP BY thingy)
WHEN t.amount = s.amount THEN TRUE
END AS validated
FROM thingy AS t JOIN sums AS s ON (t.code = s.thingy);
will still require that sum() operation (its hiding in the definition of "sums" in the CTE), but this can be run as a weekly/whenever batch operation instead of part of your regular query.
The example above should be fully functional. Hopefully I explained everything well enough for this solution to be useful to you. Just remember, we're pre-computing the sums incrementally over time. Your actual schema is no doubt more complex than the example above, but the point is that you can keep a running sum of anything as you go, and decrypt just that value when you need it instead of the marathon of decryption required for a sum() aggregate to run.
[* This is essentially the database version of the precomputate/lookup performance hack (or memoization or a litany of other names for variations on the same trick). The basic idea is that since we've known, at some point in the past, every relevant value then we should also know the output it maps to via whatever function we're using (in this case sum(to_number(pgp_pub_decrypt(unit, pg_read_binary_file('keys/secret.key')),'9999999999990.00'))), but we've forgotten it.
So we can trade space (the new row on the "thingy" table above) for time (the time to compute sum()) by recording the output and referencing that instead of running sum(). In this case we're making a tweak to the basic form, in that we're incrementally performing a sum operation (which is actually more expensive than doing it all at once) but breaking the execution of it up into tiny chunks over a huge span of time. The each execution chunk is acceptably small so we don't notice a performance hit to db writes, and the timespan is enormous (the life of your system). Overall there is a huge savings in that sum() need never be run unless you choose to verify the values in the precomputed table.
** It is common to make the thingy_counts table into a transaction log with a timestamp column so the PK would be (thingy, dtg) or since you probably already have a transaction log reference the transaction log table and make the PK for thingy (thingy, transfer) and get your timestamp information from the transfer log. That way you have a history of sums at any given point in time and a way to locate what transactions got you off track, if ever.]
Last edited by zxq9; July 24th, 2013 at 02:26 AM.
Reason: Fixed a boneheaded mistake in delete trigger
July 23rd, 2013, 11:13 PM
If I understand your explanation correctly, basically the technique you used here is to read the encrypted record in a smaller table instead of the original table with large amount of records.
However it is a issue also if user is trying to display the report with 10k of records, then I still need to decrypt the 10k of pre-calculated records.
Please correct me if I understand your explanation incorrectly. And thank you for your kind advice.
July 24th, 2013, 02:34 AM
We are performing the sum() incrementally, so you never run sum() again.
If you have 100k inventory records that cover 10k types of items, then yes you would have to decrypt whichever of the 10k pre-calculated records you want to look at, but this is much faster than decrypting and performing sum() on 100k records (your 100k sum() query might even be performing some on-disk sort/sum operations, which are painfully slow).
But let's think about why we would ever need to pull all 10k sums at once -- only for automated comparisons or reports, since the average human can't remember 10k values at once. In the case of automated comparisons the results of the comparison is very probably another set of aggregate values (percentages or proportions or whatever). In that case you can do the same trick over again, saving the results of the aggregate comparison value so that it can be accessed instantly as well -- and you can use another layer of triggers over that table to maintain its integrity during updates (the triggered updates will cascade).
If you need advice beyond this then you'll have to explain your situation in more detail -- because my imaginary thingy/location/inventory example above is going to be an increasingly inappropriate vehicle for comparison with your real-world problem the more layers to the problem you come up with, and I still know nothing other than your original question "How to speed up sum() over >100k encrypted values".
July 26th, 2013, 01:55 AM
Thank you and really appreciate your patient in understanding my problem and advising the work around.
My real problem is like I mentioned above, I am using PostgreSQL with Mondrian, a OLAP engine which allows users to view the report data in different levels/angles, the report data output can be only one single row (e.g.: sum of all data from 100k of rows), or from 10 to 10,000 rows depends on the users selection. This is the flexibilities that provided by a OLAP engine.
Previously I have no issue in term of performance until I need to encrypt the measures column due to security policy, then issues start to happen as explained above.
Your workaround would be great provided the report output is fixed, then I can just pre-calculated the result, however it will not be suitable to work with OLAP engine.
Thanks again for your kind advise.