#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2011
    Posts
    3
    Rep Power
    0

    Searchable encryption


    Hi,
    we have a management application in our office, and I was asked to implement client-side encryption with the ability to search the data. I am not a cryptography expert, so I am asking for help.
    The basic idea is to use symmetric encryption. Before the text (usually small, around 1 KB on average) is sent to the server, I will split it into words (English words, separated by spaces) and pad each word with 0s to 16 bytes. Words longer than 16 bytes will be split into 16-byte blocks, and the last block will be padded. Each word will then be encrypted individually with 128-bit AES and sent to the server. The server will keep a dictionary of encrypted words and associate each word with the record that was inserted into the database, i.e. it will build a keyword index (like the MySQL full-text search option, or Sphinx).
    When the user wants to search the texts for a particular keyword, he or she will enter the keyword, and the client will encrypt it the same way and send it to the server as the query.
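    For illustration, here is a minimal sketch of just the splitting and padding step described above. It is written in Python rather than the JavaScript the client would use, the name `word_blocks` is made up, and "padded with 0s" is assumed to mean zero bytes. The AES step is left out because Python's standard library does not include AES.

```python
BLOCK = 16  # AES block size in bytes


def word_blocks(text: str) -> list[bytes]:
    """Split text on whitespace and zero-pad every word into 16-byte blocks."""
    blocks = []
    for word in text.split():
        raw = word.encode("utf-8")
        # Words longer than 16 bytes are cut into 16-byte chunks;
        # only the last chunk of a word actually needs padding.
        for i in range(0, len(raw), BLOCK):
            chunk = raw[i:i + BLOCK]
            blocks.append(chunk.ljust(BLOCK, b"\x00"))
    return blocks


blocks = word_blocks("search internationalization now")
for block in blocks:
    print(block)
```

    Each of these blocks would then go through the cipher on its own, so the same word always yields the same block, and therefore the same ciphertext under a fixed key.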
    My question (since I am not an expert on the subject) is: is this method somehow bad? Is there some way to break this encryption? I have been searching Google and found a lot of documents full of mathematical formulas. Some of them propose creating the index on the client, but I don't understand why they don't implement the method I am describing. It looks so simple.
    The obvious disadvantage of the method is that the storage for the keyword index can grow a lot, but I guess this is the price to pay for security plus searchable data. I don't see any problem with it, given that disk cost is getting lower every day. Another disadvantage is that you won't be able to use wildcards; you will have to search by exact keyword. But at least you can search.

    Do you guys have any comments or ideas?
    Any input would be much appreciated.
    Regards
  2. #2
  3. Banned ;)
    Devshed Supreme Being (6500+ posts)

    Join Date
    Nov 2001
    Location
    Woodland Hills, Los Angeles County, California, USA
    Posts
    9,607
    Rep Power
    4247
    Yes, the method is bad. For one, it is vulnerable to a known-plaintext attack. Also, since you have a dictionary of keywords and there may not be so many of them, it is pretty easy to guess what's going on. Lemme explain: Say that I know your server uses the keyword "SELECT" for a command. When I see the word "LMURTZAXY" going across the network a lot and always at the beginning of the transaction, it is a pretty safe bet that LMURTZAXY = SELECT.

    Worse, if I'm a customer of yours and have a copy of your client program, I can send a bunch of keywords to the server and record what the encrypted versions are and build a table of plaintext keywords and their encrypted versions. I could also potentially decompile the client and get my hands on the AES key. After that, decryption becomes easy and I can decode other people's transactions as well.
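    The table-building attack in the paragraph above can be sketched as follows. This is only an illustration: HMAC-SHA256 stands in for "encrypt one padded word with a fixed key" (it is not AES, but it is deterministic in the same way), and the key, the word list, and the names `deterministic_token` and `build_codebook` are all made up.

```python
import hashlib
import hmac


def deterministic_token(key: bytes, word: str) -> str:
    # Stand-in for the per-word encryption step: same key + same word -> same output.
    return hmac.new(key, word.encode("utf-8"), hashlib.sha256).hexdigest()


def build_codebook(encrypt, candidates):
    """Feed candidate words to the client and record plaintext for each output."""
    return {encrypt(word): word for word in candidates}


shared_key = b"key baked into the client"  # hypothetical key the attacker can use but not read


def oracle(word: str) -> str:
    # The attacker only needs to *run* the client, not to know the key.
    return deterministic_token(shared_key, word)


codebook = build_codebook(oracle, ["invoice", "salary", "password", "meeting"])

observed = oracle("salary")      # a token seen on the wire or in the index
print(codebook.get(observed))    # looks the token up in the attacker's table
```

    The table only works against data protected with the same key, which is why the next point about changing keys matters.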

    Therefore, what you need is something where the encryption key changes on a per-session basis. For example, my keyword "FOO" currently encrypts to "BLARGH", but in my next session "FOO" encrypts to "ZRRTXBRE" because the key is different. It then becomes impractical to build a dictionary of known plaintexts and their encrypted equivalents.

    Luckily, there are already technologies that do this (SSL and TLS), so you don't have to reinvent the wheel. Basically, when the initial connection is established, the protocol makes the client and server use asymmetric encryption (which takes more computing power) to exchange a one-time random key for this session. Once the random key is exchanged, both client and server use symmetric encryption (which is less processor intensive) with this random key for the duration of the session.

    All you need to do then is use an SSL or TLS library in your code and voila, your transaction is encrypted and you can proudly tell your customers that you use an industry-standard encryption method!
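    As a rough sketch of what "use a TLS library" looks like in practice, here is how a client could open a TLS connection with Python's standard `ssl` module. The helper names `make_tls_context` and `open_tls_connection` are made up for this example, and the TLS 1.2 minimum is an assumption, not a requirement from this thread. The key exchange and the switch to symmetric encryption happen inside the library during the handshake.

```python
import socket
import ssl


def make_tls_context() -> ssl.SSLContext:
    # The default context verifies the server certificate and hostname.
    ctx = ssl.create_default_context()
    # Assumption for this sketch: refuse anything older than TLS 1.2.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


def open_tls_connection(host: str, port: int = 443) -> ssl.SSLSocket:
    """Connect to host:port and run the TLS handshake."""
    raw = socket.create_connection((host, port), timeout=10)
    # wrap_socket performs the handshake; the session key is negotiated here.
    return make_tls_context().wrap_socket(raw, server_hostname=host)
```

    Everything sent through the returned socket is encrypted in transit, but the server still sees the plaintext once it arrives.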
    Last edited by Scorpions4ever; August 30th, 2011 at 12:37 PM.
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2011
    Posts
    3
    Rep Power
    0
    Originally Posted by Scorpions4ever
    Yes, the method is bad. For one, it is vulnerable to a known-plaintext attack. Also, since you have a dictionary of keywords and there may not be so many of them, it is pretty easy to guess what's going on. Lemme explain: Say that I know your server uses the keyword "SELECT" for a command. When I see the word "LMURTZAXY" going across the network a lot and always at the beginning of the transaction, it is a pretty safe bet that LMURTZAXY = SELECT.
    Thanks for your reply, but I will defend my position.
    See, I agree that it is vulnerable to a known-plaintext attack, but I will not send the words of the protocol in encrypted form. I will encrypt only the data. For example, if I am storing email on the untrusted server, I will not encrypt the "Subject:" and "Body" field names; I will encrypt only the content of the fields, i.e. the value of Subject and the value of Body. This way you will know I am storing an email, but it will be difficult to guess what the content of the email is. You might guess that the Subject is a summary of what the Body contains, but I don't think it will be easy to decrypt even knowing that.

    Originally Posted by Scorpions4ever
    Worse, if I'm a customer of yours and have a copy of your client program, I can send a bunch of keywords to the server and record what the encrypted versions are and build a table of plaintext keywords and their encrypted versions. I could also potentially decompile the client and get my hands on the AES key. After that, decryption becomes easy and I can decode other people's transactions as well.
    Well, the source code is open. It is JavaScript, and I am using the "Movable Type Scripts" code to encrypt with 128-bit AES. The customer knows what the key is because he enters it, but you will never know it unless you somehow hack the customer's machine and get it from the browser while the session is still open, because as soon as the window is closed, the variables that hold the key in memory are destroyed.
    Originally Posted by Scorpions4ever
    Therefore, what you need is something where the encryption key changes on a per-session basis. For example, my keyword "FOO" currently encrypts to "BLARGH", but in my next session "FOO" encrypts to "ZRRTXBRE" because the key is different. It then becomes impractical to build a dictionary of known plaintexts and their encrypted equivalents.

    Luckily, there are already technologies that do this (SSL and TLS), so you don't have to reinvent the wheel. Basically, when the initial connection is established, the protocol makes the client and server use asymmetric encryption (which takes more computing power) to exchange a one-time random key for this session. Once the random key is exchanged, both client and server use symmetric encryption (which is less processor intensive) with this random key for the duration of the session.

    All you need to do then is use an SSL or TLS library in your code and voila, your transaction is encrypted and you can proudly tell your customers that you use an industry-standard encryption method!
    Yes, I know SSL uses symmetric encryption, but the problem is that as soon as you rely on SSL, this is no longer client-side encryption. The admin of the web server might modify the source code of OpenSSL (or whatever library is in use) to secretly copy the unencrypted data to another location. This is the main reason I discarded SSL from the beginning.


    Regards
  6. #4
  7. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    May 2007
    Posts
    765
    Rep Power
    929
    Originally Posted by nulik
    See, I agree that it is vulnerable to a known-plaintext attack, but I will not send the words of the protocol in encrypted form. I will encrypt only the data. For example, if I am storing email on the untrusted server, I will not encrypt the "Subject:" and "Body" field names; I will encrypt only the content of the fields, i.e. the value of Subject and the value of Body. This way you will know I am storing an email, but it will be difficult to guess what the content of the email is. You might guess that the Subject is a summary of what the Body contains, but I don't think it will be easy to decrypt even knowing that.
    If someone can send data to the user that they will then put into your application, they can break it. If they know what data the user is inputting, they can break it. If they can send data as the user (say, through a cross-site scripting vulnerability), they can break it. They can do word-frequency analysis and break it (and on top of that, a section like "Body" will often end with a name and a closing like "Thanks" or "Sincerely"). There is no way this can be secure.
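    To make the word-frequency point concrete, here is a small sketch. HMAC-SHA256 is used only as a stand-in for any deterministic per-word transformation (it is not the AES scheme from the first post), and the key, the sample sentence, and the name `token` are invented for the example.

```python
from collections import Counter
import hashlib
import hmac


def token(key: bytes, word: str) -> str:
    # Deterministic stand-in: the same word always maps to the same token.
    return hmac.new(key, word.encode("utf-8"), hashlib.sha256).hexdigest()


key = b"known only to the customer"  # the observer never sees this
body = "the report is late and the budget is late and the team knows"

# This list is all the server (or an eavesdropper) gets to see.
tokens = [token(key, word) for word in body.split()]

counts = Counter(tokens)
most_common_token, hits = counts.most_common(1)[0]
print(hits, "occurrences of", most_common_token[:16], "...")
```

    Without the key, an observer can still see that one token appears more often than any other and guess that it is a common English word such as "the". The same reasoning extends to predictable closings like "Thanks" at the end of a body.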

    Encrypting each word separately makes your system more vulnerable than using plain ECB encryption, which is already considered insecure.

    Originally Posted by nulik
    Yes, I know SSL uses symmetric encryption, but the problem is that as soon as you rely on SSL, this is no longer client-side encryption. The admin of the web server might modify the source code of OpenSSL (or whatever library is in use) to secretly copy the unencrypted data to another location. This is the main reason I discarded SSL from the beginning.
    If you can't trust your server then you have much, MUCH bigger problems. That rogue admin could just replace your HTML & JavaScript files with ones that don't encrypt at all, for instance.


    On top of all that, you won't get an effective full-text search anyway. Files containing either "cat" or "cats" should both be returned from a search for "cat", but the encryption will render the word-stem algorithm in the db useless. You'll run into similar problems with capitalization, punctuation, spelling errors, composing characters and so on.
  8. #5
  9. No Profile Picture
    Lost in code
    Devshed Supreme Being (6500+ posts)

    Join Date
    Dec 2004
    Posts
    8,317
    Rep Power
    7170
    Originally Posted by OmegaZero
    If you can't trust your server then you have much, MUCH bigger problems. That rogue admin could just replace your HTML & JavaScript files with ones that don't encrypt at all, for instance.
    To add to this: even though you're using client-side encryption, the encryption is done in the browser using JavaScript, so someone with control of the server could still inject JavaScript into your page that steals the password, which negates the benefit of encrypting on the client. For this reason, your approach really has no benefits over SSL + server-side encryption.

    As far as I know, what you're trying to do with searching is not mathematically possible to do securely. I could very well be wrong though, and if so I would be extremely interested in learning how to do this.
    PHP FAQ

    Originally Posted by Spad
    Ah USB, the only rectangular connector where you have to make 3 attempts before you get it the right way around
  10. #6
  11. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2011
    Posts
    3
    Rep Power
    0
    Originally Posted by E-Oreo
    As far as I know, what you're trying to do with searching is not mathematically possible to do securely. I could very well be wrong though, and if so I would be extremely interested in learning how to do this.
    well, then check this thread:

    google query: searching encrypted data sphinx

    Basically, you use HMAC with SHA-256 to create a hash of each word. You then send the hashes to the "untrusted" server (well, it is not really untrusted; it is like trusting Google as a company while knowing that some employee may occasionally do a 'select' on your Gmail account), and it can search fine without knowing what the data is.
    And you have to use SSL, of course, to ensure that it is YOUR JavaScript that is being executed. I believe a complete solution will eventually be found, but it is very difficult.
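    A minimal sketch of the HMAC approach described above, assuming the key never leaves the client. The `TokenIndex` class, the `keyword_token` function, and the sample records are made up for illustration, and it is written in Python rather than browser JavaScript.

```python
from collections import defaultdict
import hashlib
import hmac


def keyword_token(key: bytes, word: str) -> str:
    """Client side: turn one word into an opaque search token."""
    return hmac.new(key, word.encode("utf-8"), hashlib.sha256).hexdigest()


class TokenIndex:
    """Server side: maps opaque tokens to record ids, never sees words or the key."""

    def __init__(self) -> None:
        self._postings = defaultdict(set)

    def add(self, record_id: int, tokens: list[str]) -> None:
        for tok in tokens:
            self._postings[tok].add(record_id)

    def search(self, tok: str) -> set[int]:
        return set(self._postings.get(tok, set()))


key = b"secret entered by the user"  # hypothetical client-only key
index = TokenIndex()
records = {1: "invoice for march", 2: "meeting notes", 3: "late invoice reminder"}
for record_id, text in records.items():
    index.add(record_id, [keyword_token(key, w) for w in text.split()])

print(index.search(keyword_token(key, "invoice")))
```

    The server can answer exact-match queries this way, but it still learns which tokens repeat and which records share them, so the frequency concerns raised earlier in the thread are not removed by switching from AES to HMAC.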

    Originally Posted by OmegaZero
    On top of all that, you won't get an effective full-text search anyway. Files containing either "cat" or "cats" should both be returned from a search for "cat", but the encryption will render the word-stem algorithm in the db useless. You'll run into similar problems with capitalization, punctuation, spelling errors, composing characters and so on
    It can be done (check the reference I posted above), but it will require a lot of work on the client side, especially with foreign languages. Capitalization is easy: you just need to convert the word to CAPS before you generate its hash.
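    A sketch of that client-side normalization, with the caveat that the exact rules are assumptions: here a word is Unicode-normalized (NFC), stripped of leading and trailing ASCII punctuation, and upper-cased before hashing. The function names are invented for the example.

```python
import hashlib
import hmac
import string
import unicodedata


def normalize(word: str) -> str:
    # Compose characters first so "e" + combining accent equals the precomposed letter.
    composed = unicodedata.normalize("NFC", word)
    # Drop surrounding ASCII punctuation, then fold case by upper-casing.
    return composed.strip(string.punctuation).upper()


def keyword_token(key: bytes, word: str) -> str:
    return hmac.new(key, normalize(word).encode("utf-8"), hashlib.sha256).hexdigest()


key = b"secret entered by the user"  # hypothetical client-only key

same_case = keyword_token(key, "Cat,") == keyword_token(key, "CAT")
same_accent = keyword_token(key, "cafe\u0301") == keyword_token(key, "caf\u00e9")
same_stem = keyword_token(key, "cats") == keyword_token(key, "cat")
print(same_case, same_accent, same_stem)
```

    Capitalization, punctuation, and composing characters can be handled this way, but "cat" and "cats" still produce different tokens. Stemming would have to be done on the client as well, before hashing, which is the extra work mentioned above.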
  12. #7
  13. No Profile Picture
    Lost in code
    Devshed Supreme Being (6500+ posts)

    Join Date
    Dec 2004
    Posts
    8,317
    Rep Power
    7170
    Originally Posted by nulik
    well, then check this thread:
    google query: searching encrypted data sphinx
    The people in that thread told you pretty much exactly the same thing that we did.