
I have a database table that needs to be shared between multiple parties. There is a sensitive column that must be obfuscated in a way that preserves equality relationships between records. A trusted party must have the ability to reverse the obfuscation. Is there a composition of cryptographic primitives that satisfies these requirements?

By “equality relationship” I mean that for any two records A and B the obfuscation function, F, should respect F(A) = F(B) if A = B. A cryptographic hash function, such as SHA2, preserves equality but is not reversible. We need some form of encryption.

My first thought was to use a cascade of asymmetric encryption schemes and encrypt records with the trusted party's public key followed by my private key [1]. I tried using a modern implementation of RSA which implicitly applied OAEP, so the ciphertext became non-deterministic and we lost equality relationships. My understanding is that RSA without OAEP is inherently vulnerable, but my knowledge isn’t deep enough to understand why.

My reading of some entry-level encryption literature suggests that the goal of OAEP is to make the system “semantically secure”, which means that no information about plaintext messages can be derived from encrypted messages, including equality. I understand why this property is desirable in most contexts, but in our application we specifically want to preserve equality (and only equality).

Are there other designs that result in encryptions that are only reversible by a trusted party but the equality relationships are preserved across encrypted messages?

[1] My understanding is that this mitigates the risk of a rainbow table attack, assuming neither private key is compromised, because a chosen plaintext cannot be encrypted through both levels of the cascade without a private key.

EDIT: It has been requested that I expand on the use case to help better motivate the need for deterministic encryption. We are attempting to create a decentralized (and open source) tokenization scheme for obfuscating patient PII in health data prior to disclosing data to third parties. This is part of a larger suite of privacy tools focused on data sharing.

Here are a couple instances of similar technologies that are commercialized. To my knowledge, all are proprietary and don't allow the organizations that are sharing tokens to manage their own public/private encryption keys.

Erp12

4 Answers


It sounds like you are looking for 'deterministic public key encryption'.

We typically don't recommend it, precisely because it preserves equality. That is, someone with the public key and a guess of the plaintext can encrypt that guess and see whether it matches the ciphertext they've been given, so it can't meet 'semantic security'.

However, you're willing to live with that. And it turns out that turning standard (nondeterministic) public key encryption into deterministic public key encryption is pretty easy (from a cryptographic standpoint; the crypto APIs that you have might make it trickier).

What any standard public key encryption method does is take the public key, the message and some random bits (to add nondeterminism), churn on them for a while, and out pops the ciphertext. What you need to do is replace the random bits with something nonrandom (but different for each message; using the same random bits for different messages is likely to leak something).

One obvious approach to do this is take the message, toss it into SHAKE [1], extract as many 'random bits' as you need, and use those to encrypt. The bits to encrypt a particular message will look random (and be unrelated to the bits used to encrypt any other message). However, if you encrypt the same message again, you'll get the same random bits, and so you'll end up with the same ciphertext.

[1] Or any secure hashing method that generates a sufficiently long output.
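A minimal sketch of that idea, assuming textbook ElGamal over a toy 127-bit prime group as a stand-in for a real public key scheme. The group parameters and all function names are illustrative assumptions, not a vetted construction or any library's API. The per-message "random bits" are taken from SHAKE256 over the public key and the message:

```python
import hashlib

# Toy parameters, for illustration only: a 127-bit prime is far too small
# for real use, and textbook ElGamal is not a vetted encryption scheme.
P = 2**127 - 1
G = 3


def keygen(secret: int) -> int:
    """Return the public key matching a secret exponent."""
    return pow(G, secret, P)


def derive_coins(public_key: int, message: bytes) -> int:
    """Replace the usual random exponent with SHAKE256(public key, message)."""
    digest = hashlib.shake_256(public_key.to_bytes(16, "big") + message).digest(32)
    return 1 + int.from_bytes(digest, "big") % (P - 2)


def encrypt(public_key: int, message: bytes) -> tuple:
    """Deterministic ElGamal: the same key and message give the same ciphertext."""
    m = int.from_bytes(message, "big")
    if not 0 < m < P:
        raise ValueError("message must encode to an integer in (0, P)")
    k = derive_coins(public_key, message)
    return pow(G, k, P), (m * pow(public_key, k, P)) % P


def decrypt(secret: int, ciphertext: tuple) -> bytes:
    """Only the holder of the secret exponent can recover the message."""
    c1, c2 = ciphertext
    shared = pow(c1, secret, P)
    m = (c2 * pow(shared, -1, P)) % P  # modular inverse needs Python 3.8+
    return m.to_bytes((m.bit_length() + 7) // 8, "big")
```

In this sketch, encrypting `b"1985-03-14"` twice under the same public key yields the same pair, while a different date yields an unrelated pair. Messages are limited to 15 bytes and must not start with a zero byte, which is a limitation of the toy encoding only.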

poncho

A trusted party must have the ability to reverse the obfuscation.

We will call that party Admin.

We need some form of encryption.

I don't find that the OP makes a compelling argument to support that assertion.

This sounds like a database problem. We need a cryptographically secure hash function, such as SHAKE, SHA3, or the SHA2 that you mentioned. And multiple storage locations, for the Public and for Admin.

Admin chooses a private key priv, perhaps one or two 128-bit GUIDs.

Admin creates a private, secure table (or column, schema, database) in a secure location, inaccessible to the Public. It has hash and plaintext columns. Store in hash a suitably long prefix of SHA3(priv, plaintext), so collisions are unlikely. Or use a keyed HMAC if preferred.

Admin copies hash into the relevant Public table, where anyone can copy it and can make equality tests.

Since this is an RDBMS-oriented solution, Admin is free to put a UNIQUE index on each of the hash and plaintext columns, for efficient lookups when the time comes. Admin is also free to create a FK relationship out to the Public table, create VIEWs containing JOINs, and so on.
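The steps above could be sketched as follows, assuming HMAC-SHA256 as the keyed hash and an in-memory SQLite database standing in for Admin's private store. The table name `token_map` and the helper names are hypothetical:

```python
import hashlib
import hmac
import secrets
import sqlite3

# Admin's secret key; it never leaves the secure location.
priv = secrets.token_bytes(32)

# Admin's private lookup table (an in-memory database for this sketch).
admin_db = sqlite3.connect(":memory:")
admin_db.execute(
    "CREATE TABLE token_map ("
    " hash TEXT NOT NULL UNIQUE,"
    " plaintext TEXT NOT NULL UNIQUE)"
)


def make_hash(plaintext: str) -> str:
    """Keyed hash of the sensitive value; equal inputs give equal outputs."""
    return hmac.new(priv, plaintext.encode("utf-8"), hashlib.sha256).hexdigest()


def register(plaintext: str) -> str:
    """Store (hash, plaintext) privately and return the hash for the Public table."""
    token = make_hash(plaintext)
    admin_db.execute(
        "INSERT OR IGNORE INTO token_map (hash, plaintext) VALUES (?, ?)",
        (token, plaintext),
    )
    return token


def reverse(token: str):
    """Admin-only lookup from a published hash back to the plaintext."""
    row = admin_db.execute(
        "SELECT plaintext FROM token_map WHERE hash = ?", (token,)
    ).fetchone()
    return row[0] if row else None
```

Only the value returned by `register` is copied to the Public table. Reversal is a lookup rather than a decryption, so it works only for values Admin has already registered.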


key rotation

Admin might choose to rotate the private key from time to time. Use a double buffered pair of hash columns, or a many-to-one relationship, to support having both "old" and "new" hashes for a brief time while re-hashing is in progress. Then delete the old ones.


correlation

For many patients there may be a close correspondence between their SSN and the credit card number they usually use, and their hashes are exposed to the Public. Each co-occurrence of hashes induces an edge in a graph.

Suppose a merchant transaction for Alice's meal produces a Public record mentioning that credit card hash. An honest-but-curious merchant could chase through graph edges to find hashes of other cards Alice uses, and to find e.g. Alice's recent health care providers. You will want to consider whether such threat exposure makes a difference to your use case.

Depending on the number and nature of your participants, you may want to abandon the notion of Public access in favor of a database or cryptographic technique that lets only relevant parties make equality tests.

J_H

Are there asymmetric encryption algorithms that preserve equality?

As explained in @poncho's answer, that property is unusual for asymmetric encryption since it causes a severe flaw: it's possible to check a guess of the plaintext by trial encryption followed by comparison with the ciphertext.

That flaw is unavoidable, essentially because we want to preserve equality while keeping the way to produce the ciphertext from the plaintext public. It does not come from wanting an authority to be able to decipher: the latter is adequately solved by asymmetric encryption (only at the expense of a larger ciphertext than for symmetric encryption or hashing).

E.g. for a birthdate in the 20th or 21st century, if the encryption key and a ciphertext are known, it takes $<45000\approx2^{15.5}$ trial encryptions to find the plaintext (and far fewer than half that on average in most practical applications, since birthdates won't be uniformly distributed). Worse, if the encryption key and $n$ ciphertexts are known, we can find all $n$ plaintexts with the same $<2^{15.5}$ trial encryptions.
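To make the enumeration concrete, here is a rough sketch. It assumes birthdates are ISO-formatted strings between 1900-01-01 and 2022-12-31 (an assumed range), and it uses a plain SHA-256 call as a stand-in for any deterministic function that anyone can compute from public information. All names are illustrative:

```python
import hashlib
from datetime import date, timedelta


def public_deterministic_encrypt(plaintext: str) -> str:
    """Stand-in for any deterministic encryption computable with public data."""
    return hashlib.sha256(b"public-key-stand-in|" + plaintext.encode()).hexdigest()


def build_dictionary(first: date, last: date) -> dict:
    """Trial-encrypt every candidate date once; maps ciphertext -> plaintext."""
    table = {}
    day = first
    while day <= last:
        text = day.isoformat()
        table[public_deterministic_encrypt(text)] = text
        day += timedelta(days=1)
    return table


# One pass over fewer than 45000 candidates covers every possible record.
dictionary = build_dictionary(date(1900, 1, 1), date(2022, 12, 31))


def recover(ciphertext: str):
    """Look up an observed ciphertext; no further encryption work is needed."""
    return dictionary.get(ciphertext)
```

The dictionary is built once and then breaks any number of ciphertexts by lookup, which is why knowing $n$ ciphertexts costs the adversary no more trial encryptions than knowing one.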

Possible mitigation methods for that flaw, with several often combined in practice:

  1. Keep the ciphertext secret. That works, but often is functionally unacceptable.
  2. Keep the encryption key secret. One idea is to keep and use the key in some trusted thing (Smart Card, HSM, server); but that can be costly and/or inconvenient, fails horribly if adversaries observe communication with the thing (they get the plaintext), and is ineffective if they can freely use the thing to perform trial encryption.
  3. Try to keep the method secret. That helps to some degree, but goes against Kerckhoffs's principle, which governs academic cryptography. In particular, it is to be feared that public executable code can be reverse-engineered as far as necessary to reuse the encryption part, which also breaks attempts to hide the key in the software to achieve 2, including with hypothetical secure white-box cryptography.
  4. Make encryption mildly slow and/or costly to discourage plaintext enumeration, using Memory-Hard Key Derivation Functions (also known as Memory-Hard Password-Based Hash Functions). The academic state of the art is Argon2, and its predecessor scrypt is good too. That helps to a quantifiable degree, at a cost in time and energy that's often practically acceptable: e.g. with encryption parametrized to last 4 seconds for a 4-core CPU with 2 MiB free RAM, consuming 50W, for a negligible energy cost of 0.000015€ at a consumer rate. Encrypting $2^{15.5}$ dates may cost an adversary with a slightly better computer a day and 1€ in energy.
  5. Widen the plaintext space as much as functionally acceptable. E.g. encrypt the concatenation of credit card number and expiry date, rather than these two fields separately. That improves the security given by 4. A variant is to ease the "preserve equality" requirement to as small subsets of the input as functionally acceptable, e.g. encrypt the concatenation of blood type and date (even if blood type remains in clear alongside, that's still beneficial against effort to break many records by trial encryption).
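Mitigations 4 and 5 can be combined in a short sketch. It assumes scrypt from Python's standard library rather than Argon2 (which is not in the standard library), with modest example cost parameters that a real deployment would tune upward. The function name and the `context` value are hypothetical:

```python
import hashlib
from typing import Sequence


def slow_equality_tag(fields: Sequence[str],
                      context: bytes = b"example-dataset-v1") -> str:
    """Memory-hard, deterministic tag over the concatenation of several fields.

    Concatenating fields widens the plaintext space (mitigation 5); scrypt
    makes each trial costly for an enumerating adversary (mitigation 4).
    """
    # Length-prefix each field so ("ab", "c") and ("a", "bc") cannot collide.
    material = b"".join(
        len(f.encode("utf-8")).to_bytes(4, "big") + f.encode("utf-8")
        for f in fields
    )
    digest = hashlib.scrypt(
        material,
        salt=context,          # public and fixed, so equal inputs stay equal
        n=2**14, r=8, p=1,     # example cost: roughly 16 MiB per evaluation
        maxmem=64 * 1024 * 1024,
        dklen=32,
    )
    return digest.hex()
```

The output could serve as an equality-preserving tag on its own, or as the derived "random" input to an asymmetric scheme when the trusted party must also be able to decrypt.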

Use a cascade of asymmetric encryption schemes and encrypt records with the trusted party's public key followed by my private key.

"Encryption with a private key" is one or several of

  • Pointless from the standpoint of confidentiality if it can be undone with the matching public key and that key is public, as implied by its name.
  • An error in terminology: one signs with a private key, which indicates that the goal is ensuring integrity and proof of origin.
  • At odds with "preserve equality", if there are several signer private keys.
  • Impossible, for many standard asymmetric encryption schemes, and a fair fraction of implementations of RSA (on purpose so that users can't shoot themselves in the foot, and/or because public and private RSA keys do not actually use the same format).

Are asymmetric encryption algorithms that preserve equality safe to use for data obfuscation?

No for low-entropy plaintext (e.g. forget about it, or any method that preserves equality, for sex or blood type). They work to some degree for plaintext that is hard to guess exactly, and the cost of attack can be raised using Memory-Hard Key Derivation Functions, at the expense of raising the cost of legitimate use.

A generic method is to start from a standard secure asymmetric encryption algorithm, and replace its internal randomness by the output of Argon2 with the plaintext as input.

RSAES-OAEP would work, but the cryptogram is rather large (e.g. 256 bytes). Elliptic Curve crypto can reduce that to e.g. 40 bytes.

I don't know that asymmetric cryptography or MHKDF are used in the context of obfuscating patient PII in health data, even though that could be an incremental improvement over what I know of current practice.


I hereby put in the public domain whatever novelty there is in the present post, and grant every legal entity a free, worldwide, perpetual, nonexclusive license to use such novelty.

fgrieu

This sounds like you’re trying to solve two different requirements, so with respect to separation of concerns, one approach would be to split the implementation in two: obfuscate the data through encryption on one hand, and preserve equality through a hash value on the other.

The two values could be stored in one column or in separate columns, depending on how they need to be processed further.
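A rough sketch of that split, under two assumptions that are not in the answer itself: the hash is keyed (HMAC-SHA256) so that outsiders cannot test guesses against it, and a toy SHAKE256-based stream cipher stands in for whatever real randomized encryption would be used. All names are illustrative:

```python
import hashlib
import hmac
import secrets


def equality_tag(tag_key: bytes, plaintext: bytes) -> str:
    """Deterministic keyed hash: equal plaintexts give equal tags."""
    return hmac.new(tag_key, plaintext, hashlib.sha256).hexdigest()


def toy_encrypt(enc_key: bytes, plaintext: bytes) -> bytes:
    """Toy randomized cipher (NOT for real use): fresh nonce on every call."""
    nonce = secrets.token_bytes(16)
    stream = hashlib.shake_256(enc_key + nonce).digest(len(plaintext))
    return nonce + bytes(a ^ b for a, b in zip(plaintext, stream))


def toy_decrypt(enc_key: bytes, blob: bytes) -> bytes:
    """Inverse of toy_encrypt, available only to the holder of enc_key."""
    nonce, body = blob[:16], blob[16:]
    stream = hashlib.shake_256(enc_key + nonce).digest(len(body))
    return bytes(a ^ b for a, b in zip(body, stream))


def protect(enc_key: bytes, tag_key: bytes, plaintext: bytes) -> dict:
    """Produce the two values, e.g. for two separate columns."""
    return {
        "ciphertext": toy_encrypt(enc_key, plaintext),
        "tag": equality_tag(tag_key, plaintext),
    }
```

Two records with the same plaintext then share a `tag` but carry different `ciphertext` values, so equality tests use one column and reversal by the trusted party uses the other.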

not2savvy