Another island my personal website & blog for technical stuff

Using abstract algebra to refine privacy

When we talk about data privacy, we are considering any form of information encoding that converts your data into a representation that is secure and cannot be easily accessed or even interpreted by unapproved agents. There are several components and strategies to it, but clearly, the first thing that comes to mind is cryptography.

Anyone who knows a little bit about the topic already can recall that abstract algebra is fundamental to the development of cryptographic schemes. However, we are used to think about the security it creates only when we have the data stored somewhere or when we are sending it through a network. Every time we want to actually do something with it, like computing statistics, then it seems that we have to access the “real stuff”, don’t we? How could it be any different? That’s what we are going to talk about in this post.

Borrowing ideas from abstract algebra

Let us start by borrowing ideas from our abstract algebra classes and discuss the main concept here. Suppose one has two distinct binary algebraic structures, say $(A, \circ)$ and $(B, \diamond)$, and also a map $\varphi$ between the sets $A$ and $B$ that satisfy the following,

\[\varphi(a_1 \circ a_2) = \varphi(a_1) \diamond \varphi(a_2) \, ,\]

for all $a_1, a_2 \in A$. Under such condition, the function $\varphi$ is called an homomorphism between the corresponding ‘algebras’.

So, it establishes a mapping between the elements in a way that the operations within each set will have a consistent correspondence, with respect to the structure defined by the “first” one. To put it in another way, an homomorphism transforms the underlying sets while preserving the algebraic structure.

That’s the core aspect behind what it’s known as homomorphic encryption: encryption schemes that display homomorphic properties over their associated plaintext and ciphertext spaces, which allow us to work directly on encrypted data, doing certain types of computation without sending it back to its original representation.

A primer on homomorphic encryption

Hopefully, the concept (and motivation) should be clear from the discussion so far. If so, I think now it’s a good place to introduce a more precise definition of this form of encryption.

Definition. Let the 5-tuple $(\mathcal{P}, \mathcal{C}, \mathcal{K}, E, D)$ be an encryption scheme. Suppose that the plaintext and ciphertext spaces, when equipped with particular binary operations, will form the groups $(\mathcal{P}, \diamond)$ and $(\mathcal{C}, \circ)$. If the property

\[E_{k}(p) \circ E_{k}(\tilde{p}) = E_{k}(p \diamond \tilde{p})\]

is valid for any $p, \tilde{p}\in\mathcal{P}$ and every $k\in\mathcal{K}$, then we say that the encryption scheme is homomorphic.

What is an encryption scheme?

Messages lie in the plaintext space $\mathcal{P}$ and, through encryption, are lead to the ciphertext space $\mathcal{C}$. Likewise, it's possible to recover a message from a given ciphertext using the decryption rules. In order to perform such transformations, the scheme provides an encryption algorithm $E$ and a decryption algorithm $D$, both of which are parametrized by the key space $\mathcal{K}$. Therefore, for each $k\in\mathcal{K}$, those algorithms yield specific mappings $E_{k} : \mathcal{P} \to \mathcal{C}$ and $D_{f(k)} : \mathcal{C} \to \mathcal{P}$, where $f(k)$ simply denotes some key depending on $k$. Naturally, one expects that $D_{f(k)} \big( E_{k} (m) \big) = m$ for any message $m \in \mathcal{P}$.

Even if you have a vague understanding of what was just described, I truly believe it is going to be enough to follow. Also, these ideas will become more tangible as we go through an example.

Example: Paillier Encryption Scheme

Here, we are going to explore one particular example of homomorphic encryption. This scheme was created by Pascal Paillier in 1999 and presents a probabilistic public-key encryption. We’ll start by describing the steps for key generation, encryption and decryption. Later, these algorithms will be implemented in code, so we can try them out.

Key generation. First, randomly generate two prime numbers, $p$ and $q$, independently of each other. For practical applications, both should be large. Check if these primes satisfy the following

\[\mathrm{gcd}\big(pq,(p-1)(q-1)\big) = 1 \, ,\]

where $\mathrm{gcd}$ stands for greatest common divisor. If they don’t, generate a new pair of primes and try again. Repeat until the condition is met. After a successful attempt, compute $n = pq$ and $\lambda = \mathrm{lcm}(p-1, q-1)$, with $\mathrm{lcm}$ being the least common multiple.

For the next step, generate a random integer $g\in\mathbb{Z}_{n^2}^{\times}$ and verify if its order is divisible by $n$. This can be done by determining the existence of

\[\mu = \Big[ \mathrm{L}\big( g^{\lambda} \hspace{2.7pt} \mathrm{mod}\,\, n^{2}\big) \Big]^{-1} \hspace{2.7pt} \mathrm{mod}\,\, n \, ,\]

where $\displaystyle \mathrm{L}(u) \triangleq \left\lfloor \frac{u - 1}{n} \right\rfloor$. If it doesn’t exists, choose another $g$ and try again.

Once proper values are decided, given the constraints, the algorithm can then return the public key $(n, g)$ and the private key $(\lambda, \mu)$. As the name implies, the private key must remain secret.

Encryption. Suppose we have a message $m\in\mathbb{Z}_{n}$ that needs to be encrypted. In order to do so, we select a random number $r\in\mathbb{Z}_{n}^{\times}$ and then compute the ciphertext as

\[c = g^{m} r^{n} \hspace{3pt} \mathrm{mod}\,\, n^2 \, .\]

As expected, knowledge of the public key $(n, g)$ is necessary.

Decryption. Assume now that we have a ciphertext $c\in\mathbb{Z}_{n^{2}}^{\times}$ and want to decrypt it. The original message can be recovered from $c$ by computing the following

\[m = \mu \cdot \mathrm{L}\big( c^{\lambda} \hspace{2.7pt} \mathrm{mod}\,\, n^{2}\big) \hspace{5pt} \mathrm{mod}\,\, n \, .\]

Naturally, one needs access to the private key $(\lambda, \mu)$ in order to perform decryption.

These three algorithms define the Paillier encryption scheme in its overall form. However, it can be computationally expensive in certain parts. If performance is of essence in your context, there is a particular case of the scheme that can be applied instead, which is simpler due to clever considerations.

As we’ve said earlier, this is an example of a homomorphic encryption scheme. “Show, don’t tell”, right? Given two messages $m_1, m_2 \in \mathbb{Z}_{n}$, let $c_1 = E_{(n,g)}(m_1)$ and $c_2 = E_{(n,g)}(m_2)$ represent the ciphertexts obtained by applying the encryption algorithm to them. Then, we can verify that

\[\begin{align*} c_{1} \hspace{0.93pt} \odot_{n^{2}} \hspace{0.85pt} c_{2} \,&=\, E_{(n,g)}(m_1) \hspace{0.93pt} \odot_{n^{2}} \hspace{0.85pt} E_{(n,g)}(m_2)\\[9pt] &=\, (g^{m_1} \hspace{0.6pt} {r_{1}}^{n} \hspace{3pt} \mathrm{mod}\,\, n^2) \hspace{0.93pt} \odot_{n^{2}} \hspace{0.85pt} (g^{m_2} \hspace{0.6pt} {r_{2}}^{n} \hspace{3pt} \mathrm{mod}\,\, n^2)\\[9pt] &\equiv_{n^{2}} \hspace{1.8pt} g^{m_1} \hspace{0.6pt} {r_{1}}^{n} g^{m_2} \hspace{0.6pt} {r_{2}}^{n} \hspace{3pt} \mathrm{mod}\,\, n^2\\[9pt] &=\, g^{(m_{1} + m_{2})} \hspace{0.6pt} ({r_{1}} \hspace{1pt}{r_{2}})^{n} \hspace{3pt} \mathrm{mod}\,\, n^2 \hspace{3.5pt}=\hspace{4.5pt} E_{(n,g)}(m_{1} + m_{2}) \, , \end{align*}\]

where the notation $\odot_{k}$ denotes the multiplication modulo $k$. So, we have shown that the Paillier scheme satisfies the following homomorphic property,

\[E_{(n,g)}(m_1) \hspace{0.93pt} \odot_{n^{2}} \hspace{0.85pt} E_{(n,g)}(m_2) = E_{(n,g)}(m_{1} + m_{2}) \, .\]

Therefore, within the context of Paillier encryption, if one performs a multiplication ($\mathrm{mod}\,\, n^{2}$) between any pair of ciphertexts that were computed using the same public key, the result of this operation will be equivalent to the encrypted value of the addition of the original messages. This is a really interesting feature, because, assuming a scenario where you shouldn’t see the original data and would only want to perform summations, then you could work directly with the encrypted data and deliver your results to someone who has the private key and actually needs to see the ‘plaintext’ representation.

Now, in order to experiment with these ideas, we’re going to use a Python class that I’ve wrote to embody the Paillier encryption scheme.

import random
import gmpy2

class Paillier:

    def __init__(self, p, q):

        if not gmpy2.is_prime(p):
            raise ValueError('p is not prime')

        if not gmpy2.is_prime(q):
            raise ValueError('q is not prime')

        self.keygen(p, q)


    def L(self, u):
        return gmpy2.floor_div(u - 1, self.modulus)


    def keygen(self, p, q):

        modulus = gmpy2.mul(p, q)
        euler_totient = gmpy2.mul(p - 1, q - 1)

        if gmpy2.gcd(modulus, euler_totient) != 1:
            raise ValueError('pq and (p - 1)(q - 1) are not coprime')

        self.modulus = modulus
        self.modulus_squared = gmpy2.mul(modulus, modulus)
        self.carmichael = gmpy2.lcm(p - 1, q - 1)

        self.mu = None

        while not self.mu:
            g_candidate = random.randint(0, self.modulus_squared - 1)

            while gmpy2.gcd(g_candidate, self.modulus_squared) != 1:
                g_candidate = random.randint(0, self.modulus_squared - 1)

            try:
                self.mu = pow(
                    base=self.L(
                        int(gmpy2.powmod(g_candidate, self.carmichael, self.modulus_squared))
                    ), 
                    exp=-1, 
                    mod=self.modulus
                )
            except:
                pass

        self.g_base = g_candidate

        self.public_key = (
            int(self.modulus),
            int(self.g_base)
        )

        self.private_key = (
            int(self.carmichael),
            int(self.mu)
        )
    
    
    def encrypt(self, message):

        if message >= self.modulus:
            raise ValueError('Message must be lesser than the modulus')

        r_base = random.randint(0, self.modulus - 1)

        while gmpy2.gcd(r_base, self.modulus) != 1:
            r_base = random.randint(0, self.modulus - 1)
        
        randomized_mod_mult = gmpy2.mul(
            gmpy2.powmod(self.g_base, message, self.modulus_squared),
            gmpy2.powmod(r_base, self.modulus, self.modulus_squared)
        )

        ciphertext = gmpy2.mod(randomized_mod_mult, self.modulus_squared)

        return int(ciphertext)
    
    
    def decrypt(self, ciphertext):

        cipher_exp = int(gmpy2.powmod(ciphertext, self.carmichael, self.modulus_squared))
        
        message = gmpy2.mod(
            gmpy2.mul(self.mu, self.L(cipher_exp)), 
            self.modulus
        )

        return int(message)

It’s clear that this implementation heavily relies on gmpy2, a module written in C that is used to perform arbitrary-precision arithmetic. The selling point here is not only the fact that gmpy2 can handle big integers, which is essential to cryptography, but also that it is very optimized and, therefore, can do certain computations really fast.

For the first draft, I actually wrote everything in vanilla Python and, although it worked fine for small primes, it really struggled with large ones. Honestly, I believe the main bottleneck was computing modular exponentiations, which continued to be a problem even after writing an improved algorithm for it.

Finally, just to set the record straight, the Paillier class is like a “toy”. This is only a tool to help us explore the concepts discussed so far and, because of that, was built in a way that favors convenience. With that said, let us try it!

Consider the primes $p = 7$ and $q = 13$. We can use these specific values to instantiate the class and obtain a concrete representation of the Paillier encryption scheme.

paillier = Paillier(p=7, q=13)

In order to see the keys that were generated in a particular run, just retrieve the public_key and private_key attributes.

paillier.public_key,\
paillier.private_key

In my case, the keys were given by $(n, g) = (91, 3160)$ and $(\lambda, \mu) = (12, 44)$. Recall that $g$ is a random integer and $\mu$ depends on it, so, if we create a new object using the same primes again, these portions of the keys will probably going to be different.

As expected, attributes related to the keys are naturally referenced by the encryption and decryption methods, which we are going to test now. However, before starting, remember that the message $m$ to be encrypted must be an element of $\mathbb{Z}_{n}$, implying $m < n$. In our current situation ($n=91$), this means that our set of possible messages is very restricted, but in a realistic scenario, where $n$ should be large, this is not an actual concern.

message = 27
ciphertext = paillier.encrypt(message)   # 2207 
decrypted = paillier.decrypt(ciphertext) # 27

message = 42
ciphertext = paillier.encrypt(message)   # 3056
decrypted = paillier.decrypt(ciphertext) # 42

message = 89
ciphertext = paillier.encrypt(message)   # 5021
decrypted = paillier.decrypt(ciphertext) # 89

The scheme is probabilistic, so there is a random component to the encryption algorithm. Therefore, if you encrypt the same message twice, the resulting ciphertexts are not (necessarily) going to be the same.

Last, and definitely not least, it remains to see the homomorphic property in action. We are going to introduce a function to represent the operation of modular multiplication,

def modular_multiplication(x, y, k):
    return gmpy2.mod(gmpy2.mul(x, y), k)

just to make our next snippet a bit more readable. In order to test the feature, we start by encrypting two messages, $m_1$ and $m_2$, and then evaluate the multiplication modulo $n^2$ between the encrypted values.

m1 = 17
m2 = 34

c1 = paillier.encrypt(m1) # 4012
c2 = paillier.encrypt(m2) # 6871

c1_c2_modmul = modular_multiplication(c1, c2, paillier.modulus_squared)

By printing the value of c1_c2_modmul, we see that $c_{1} \hspace{0.93pt} \odot_{n^{2}} \hspace{0.85pt} c_{2} = 7284$. Now, we apply the decryption algorithm to this result.

paillier.decrypt(c1_c2_modmul) # 51

So, upon decrypting the ciphertext that came from a direct computation over encrypted messages, we’ve obtained the value $51$, which is precisely the result given by $m_1 + m_2 \, (= 17 + 34)$. Of course, this is no surprise, because it was shown beforehand that this behavior should be expected from the Paillier encryption scheme. Nevertheless, I’m always amazed to see particular examples working just as the analytical results told us they would.

Having understood how to use the Paillier class and, more importantly, how to work directly with the ciphertexts it produces, one can probably imagine some practical applications. Before ending this section, I would like to provide one case for this sort of exercise: calculating gross revenue from individual sales.

Imagine you work in a company as an analyst that helps other departments answer their data-related questions. One day, you receive an e-mail from Accounting requesting the company’s total revenue in 2022 and also a monthly breakdown of it. You go to the database, see a table named Sales and then proceed to query its data. Just to get a feeling of what’s inside, you pull 5 random rows and this is what you see.

SaleID CustomerID TotalPrice SaleDate
f1fe189a-fec… 8003de0a-1d2… 215272844153266… 2022-12-08
fcd7d7a8-6fd… 98bbab69-a65… 922647070091506… 2021-02-01
1686bbf5-88c… ca32bdaf-817… 836295534454147… 2022-12-16
97b4ee9b-743… 323a1aef-ef6… 907573945632322… 2021-07-10
5ea62be9-70c… 2ca57535-8f5… 416338386019049… 2023-02-20

The date field seems fine and the identifiers are just UUIDs, so nothing wrong with that. However, the TotalPrice column is filled with some big integers. Upon inspection, you see that these numbers have around 250 digits. You then go check the documentation and learn that this particular column has been encrypted using the Paillier scheme. The documentation also provides the public key associated with it and explains everything you need to know in order to use it.

\[\begin{align*} n = \,\,&57460714856160950657162022989787\\&04462510120345670661132715166846\\&30316900255862192597161745322508\\&95759630899005677789669118023473\\[10pt] g = \,\,&14766529205354252455009509466902\\&34939610821321951935578695344181\\&58243450135387724072965817151425\\&73844860666891072392740274638400\\&11146238659086843893592520220438\\&18667284600660957360105496696295\\&50988513316268594042489771280122\\&37291207781088037445358609838821 \end{align*}\]

Once you got comfortable with the concepts, it was time to move on with the calculations. If the data wasn’t encrypted, you could just run some SQL queries with SUM and GROUP BY, but now, due to the particularities of the context, you decided it would be easier to work within a Python environment. So, from a local Jupyter notebook, you connect to the database and run a query to only retrieve sales from 2022, loading the result as a pandas.DataFrame in a variable named sales_2022_df.

Basically, now you just need to run the modular_multiplication over a series of ciphertexts. You can easily define a function that receives a list of encrypted values and compute this. For instance,

import functools

def encrypted_sum(ciphertexts, k):
    return functools.reduce(
        lambda x, y: int(modular_multiplication(x, y, k)),
        ciphertexts
    )

Given $n$ from the public key, you readily obtain $n^2$, storing its value in the n_squared variable. Then, in order to find out a ciphertext corresponding to the total revenue in 2022, you execute

encrypted_sum(
    sales_2022_df['TotalPrice'],
    n_squared
)
# 988277300609677363267254662252535353754219383265442...

For determining the (encrypted) monthly revenue in the same year, it’s just as easy. Something along the following lines,

sales_2022_df['SaleYearMonth'] = sales_2022_df['SaleDate'].dt.to_period('M')

sales_2022_df\
    .groupby('SaleYearMonth')\
    .agg(
        MonthlyRevenue=(
            'TotalPrice',
            lambda x: encrypted_sum(x, n_squared)
        )
    )\
    .reset_index()
SaleYearMonth MonthlyRevenue
2022-01 1507096180402141…
2022-02 3182359256095547…
2022-03 1229216614797477…
2022-04 8092279127535406…
2022-05 1007931911466433…
2022-06 2821392593552826…
2022-07 9278337899269485…
2022-08 3017587273617995…
2022-09 2404635809080789…
2022-10 3282675300922324…
2022-11 1946026392773099…
2022-12 2926994193328774…

Finally, you can report these results back to Accounting, where ideally there should be someone who has clearance to access the private key and, therefore, decrypt these answers. People there will eventually see these aggregations in their ‘plaintext’ form; you, on the other hand, never saw it and didn’t need to. It was possible to work with the encrypted data directly, performing computations on them that were equivalent to some other operation on its original representation.

Partially & Fully

As we’ve seen so far, some encryption schemes can display homomorphic properties, which can be used to compute over plaintexts by only manipulating ciphertexts. For a particular example, we explored the Paillier scheme in some detail. I tried to avoid describing others, even citing them throughout the text, in an attempt to prevent “info dumping”. If you want to see other examples, you can look up for unpadded RSA, Goldwasser–Micali or ElGamal.

However, you probably have been wondering about the limitations. In the Paillier scheme, the homomorphism only allow us to use the encrypted data in order to indirectly perform addition over the original messages. So, it’s very clear that your computations are going to be constrained by this single type of operation. Similar restrictions are valid for other cases of homomorphic encryption too. For instance, in the ElGamal scheme, the only property associated lets you combine two ciphertexts in a way that results in an encryption of the product between the underlying plaintexts. Therefore, any application of these ideas are going to be limited to the operations that they support within their homomorphic properties.

Naturally, we could follow-up with:

Is there a scheme which allows arbitrary computations over encrypted data?

Before 2009, the answer would be no. During that year, in Craig Gentry’s PhD thesis, the first homomorphic encryption of such kind would be described. I’m not going in depth here, but the general idea behind Gentry’s breakthrough is that he was able to construct an encryption scheme containing two distinct homomorphic properties that preserve algebraic structures representing the logic gates $\mathrm{AND}$ and $\mathrm{XOR}$. So, ideally, it should support the computation of any arbitrary function. In following years, other approaches appeared as well.

Due to the fact that these schemes can provide either a partial or a complete “range of computability” over the encrypted information, they are typically categorized into two major classes: partially homomorphic encryption (PHE) and fully homomorphic encryption (FHE). Therefore, cryptosystems like Paillier and ElGamal fall under the classification of PHE schemes, while examples of FHE include schemes such as Brakerski-Gentry-Vaikuntanathan (BGV) and Cheon-Kim-Kim-Song (CKKS).

Concluding remarks

Honestly, I believe homomorphic encryption is a powerful concept that adds an extra layer of functionality to our usual understanding of cryptography. Not only you can think about data being encrypted at rest and in transit, but now also remaining encrypted while in use. These homomorphic properties may have significant implications to any data-related context where privacy is a crucial element. Any time someone handles information in a database, when statistics are calculated or when a linear model is built, if they are directly accessing sensitive data in order to do it, then there is a potential to enhance privacy and security by applying these ideas. Also, given the realm of standard data processing tasks, it would probably be necessary to employ a fully homomorphic encryption scheme.

Nevertheless, it’s important to acknowledge that, while there is a lot of potential for applications, any implementation aiming practical usage in a production stage comes along with real challenges. Ongoing research try to address these barriers and, as our knowledge progresses, it is likely that more efficient solutions will appear. Hopefully, we will hear about successful use cases of FHE schemes in the next decades, as one of the many tools trying to establish a higher level of data privacy.

If you want to know more about the topic, here are some references in no particular order:

Of course, you can also visit any of the several URLs that can be found linked throughout the post.