RSA: The algorithm aside, how are we turning a string into an int and vice versa?

Question

Let's say that I want to encrypt the file plain.txt. The very first step is actually to turn the contents of that file (let's say it only contains the string "Hello") into an int. I see Python code like this:

from Crypto.Util.number import bytes_to_long
with open('plain.txt', 'rb') as f:
    flag = f.read()
m = bytes_to_long(flag)

However, I don't quite understand what is going on. Furthermore, when the ciphertext has been decrypted back into the plaintext but is still in number form, I don't see long_to_bytes or anything similar to convert the number into a string. I see

import binascii
binascii.unhexlify('{:x}'.format(m))

This looks completely different from the other code, but it still works. Can someone explain these processes to me so that I understand the inputs and outputs of an encoding algorithm and not just the algorithm itself?

Maarten Bodewes · Answer 1 · 2022-01-04T18:07:25.187

Basically there are two steps to this:

encode the text to bytes - this generally requires a character encoding such as UTF-8 or Latin encoding;
encode the bytes to integer - this is part of the encryption operation in RSA as specified in PKCS#1, and is performed using a function called OS2IP.

In your case the text is obviously already encoded as bytes; files consist of bytes after all, and you are opening the file as a binary file (the b in the rb flag).

OS2IP means octet string to integer primitive. An octet string is nothing more than a byte array. If the bytes are already in the right form then it is just a question of interpreting the bytes as a number, as the computer always handles everything as binary anyway.
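As a rough illustration of what bytes_to_long and the hex-based snippet from the question are doing, here is a minimal sketch that uses only Python's standard library. The helper names os2ip and i2osp are borrowed from the PKCS#1 terminology, but the functions themselves are hypothetical stand-ins written for this example, not library APIs.

```python
import binascii


def os2ip(octets: bytes) -> int:
    """Interpret a byte string as a big-endian unsigned integer."""
    return int.from_bytes(octets, "big")


def i2osp(x: int, length: int) -> bytes:
    """Write a non-negative integer as exactly `length` big-endian bytes."""
    return x.to_bytes(length, "big")


data = "Hello".encode("utf-8")   # step 1: text -> bytes
m = os2ip(data)                  # step 2: bytes -> integer
print(hex(m))                    # 0x48656c6c6f

# The way back: integer -> bytes -> text.
size = (m.bit_length() + 7) // 8
print(i2osp(m, size).decode("utf-8"))  # Hello

# The hex-based variant from the question performs the same conversion,
# but unhexlify needs an even number of hex digits, so pad if required.
hex_digits = "{:x}".format(m)
if len(hex_digits) % 2:
    hex_digits = "0" + hex_digits
print(binascii.unhexlify(hex_digits))  # b'Hello'
```

Both routes treat the bytes as the base-256 digits of one large number, most significant byte first. Note that leading zero bytes vanish in the integer form, which is why the inverse conversion needs to be told the intended length.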

In PKCS#1-based RSA, OS2IP is not applied to the message directly, though: first a security-relevant padding is applied. This would be either the padding defined in PKCS#1 v1.5 or OAEP padding. The padding adds a not-insignificant amount of overhead before the RSA operation itself is applied, so the amount of plaintext that fits in one operation is noticeably smaller than the size of the RSA modulus.
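To get a feeling for the overhead, the following sketch computes the maximum message length per RSA operation from the length limits given in PKCS#1 (k - 11 bytes for v1.5 padding, k - 2*hLen - 2 bytes for OAEP, with k the modulus length in bytes). The function name and its parameters are made up for this example.

```python
def max_message_bytes(modulus_bits: int, scheme: str, hash_bytes: int = 32) -> int:
    """Largest plaintext (in bytes) that fits in one padded RSA operation.

    hash_bytes is the output size of the OAEP hash; 32 assumes SHA-256.
    """
    k = (modulus_bits + 7) // 8          # modulus length in bytes
    if scheme == "pkcs1v15":
        return k - 11                    # PKCS#1 v1.5 encryption padding
    if scheme == "oaep":
        return k - 2 * hash_bytes - 2    # OAEP padding
    raise ValueError("unknown scheme: " + scheme)


print(max_message_bytes(2048, "pkcs1v15"))  # 245
print(max_message_bytes(2048, "oaep"))      # 190
```

So with a 2048-bit key, a single operation carries at most a couple of hundred bytes, which is plenty for a symmetric key but not for a file of arbitrary size.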

This is one reason why files are generally not encrypted using RSA directly. The other main reason is that RSA encryption and especially decryption operations are very inefficient compared to e.g. AES-based encryption. Instead we use a protocol such as PGP, which performs hybrid encryption. RSA in a secure mode of operation has a certain overhead and a maximum message size per operation, so generally a symmetric key is encrypted or derived using RSA instead; this symmetric key is then used to encrypt the data. Symmetric ciphers such as AES operate directly on binary data, so apart from the handling of the IV and padding, the data can be encrypted directly without conversion.

Sam Ginrich · Answer 2 · 2022-04-07T13:43:19.397

EDIT

There are modifications of the RSA algorithm that are considered suitable for long messages, after analyzing known games and attacks.

How to Strengthen the Security of RSA-OAEP, Boldyreva et al.

Original Answer

Your question addresses the block cipher concept. At the core, you have an algorithm which operates on byte blocks of a defined size; according to a defined operation mode, data blocks are preprocessed or post-processed. The easiest mode is the atomic ECB ("Electronic Codebook") mode, the one without 'knowledge' of other blocks, i.e. you split your input stream 'plain.txt' into packets of a defined size, encrypt them sequentially and concatenate the output into an output stream, e.g. 'encrypted.bin'. The transition from text to octets (= bytes) is in your case already taken care of, as you read a binary file. In other operation modes, such as CBC or CTR, input or output from former blocks is an additional input for the block encryption.

These modes cover the structural aspects of fragmentation and memory on the basis of a byte input stream, which you start with as 'plain.txt'.

Now to the processing of blocks: Usually a padding algorithm is applied to an input block, which adds salt and/or mixes input bytes by means of hash functions; "OAEP" is a common candidate. Textbook RSA omits padding.

The next step is the transition from the padding output block to the numeric algorithm, the one you "put aside". This is defined in PKCS#1, as mentioned, as is the mapping back from number space to byte blocks.
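The transition into number space and back can be sketched as a textbook RSA round trip. This is a toy under loud assumptions: the two primes are small, publicly known Mersenne primes picked only so the example runs, there is no padding, and none of it is secure. It needs Python 3.8+ for the three-argument pow with a negative exponent.

```python
# Toy textbook RSA round trip. NOT secure: tiny public primes, no padding.
p = 2**61 - 1                        # assumption: demo prime (Mersenne)
q = 2**89 - 1                        # assumption: demo prime (Mersenne)
n = p * q                            # modulus
e = 65537                            # public exponent
d = pow(e, -1, (p - 1) * (q - 1))    # private exponent (modular inverse)

k = (n.bit_length() + 7) // 8        # modulus length in bytes

block = b"Hello"
m = int.from_bytes(block, "big")     # bytes -> integer (OS2IP)
assert m < n                         # the block must fit below the modulus
c = pow(m, e, n)                     # the numeric algorithm itself
ciphertext = c.to_bytes(k, "big")    # integer -> fixed-size bytes (I2OSP)

# Decryption reverses each step.
c2 = int.from_bytes(ciphertext, "big")
m2 = pow(c2, d, n)
# The receiver has to know the original block length; here it is reused.
recovered = m2.to_bytes(len(block), "big")
print(recovered)                     # b'Hello'
```

The ciphertext is always written with the full modulus length k, whereas the recovered plaintext needs its original length supplied from somewhere, since leading zero bytes are not preserved by the integer form.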

Concerning dimensions: You start with an RSA key length and its associated modulus, which limits the size of the equivalent data blocks. Padding usually decreases the maximum size of user data further. Evaluating this usable block size is an essential input for the fragmentation.

Finally, decryption needs to know the length of the original stream. This must be encoded, too. E.g. you prefix the input stream with its length.
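A minimal sketch of such a length prefix combined with fragmentation might look as follows. The 8-byte big-endian prefix, the zero fill of the last block and the block size of 16 are arbitrary choices for this illustration, and the functions frame and unframe are hypothetical; each resulting block would then be handed to the per-block encryption.

```python
import struct


def frame(stream: bytes, block_size: int) -> list:
    """Prefix the stream with its length, zero-fill, and split into blocks."""
    framed = struct.pack(">Q", len(stream)) + stream    # 8-byte length prefix
    framed += b"\x00" * (-len(framed) % block_size)     # fill the last block
    return [framed[i:i + block_size]
            for i in range(0, len(framed), block_size)]


def unframe(blocks: list) -> bytes:
    """Join the blocks and cut the stream back to its recorded length."""
    framed = b"".join(blocks)
    (length,) = struct.unpack(">Q", framed[:8])
    return framed[8:8 + length]


blocks = frame(b"Hello, this is plain.txt", block_size=16)
print(len(blocks))       # 2
print(unframe(blocks))   # b'Hello, this is plain.txt'
```

Because the length travels with the data, the fill bytes in the last block can be discarded unambiguously after decryption.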

Note: It's politically incorrect to apply block cipher modes to RSA, and some people will rather suggest that you limit your text file size. Listen to them if you think they solve your problem. :) Anyway, a stream will attract the block cipher concept.
