
I need to encrypt a live audio-stream that is being sent over the internet. The details are:

  • The stream contains spoken language (VOIP)
  • Only the raw stream is transmitted (no meta-data or container)
  • Latency needs to be low. Therefore, the stream is sent via UDP.
  • The stream needs to be decryptable even if a packet goes missing (because of UDP).
  • I can't use TLS, because the SoC on the embedded hardware only supports TLS 1.0, which seems insecure.

For ease of implementation, and because packets can get lost, I would like to use AES-256 in ECB mode. I know that ECB can leak information and is prone to pattern detection. But is that still a concern with audio data?

As an alternative, I could also use CTR mode and transmit the counter along with each packet (unencrypted). Would that actually gain me anything?
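To make the CTR idea concrete, here is a minimal sketch of a packet layout where the counter travels in the clear and every packet decrypts on its own. Everything in it is an assumption for illustration: the names `seal` and `open_packet`, the 8-byte sequence number, and above all the keystream, which uses HMAC-SHA256 from the standard library as a stand-in so the example runs without extra packages. A real design would use AES-CTR from a vetted library.

```python
import hashlib
import hmac
import struct

SEQ_LEN = 8  # bytes used for the cleartext sequence number (assumed layout)


def keystream(key: bytes, seq: int, length: int) -> bytes:
    """Stand-in keystream: HMAC-SHA256(key, seq || block_index).

    NOT AES-CTR. It only mimics the structure: the keystream depends on
    the key and on a per-packet counter, never on earlier packets.
    """
    out = bytearray()
    block_index = 0
    while len(out) < length:
        msg = struct.pack(">QI", seq, block_index)
        out += hmac.new(key, msg, hashlib.sha256).digest()
        block_index += 1
    return bytes(out[:length])


def seal(key: bytes, seq: int, payload: bytes) -> bytes:
    """Prefix the cleartext sequence number, then XOR payload with keystream."""
    ks = keystream(key, seq, len(payload))
    body = bytes(p ^ k for p, k in zip(payload, ks))
    return struct.pack(">Q", seq) + body


def open_packet(key: bytes, packet: bytes):
    """Recover (seq, payload) from a single packet, independent of all others."""
    (seq,) = struct.unpack(">Q", packet[:SEQ_LEN])
    body = packet[SEQ_LEN:]
    ks = keystream(key, seq, len(body))
    return seq, bytes(c ^ k for c, k in zip(body, ks))


if __name__ == "__main__":
    key = b"\x01" * 32
    frames = [b"frame-A", b"frame-B", b"frame-C"]
    packets = [seal(key, i, f) for i, f in enumerate(frames)]
    del packets[1]  # simulate a packet lost in transit
    for p in packets:
        print(open_packet(key, p))
```

Two caveats apply to any scheme of this shape: a sequence number must never be reused under the same key, and the sketch provides no integrity protection, so an attacker could flip bits in transit undetected.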

Lukas Knuth

1 Answer


I would be very scared about using ECB with audio (or with anything, really). Note, for example, that attacks have been demonstrated on encrypted audio compressed with variable bit rate codecs:

Despite the rapid adoption of Voice over IP (VoIP), its security implications are not yet fully understood. Since VoIP calls may traverse untrusted networks, packets should be encrypted to ensure confidentiality. However, we show that when the audio is encoded using variable bit rate codecs, the lengths of encrypted VoIP packets can be used to identify the phrases spoken within a call. Our results indicate that a passive observer can identify phrases from a standard speech corpus within encrypted calls with an average accuracy of 50%, and with accuracy greater than 90% for some phrases. Clearly, such an attack calls into question the efficacy of current VoIP encryption standards. In addition, we examine the impact of various features of the underlying audio on our performance and discuss methods for mitigation.

So right off the bat I have to worry that techniques similar to that paper could perhaps be adapted to ECB. Note for example (section 2):

Generally speaking, the codec takes as input the audio stream from the user, which is typically sampled at either 8000 or 16000 samples per second (Hz). At some fixed interval, the codec takes the $n$ most recent samples from the input, and compresses them into a packet for efficient transmission across the network.

So even if we used constant-length packets, if each one is compressed separately and then encrypted in ECB mode, I would worry that ECB might interact poorly with the packetization described in this quote. Would somebody with a good speech model and knowledge of the codec be able to use ECB ciphertext block frequencies to decode the speech? After all, speech has a lot of phonemic, phonotactic and metrical redundancy and apparent limits on its information rate, enough that compressed packet lengths alone let the paper's authors identify phrases with an average accuracy of 50%, and over 90% for some phrases.

One simple question that could serve as a starting point is this: could a cryptanalyst reliably identify which ECB blocks encode silence, or is the audio's noise floor high enough to defeat this?
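As a rough way to experiment with that question, the sketch below counts repeated 16-byte ciphertext blocks, which is all a passive observer needs to do to spot perfectly silent stretches under ECB. The function names are hypothetical, and `toy_ecb_encrypt` is deliberately not AES: it is a keyed hash from the standard library that only reproduces the relevant ECB property (equal plaintext blocks give equal ciphertext blocks under one key), so the example runs without third-party packages.

```python
import hashlib
import random
from collections import Counter

BLOCK = 16  # AES block size in bytes


def toy_ecb_encrypt(key: bytes, plaintext: bytes) -> bytes:
    """Stand-in for AES-ECB. NOT a real cipher (it is not even invertible).

    Each 16-byte block is mapped independently and deterministically
    under the key, which is the only property this demo relies on.
    """
    assert len(plaintext) % BLOCK == 0
    out = bytearray()
    for i in range(0, len(plaintext), BLOCK):
        block = plaintext[i:i + BLOCK]
        out += hashlib.blake2b(block, key=key, digest_size=BLOCK).digest()
    return bytes(out)


def most_common_block(ciphertext: bytes):
    """Return (block, count) for the most frequent ciphertext block."""
    blocks = [ciphertext[i:i + BLOCK] for i in range(0, len(ciphertext), BLOCK)]
    return Counter(blocks).most_common(1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    key = b"k" * 32
    # Random bytes as a crude placeholder for encoded speech.
    speech = bytes(rng.getrandbits(8) for _ in range(BLOCK * 20))
    digital_silence = bytes(BLOCK * 30)  # all zeros, no noise floor
    # Silence with a small amount of noise in every sample.
    noisy_silence = bytes(rng.choice((0, 1, 255)) for _ in range(BLOCK * 30))

    for label, audio in (("digital", digital_silence), ("noisy", noisy_silence)):
        ct = toy_ecb_encrypt(key, speech + audio)
        _, count = most_common_block(ct)
        print(label, "silence: most common block repeats", count, "times")
```

With exact digital silence the repeated block stands out immediately, while even slight noise in every block removes the exact repeats. Whether real codec output behaves like the first case or the second is exactly what would need to be measured.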

In any case, I wouldn't go my own way on this topic. There seems to be a body of research on VoIP encryption, so I'd track that down and read it.

Luis Casillas