30

Obviously, if you need a random number for cryptography, your code should use an API that gets it from hardware. However, not all hardware has a secure RNG available. If you are working on a security-critical application and a hardware RNG is not available, then the appropriate course of action is to abort execution.

BUT, not every application would be deemed security-critical. That is to say, while you want security, you are not worried about overly complex attacks because: 1. your application is not worth the effort of being cracked by sophisticated hackers, and 2. none of your data is considered strictly confidential (you don't want it breached, but it would not be the end of the world).

Assuming a non-security-critical application, running on a machine without a hardware RNG or ECC memory, would it be viable to allocate a very large amount of memory (perhaps in a long loop) and use the errors that eventually occur as a source of randomness? Would this be more secure than a PRNG?

TheCatWhisperer

8 Answers

50

(..) would it be viable to allocate a very large amount of memory (perhaps in a long loop) and use the errors that eventually occur as a source of randomness?

No. Practical use of memory errors as a source of randomness would require manipulation of DRAM timing parameters, as envisioned in this answer and its reference.

The rate of memory errors in a correctly working computer at ground level is very low; anything more than one error detected per week (as reported by a machine with ECC logging) is reason for maintenance [#]. Hence the proposed method requires either a terminally impractical delay to gather any entropy, or an unreliable computer; and on a well-built machine, errors mostly happen at altitude (especially in the South Atlantic Anomaly). Plus, if the machine has ECC, errors are hard to detect portably (reporting of ECC errors tends to be a proprietary mess); and if it does not, and errors are sizably frequent, there is reason to fear one will eventually prevent the program from running as intended.

More practical sources of entropy include

  • Audio or video input; more generally output of an ADC.
  • Time of events measured to high accuracy relative to the CPU clock (e.g. by way of a performance counter, as obtained by the RDTSC instruction or an API wrapping it). Sources include:
    • key presses (value gives extra entropy)
    • mouse/pointing device movement (position gives extra entropy)
    • arrival of network packets
    • availability of data from a spinning hard disk
    • change of second in a real-time clock IC with independent crystal (low entropy rate, but quite reliably entropic)

Extracting the entropy is relatively easy: essentially, hash anything entropic you can get. The one extremely difficult thing is asserting a reliable lower bound on how much entropy was really gathered. Be extremely conservative, and don't assume that something that holds now will keep holding. Things change unexpectedly, especially when there's an active adversary, as we assume in crypto.
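For illustration, a minimal Python sketch of the "hash anything entropic you can get" part. The event hook and names are hypothetical; a real design would also keep a very conservative running estimate of entropy credited before allowing any output:

```python
import hashlib
import time

pool = hashlib.sha256()  # the entropy pool is just a running hash

def credit_event(payload: bytes = b"") -> None:
    """Fold one entropic event into the pool.

    Call this on key presses, packet arrivals, disk completions, etc.
    Only the jitter in the low bits of the timestamp is really entropic.
    """
    pool.update(time.perf_counter_ns().to_bytes(8, "big"))
    pool.update(payload)  # e.g. key code or mouse position, for extra entropy

def draw_seed() -> bytes:
    """Snapshot the pool as a 256-bit seed; keep feeding the original."""
    return pool.copy().digest()
```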


[#] Reliability studies on DRAM give widely varying estimates. However, we are talking about something like 100 FIT per DRAM IC, within a factor of 100, where a Failure In Time (FIT) is one expected failure per $10^9$ device-hours. Taking the upper extreme of $10^4$ FIT, and a computer with 4 DIMMs each with 18 DRAM ICs (enough for 32 GiB with ECC using DDR4, mainstream by late 2017), we get about 0.12 expected failures per week.
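Sanity-checking that arithmetic with a throwaway Python calculation:

```python
fit_per_ic = 1e4        # pessimistic upper extreme: 10^4 failures / 10^9 device-hours
ics = 4 * 18            # 4 DIMMs x 18 DRAM ICs each
hours_per_week = 7 * 24

print(fit_per_ic * ics / 1e9 * hours_per_week)  # ~0.12 expected failures/week
```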

My limited field experience with servers is that when there are ECC alarms more often than once per week, there's a cause: most often a mismatch in memory specs or settings (for new machines only), or a particular bit in one particular IC of a particular DIMM that's marginal (or worn out, for machines that have been humming quietly for some time). The error log's main virtue is to help identify such a condition, and which DIMM to replace rather than scrapping the whole lot.

There are techniques to purposely increase the rate of DRAM errors, including lowering the refresh rate much below manufacturer specs and/or using a RowHammer read/write pattern. However, that's extremely dependent on the DRAM model. Doing this reliably, portably and within an OS is very difficult; there's a reason MemTest wants to own as much of the machine as feasible.

fgrieu
11

It is hard to imagine that you could afford enormous quantities of RAM for your system, but can't afford a hardware RNG, such as a cheap model costing 0.25 USD (a quarter you can flip), or down to 0.01 USD at the cost of some ergonomics (a penny), available in internationalized varieties in other countries too. You might even have one in your pocket.

But let's suppose you used ECC RAM which reported to your operating system the count of error events, which in turn you could query easily from your operating system. Consider a Poisson-distributed count of error events at a rate of one per week, as @fgrieu suggested as an upper bound on the rate before you throw it all out and buy new RAM. What is the entropy you attain from the count of errors over the course of a week? The probability of a count $k$ at rate $\lambda$ is $$\operatorname{Poisson}(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}.$$ There's no really nice closed form expression for the entropy of a Poisson-distributed variable, but we can use the series $$\lambda \cdot (1 - \log \lambda) + e^{-\lambda} \sum_{k = 0}^\infty \frac{\lambda^k \log k!}{k!}.$$ If we fix $\lambda = 1$, meaning we measure for a week, this simplifies to $$1 - \log 1 + e^{-1} \sum_{k = 0}^\infty \frac{\log k!}{k!} \approx 1.3\;\mathrm{nat} \approx 1.9\;\mathrm{bit}$$ if I did my calculation correctly.

That is, you get just under two bits of entropy by waiting around twiddling your thumbs for a week—thumbs you could have used to flip coins instead during a much shorter time for much greater entropy. And that's Shannon entropy, not min-entropy like cryptographers use! The min-entropy is exactly $1\;\mathrm{nat} \approx 1.4\;\mathrm{bit}$.
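A quick numerical check of the two figures above (Python, truncating the series once its terms vanish):

```python
import math

lam = 1.0  # one expected error per week

# Shannon entropy of Poisson(lam), in nats, via the series above
shannon = lam * (1 - math.log(lam)) + math.exp(-lam) * sum(
    lam**k * math.lgamma(k + 1) / math.factorial(k) for k in range(50)
)

# Min-entropy: -log of the most likely count (k = 0 and k = 1 tie at e^-1)
p_max = max(lam**k * math.exp(-lam) / math.factorial(k) for k in range(50))
min_entropy = -math.log(p_max)

print(shannon, shannon / math.log(2))          # ~1.30 nat, ~1.88 bit
print(min_entropy, min_entropy / math.log(2))  # 1.00 nat, ~1.44 bit
```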

Exercise for the reader: Suppose you could measure the inter-arrival times at some clock resolution (say, days, hours, minutes, seconds), rather than the count at the end. Maybe you get a hardware interrupt on each event, or you just poll every minute/second/etc. What is the (Shannon, min-) entropy in that case?

Squeamish Ossifrage
10

I'm assuming that you actually mean memory errors and not the initial power-up state.

It wouldn't work one iota. There are no memory errors. I say again, there are no memory errors. Modern RAM is pretty good. Just like microprocessors, it either works or it doesn't. I've run MemTest86 loads of times. There are three scenarios I've found:

  1. Memory passes with zero failures after running all night.
  2. Memory fails with loads of errors. Remove the memory cards, polish/sand the connectors with sandpaper and a screwdriver or file, and retest. Contact oxidation is a PITA. Go to 1 virtually every time.
  3. Memory fails with a smidgen of errors. Chuck chip away and/or get a refund.

Scenario 3 happens very rarely. That's why most computers (statistically speaking) run non-ECC chips. Intel doesn't even allow ECC on consumer parts, and most desktops run okay. Your phone, laptop or tablet won't have it either, and they work most of the time, don't they? I have non-ECC servers with 365+ days of uptime. I don't think that a long loop would find any errors. It also makes AMD's support for ECC rather questionable as a competitive strategy.

The following graph is an extract from a paper titled Memory Errors in Modern Systems. It's only one paper from many, so make of it what you will:-

FIT graph

In summary, the graph shows a monthly error rate for largish computers. You can see that per DRAM chip, the error rate is small. In the first month they measured a maximum of about 30 FIT per chip, and that rate decreases with time. A single FIT is a single failure in one billion hours of a single device's operation; it is not 30 failures per day. So it only really matters in the exascale machines they analysed. As your phone proves.

The upshot is that you cannot rely on errors as an entropy source, as there isn't a statistically significant amount of them. And if you do find some, that's a fault and can't be relied upon, so you can't release code that exploits them for entropy distillation. You would have no indication of which machine had errors or how many there were. It seems from Revisiting Memory Errors in Large-Scale Production Data Centers: Analysis and Modeling of New Trends from the Field that it's very difficult to identify a machine that might be suitable for DRAM entropy extraction. They studied the entire Facebook server estate for over a year. Their findings indicate that:-

  1. Errors are heavily concentrated on specific individual computers: 1% of servers were responsible for 97.8% of the errors. Fix those servers, and you have virtually no entropy. On the flip side, there's a 99% probability that a server picked at random produces virtually no errors.

  2. It depends on what the machine is doing specifically. Workload type can influence server failure rate by up to 6.5 times.

  3. The majority of errors (unspecified percentage) are actually caused by the memory controller and not the DRAM cells.

  4. A final good result from the hypothesis' perspective is that there is a clear correlation between error rate and cell and CPU density. We can all look forward to much less reliable memory (and more entropy) in the future.

Finally, as an exercise, how would you actually use memory errors? It's not really possible. MemTest can find them because it's a tiny application running exclusively on the machine, so coming across a randomly located error doesn't crash it. I guess that errors could happen within its own program space, but you'd never know, as it would just crash and not report anything. On a multi-tasking operating system with virtual machines and massive IO, the whole thing might just crash if significant errors occurred. You can't arrange for memory errors to occur only in convenient and accessible areas of the memory map.

There are better ways to get entropy from a computer.
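For instance, every mainstream OS already collects and conditions exactly these kinds of interrupt-timing sources in its kernel CSPRNG, so a sketch like the following (Python, as an illustration) is usually all an application needs:

```python
import os
import secrets

key = os.urandom(32)             # kernel CSPRNG, fed by interrupt timings etc.
token = secrets.token_bytes(32)  # same underlying source, stdlib convenience
```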

Paul Uszak
6

Not from memory errors, no.

The other answers cover this in detail.

But there's at least one paper that discusses the idea of using the SRAM memory contents of RFID chips just after power-on to obtain both a fingerprint of the device and some random bits.

In summary, SRAM cells are symmetric structures of six transistors that are stable in one of two configurations, 0 or 1. At power-on, the two sides of the cell compete for dominance. Which one wins depends on small intrinsic differences between the transistors, and on noise.

In some cells, these manufacturing differences dominate, and the cell reliably powers up as 0 or as 1. These can be used to derive a fingerprint.

In other cells, these differences are so small that noise dominates the outcome, and these cells may power up as either value. These can be used to obtain some entropy.

This is an image from the paper. On the left, the 0/1 skew drawn as a diagram; on the right, a visual representation of a memory area. Black cells are bits that reliably initialize to 0; white cells reliably initialize to 1; and gray cells may initialize to either state:

diagram from the paper

The paper is Initial SRAM State as a Fingerprint and Source of True Random Numbers for RFID Tags, by Daniel E. Holcomb, Wayne P. Burleson, and Kevin Fu.

It focuses on RFID tags (which often lack the hardware for a proper RNG), but I imagine it should be viable on many other platforms. It targets SRAM because of its symmetrical structure; I'm not sure whether the technique generalizes to DRAM or other memory types, but many computing devices contain some SRAM somewhere that could be used for this (e.g. processor caches).
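To make the idea concrete, here is a toy Python sketch of the extraction step. Capturing real power-up SRAM images is platform-specific, so the dumps are simulated here; all names are illustrative, not from the paper:

```python
import hashlib
import random  # used ONLY to simulate power-up dumps for this demo

SIZE = 1024  # bytes of SRAM under study

def simulate_powerup_dump(stable: bytes, noisy_mask: int) -> bytes:
    """Fake power-up image: stable cells keep their value, noisy cells flip."""
    noise = random.getrandbits(SIZE * 8) & noisy_mask
    value = (int.from_bytes(stable, "big") & ~noisy_mask) | noise
    return value.to_bytes(SIZE, "big")

def find_noisy_cells(dumps: list[bytes]) -> int:
    """Bitmask of the 'gray' cells: those that differed across power-ups."""
    base = int.from_bytes(dumps[0], "big")
    mask = 0
    for d in dumps[1:]:
        mask |= base ^ int.from_bytes(d, "big")
    return mask

def extract_seed(fresh: bytes, noisy_mask: int) -> bytes:
    """Condense the noisy cells of one fresh power-up into a 256-bit seed."""
    bits = int.from_bytes(fresh, "big") & noisy_mask
    return hashlib.sha256(bits.to_bytes(SIZE, "big")).digest()

# Demo: fixed fingerprint bits plus ~12% genuinely noisy cells.
stable = random.randbytes(SIZE)
noisy = (random.getrandbits(SIZE * 8) & random.getrandbits(SIZE * 8)
         & random.getrandbits(SIZE * 8))
dumps = [simulate_powerup_dump(stable, noisy) for _ in range(16)]
seed = extract_seed(simulate_powerup_dump(stable, noisy), find_noisy_cells(dumps))
```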

marcelm
4

No; as already indicated, waiting for memory errors is not a good idea. First of all, you would rely very much on the specific memory type: if somebody were to replace the memory with Error-Checking & Correcting (ECC) memory, you would already run into problems. The number of memory errors would be minimal to non-existent on modern computers, and allocating and scanning the entire memory would consume a lot of CPU time and memory bandwidth. The number of memory errors may also be environmentally sensitive; an attacker could, for instance, lower the temperature to try to minimize the entropy generated by memory errors.

What is sometimes done is to allocate memory and read it out without wiping it first, recovering whatever information might be present in it. This needs to be an OS function, because the OS could very well wipe the memory itself. It is usually only done to add a bit of entropy to the RNG, not as the main source. Furthermore, it has proven dangerous: the Debian and Ubuntu Linux distributions were immediately affected when a static code analysis tool flagged the reading of uninitialized memory and a developer removed this source of entropy, unfortunately along with all the other entropy sources.


Using small timing differences (at nanosecond resolution, if a good timer is available) seems to be one of the best ways to collect entropy if a good true random number generator isn't available. It still has some problems when running in a headless environment or a VM, though. And with SSDs becoming prevalent, using the timing characteristics of spinning hard disks as a randomness source is not such a good idea anymore.

Maarten Bodewes
4

It has already been pointed out that memory errors are too infrequent to be useful. There's another problem, though: memory errors due to imperfections in the chip (as opposed to errors from being in the path of a cosmic ray) aren't very random. They're the result of data bleeding somewhere. Your keys will show a very non-random distribution, and anyone attacking them will probably know that and prioritize which keys to try first.

Besides, if you need randomness in a situation that isn't high-security, bang on the keyboard or wiggle the mouse.

2

Yes, this is possible; see for example this paper from ETH Zurich and Carnegie Mellon University:

We propose D-RaNGe, a mechanism for extracting true random numbers with high throughput from unmodified commodity DRAM devices on any system that allows manipulation of DRAM timing parameters in the memory controller. D-RaNGe harvests fully non-deterministic random numbers from DRAM row activation failures, which are bit errors induced by intentionally accessing DRAM with lower latency than required for correct row activation.

It can provide sustained throughput on the order of megabytes per second of randomness, even hundreds of MB/s if a large amount of RAM is used (the amount used can be changed on the fly, and it doesn't interfere with the rest of the RAM).

Nobody
1

I'm not sure about memory errors, but I did successfully write a shuffling algorithm using the unknown execution order of parallel computing. I would move cards (ints) from one deck (list) to another by simultaneously passing every card to a different thread and having all the threads 'race' to reinsert them into another deck. When performed repeatedly for a few seconds, the deck seems completely shuffled, and I never got the same result twice even though I was running the same code over and over on the same computer in the same environment. But I guess whether it's random enough depends heavily on the environment it's executing in.
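A minimal Python sketch of the idea (with the caveat spelled out: thread scheduling varies run to run, but it is nowhere near uniform or unpredictable enough for cryptographic use):

```python
import threading

def race_shuffle(deck: list[int]) -> list[int]:
    """Let one thread per card race to append it to the output deck.

    The resulting order depends on OS scheduling: it varies between runs,
    but it is heavily biased toward the starting order, so it is NOT a
    substitute for a proper shuffle driven by a CSPRNG.
    """
    out: list[int] = []
    lock = threading.Lock()

    def insert(card: int) -> None:
        with lock:
            out.append(card)

    threads = [threading.Thread(target=insert, args=(c,)) for c in deck]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out

deck = list(range(52))
for _ in range(100):  # repeat to accumulate more scheduling jitter
    deck = race_shuffle(deck)
print(deck)  # order differs run to run, but with heavy biases
```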

Kevin