Encrypting off Processor Memory


Mike Frantzen Brian Koehl
frantzen@cerias.purdue.edu koehlb@ecn.purdue.edu



1 Abstract

Industry's wide deployment of standard microprocessors has brought unparalleled features to the consumer but has also served to place trade secrets and other valuable vender data at the consumer's fingertips. We propose to protect the security of the data in a physically insecure environment by encrypting all memory accesses before they leave the security of the chip. We researched the optimum demarcation point between the processor and the RAM for where to place the encryption layer will be determined via SimpleScalar simulation. Various encryption algorithms will be surveyed for use in the encryption layer.



2 Justification

The electronics industry has undertaken the proliferation of conventional microprocessors and microcontrollers for use in mainstream products. In so doing, the industry has placed their trade secrets within reach of even an electronics savvy high schooler.

2.1 Trade Secrets

Devices such as DVD players must protect their trade secrets to satisfy contractual requirements. Each DVD player has a decryption key and each DVD has approximately 400 encrypted versions of a single key. The DVD player decrypts one of the 400 versions of the key on the DVD and uses that key to decrypt the actual DVD data. The recent deCSS fiasco involved a software vender who did not adequately protect their key; their key was stolen, thus voiding any security it offered. But long before there was a software attack on the DVD encryption, people used in-circuit emulators (ICEs) to monitor the algorithms the hardware DVD players used. The DVD production houses claim that reverse engineering and illegal decoding and the pirating of their content will severely endanger their revenue. DVD decoding capable processors which also implemented physical security measures would potentially have market penetration in the millions.

2.2 Sensitive Data (or "Who wants to be a millionaire?")

There are situations where sensitive data must be placed in an environment where a hostile attacker may obtain physical access to the microprocessor and its peripherals. Take the Standard & Poor's (S&P) stock exchange for example. They provide their ComStock service to customers to generate stock quotes and news off a real-time feed. Their service consists of an dedicated network circuit between them and their customer and a ComStock supplied linux based appliance physically located at the customer's site. In March of 2000, Kevin Kadow discovered a series of vulnerabilities in the S&P appliance. If fully exploited, the vulnerabilities could have allowed an attacker to: alter real-time Nasdaq and Amex prices and Level-II market data; alter published interest rates; alter equity fund data; alter earnings and balance sheet information; publish phony news stories and change published dividend rates. [19] [20] [32]

Kevin Kadow stipulates that the S&P vulnerability was not explicitly due to breaches in hardware physical security. But the methods S&P used to remedy the problem only alleviate software attacks. An attack on the physical hardware -- as simple as a screwdriver and a new bootdisk -- still have the potential to do damage tabulated in the billions of dollars.

2.3 Conventional Protection

There a few conventional methods of protecting a physically insecure electronic device. Historically, the electronics industry has relied on what the security industry calls "security through obscurity" -- protected only because no one has sat around long enough to figure it out. The electronics combined many simple off the shelf parts in such a way that it takes a great deal of time, knowledge and skill to discern the technology behind the device. The other classic technique for obscuring the technology in the device is to implement custom chips to operate the device.

It is often more economically viable to utilize generic programmable components than to undergo a lengthly hardware design and verification project. The easiest method to "protect" hardware is the legal method. Lobby the local legislative body to enact legal impediments to any reverse engineering or publication of results or techniques. The beginning of such protection has been enacted in the United State via the Digital Millennium Copyright Act (DMCA) [11]. Signatories of the World Intellectual Property Organization's (WIPO) Performance and Phonograms Treaty[35] may be enacting similar protective laws.

2.4 Secure Coprocessors

Given the lack of technical merits of protecting technology with an army of lawyers, the designers of IBM's 4758 secure coprocessor chose the methods of "tamper detection" and "tamper response" [16][29][30][31] (IBM's 4758 was the first device to achieve FIPS 140-1 [22] level 4 validation -- the most secure standard for cryptographic modules) They enclosed a custom processor, some battery backed RAM and some volatile RAM inside a tamper resistant casing. Their tamper resistance relies on the ability to guarantee the detection of a foreign penetration into the casing. Upon detection, the processor zeroes out all secrets and enters a response mode. They use a technique called "memory crowbarring"; it consists of rapidly shorting the power pin to ground. The zeroing scheme must guarantee that all information and secrets were destroyed. It must be resilient against techniques like: a low temperature which will cause DRAM to retain its value even if the power line has been shorted to ground; the presence of ionizing radiation which will cause DRAM not to lose data even after being crowbarred; or postmortem analysis showing that the same value was stored in the DRAM for long periods of time causing an imprinting phenomenon. Further electronic attack techniques are far outside the authors' experience; see reference [1] for more information. Needless to say, these devices are complex.

Naomaru Itoi has integrated IBM's 4758 Secure Coprocessor into a Kerberos Key Distribution Server (KDC) [18]. In a secure distributed environment, if the central authority is compromised then all of the users' keys are exposed. By moving the keys onto the secure coprocessor, Naomaru was able to insure that the entire distributed system's security did not hinge on a single untrusted component.

Bennet Yee did his thesis work on using Secure Coprocessors [36]. He proposed several novel concepts on guaranteeing the privacy of cryptographic keys and the integrity of the cryptographic functions, security kernel and access control databases on machines. He based his work on the assumption that physical security could not be assured.



3 Previous Research

3.1 Encrypted File Systems

Other than IBM's secure coprocessor, the previous work pertinent to this research center on security non-volatile storage and on cryptographic network transactions. Matt Blaze pioneered encrypted Unix filesystems with his now infamous Cryptographic File System (CFS) [4][5]. It kickstarted the research into encrypting nonvolatile filesystems.

3.2 Encrypted Swap

Niels Provos of the OpenBSD project [8] has extended the concept from encrypting filesystems to encrypting a virtual memory system's backing store [24]. While cryptographic filesystems can protect confidential data from unauthorized access, the cleartext (aka unencrypted) data may swapped to non-volatile backing store after authentication; unbeknownst to the user. Niel's elegant solution was to keep a volatile key store and encrypt/decrypt pages as they are spilled to/from the backing store. When implemented into the OpenBSD 2.6 operating system, Niels measured a 0.44ms performance penalty per page encrypted on an Intel Celeron 333 which was deemed an acceptable compromise.

3.3 Hardware Encryption

There has been extensive prior research into hardware accelerated encryption for both general purpose use and specifically for cryptographic network transactions. Much of the research has revolved around using programmable logic devices to give hardware figures. Ian Goldberg and David Wagner [13] examined hardware cryptanalysis of breaking the RC4, A5, DES and CDMF encryption algorithms. There are HDL models flying around of many encryption algorithms and almost a paper per University with a timing analysis for that encryption algorithm.

3.4 Secure Coprocessors

Previous secure coprocessors have included IBM's 4758 [16][29], the uABYSS[33][34], the Citadel [23], and the Luna-340 [15]. There appears to be a blossoming commercial market for secure co-processors. The following is our research into moving the security from a seperate co-processor on a PCI card into into a real processor. We believe the following system will allow the manufacturers to place all of the tamper proofing around just the core processor and construct the rest of the computer system out of the shell parts. It would even be feasible to make the encased processor compatible with a common architecture and allow a consumer to make their system resilient to physical attacks by dropping in a secure processor and enacting some Operating System hooks.



4 Our proposal

We propose to add an encryption layer somewhere between the processor and main memory. By encrypting everything after an arbitrary demarcation point in the memory hierarchy, architects can place everything before the demarc point in a tamperproof casing and use contemporary hardware external to the demarcation point. This will allow two things. First it will allow reusing standard RAM technology and cache technology outside the demarc. It also takes a bulk of hardware outside the large tamperproof casing used by prior attempts at secure coprocessors. Both serve to reduce the cost of the secure device. The second serves to minimize the use of tamperproof casing, who's effectiveness increases with the inverse of its size.

A word sized encryption layer may be added between the microprocessor and the L1 cache. It would encrypt everything before it leaves the layer for the cache and decrypt it anything returns to the processor. Alternately, an encryption layer could be added between the L1 and L2 caches. It would encrypt cache blocks as the ascended from the L1 cache to the L2 and it would decrypt cache blocks as they descended from L2 to the L1 cache. Or the crypto layer could be placed between the L2 cache and main memory. It would encrypt cache blocks before being sent on to the memory controller and decrypt them before they came back in and filled a cache line.



5 Review of Encryption Technologies

5.1 Public Key Cryptography

Before selecting a cryptographic algorithm, we must first trim the field down to a particular algorithm family. The first family we must reject is the Public Key family of algorithms. First proposed by Whitfield Diffie and Martin Hellman [9][10], their seminal work found that keys could come in pairs -- an encryption key and a decryption key and that it could be intractable to derive one from the other. The caveats are that Public Key algorithms are extremely slow, the ciphertext may be significantly larger than the plaintext, the keys may be intangibly large and that they have not been mathematically proven to be secure [28]. Unfortunately, a Public Key cryptosystem would be of minimal use in uniprocessor environment. It could be useful for interprocessor communication in a multiprocessor system or if bus data was to be encrypted between processor and peripherals.

5.2 Probabilistic Cryptography

The second encryption technology we must reject is Probabilistic Encryption. Invented by Shafi Goldwasser and Silvio Micali [14]. In theory, it is the most secure crypto system ever invented but many implementations have been far from practical [28]. Probabilistic encrypt attempts to eliminate all information leaked by conventional public key algorithms. In public key cryptography, if a cryptanalyst can encrypt a message with a known public key, he can compare the ciphertext against a stolen ciphertext he is attempting to decrypt. It the two ciphertexts match, he knows he has the plaintext. Public Key may also leak correlations between bits in the ciphertext. But with probabilistic encryption, encrypting the same plaintext twice, will yield different ciphertexts. Thus the cryptanalyst could not prove he had the plaintext message without the secret decryption key. All possible plaintext may yield the same ciphertext depending on the key used. The disadvantage to probabilistic encryption is that the ciphertext will be larger that the plaintext -- much larger in some algorithms. We must reject this method for use in an encrypted memory system since it too difficult to sell a system where one gig plus one gig does not yield two gigs.

5.3 Stream Ciphers

Stream ciphers are the next cipher family we will review. They convert plaintext to ciphertext one bit at a time. Their operation can be modeled as a keystream generator. They generate a stream of pseudo-random bits and XOR them with the bits of the cleartext. To decrypt the ciphertext, the same pseudo-random bits are XORed to the ciphertext. These can be made extremely fast by using multiple keystream generators in parallel, each responsible for a percentage of the bits. Unfortunately, they are extremely insecure for the application of encrypting cache line sized chunks of data. In most operating systems, a page of memory is zeroed before being mapped into a memory space. An attacker can watch for the first traffic to an address and make the assumption that it is being zeroed. For a simple stream cipher, the keystream that will correspond to that address will simply be the first traffic written to it. XORing that keystream against the next block written to that address will yield the new plaintext. Therefore we must conclude that stream ciphers will make a very poor candidate for use encrypting memory blocks.

5.4 Block Ciphers

The last remaining encryption family to examine is that of the block cipher. We will emulate Bruce Schneier [28] and use Ranier Rueppel's quantification of the difference between stream ciphers and block ciphers:

Block ciphers operate on data with a fixed transformation on large blocks of plaintext data; stream ciphers operate with a time-varying transformation on individual plaintext digits. [26]
Or put more simply, block ciphers typically encrypt data block by block. The most obvious method to employ block ciphers is what is called Electronic Codebook mode (or ECB). The same block of plaintext encrypts to the same ciphertext if the same key is used. One of the problem with this mode is that if a cryptanalyst knows which cleartext block a ciphertext block corresponds to, he will know that all repeated occurances of that ciphertext block correspond to that plaintext block. The cryptanalyst may also replace a ciphertext block with another whose plaintext he knows in an attempt to subvert the correct operation of the machine.

Block ciphers may be strengthened by a chaining or feedback scheme between blocks. Use the previous block to affect the outcome of the encryption of the current block. This assures that if an attacker knows the correlation between a particular ciphertext block and a plaintext block, he will not be able to directly apply that knowledge to other blocks. Unfortunately, using a feedback mode cipher would totally negate the entire purpose of a cache -- it would require that prior to encrypting a block and sending it down to the next layer in the memory hierarchy, the prior encrypted block would have to be fetched.

5.5 Cipher Choice

We are left with no ideal candidate. Each family presents significant flaws, either security or architectural. Despite misgivings, we must narrow the research into the encryption algorithm down to Block Ciphers in electronic codebook mode (ECB). It is the best compromise between security and speed. It does not have any glaring security problems that we cannot work around unlike stream ciphers and it is orders of magnitude faster than either public key ciphers or probabilistic ciphers. There is also a large body of research pertaining to implementing block ciphers in hardware.



6 Selection of the Encryption Algorithm

We are concentrating our search on two block algorithms: IDEA and the AES winner Rijndael. IDEA is being reviewed because it has impressive hardware performance and a long trackrecord. Rijndael also has impressive hardware performance and underwent intense scrutiny during the course of its selection as the Advance Encryption Standard (AES) replacement for the DES algorithm.

6.1 IDEA

6.1.1 IDEA Algorithm Description

The IDEA (International Data Encryption Algorithm) algorithm is the brainchild of Xuejia Lai and James Massey. The algorithm consists solely of three arithmetic operations: XOR, Addition modulo 2^16 and Multiplication modulo 2^16 + 1. These operations implement the design philosophy of mixing operations from different algebraic groups [28]. Each of these operations only work on 16bit blocks.

The algorithm as a whole works on 64bit data blocks with a key length of 128 bits. It splits the 64bit block into four 16bit blocks as the initial input of the algorithm. And the key is broken into eight 16bit subkeys. The algorithm is composed of eight rounds each with the same 14 stages per round. And the second and third subblocks are swapped between rounds. The algorithm below is the 14 stages per round and is adapted from Bruce Schneier's "Applied Cryptography":

  1. Multiply SubBlock1 and the SubKey1 modulo 2^16 + 1
  2. Add SubBlock2 and SubKey2 modulo 2^16
  3. Add SubBlock3 and SubKey3 modulo 2^16
  4. Multiply SubBlock4 and the SubKey4 modulo 2^16 + 1
  5. XOR the results of steps (1) and (3)
  6. XOR the results of steps (2) and (4)
  7. Multiply the results of step (5) and the SubKey5 modulo 2^16 + 1
  8. ADD the results of steps (6) and (7) modulo 2^16
  9. Multiply the results of step (8) and the SubKey6 modulo 2^16 + 1
  10. ADD the results of steps (7) and (9) modulo 2^16
  11. XOR the results of steps (1) and (9)
  12. XOR the results of steps (3) and (9)
  13. XOR the results of steps (2) and (10)
  14. XOR the results of steps (4) and (10)
The output of the round is the four sub-blocks results of steps (11), (12), (13), and (14).

After the final round, there is an output transformation to produce the four sub-blocks which combine into the subkey

  1. Multiply SubBlock1 and SubKey1 modulo 2^16 + 1
  2. ADD SubBlock2 and SubKey2 modulo 2^16
  3. ADD SubBlock3 and SubKey3 modulo 2^16
  4. Multiply SubBlock3 and SubKey3 modulo 2^16 + 1

Subkeys creation is also simple. It consists simply of dividing the 128bit key into 16 16bit subkeys. Use the first 6 for the first round of the algorithm. Rotate the 128bit key left by 25 bits and repeat to generate the subkeys for each additional round.

6.1.2 Performance of IDEA

There have been several hardware implementations of IDEA. Zimmerman et al achieved 177.8 Mbit/s at 25MHz on a 1.2um CMOS circuit [6]. The Ascom corporation has achieved 300 Mbit/s at 40MHz on a .25 micron process; at 100MHz, performance tops 720 Mbit/s [2] Mosanya et al. designed a fully pipelined implementation of IDEA and simulations suggested their could achieve a throughput of 1500 Mbit/s in an FPGA [21].

For IDEA simulations, we will assume the processor architects can spare enough silicon for several full IDEA pipelines of 59 stages each (or enough units to keep the pipeline from stalling during any of the 59 clock cycles). Expensive, but it should scale to high clock frequencies.

6.1.3 Advantages of IDEA

+ Minimal key setup time.
+ Many of the round operations can be combined into the same cycle to help minimize a pipelined implementation.
+ All the operations are extremely fast in hardware.
+ The entire algorithm is fast as far as encryption goes.
+ The same encryption hardware can be mostly reused for the decryption.
+ IDEA is a relatively well studied algorithm.
+ VERY firm theoretical footing.

6.1.4 Disadvantages of IDEA

- Fixed block size smaller than a cache block.
- Requires many registers to store intermediates between a rounds stages.
- The IDEA algorithm is patented and must be licensed for commercial use.

6.1.5 Special notes

Despite initial appearances, the size of IDEA's subblocks cannot be safely increased to 32bits to double the block size. The safety of the 16bit subblocks comes from the multiplication modulo 2^16 + 1 which happens to be prime; 2^32 + 1 is not prime.

To be used in encrypting cache sized blocks, the architecture designer would have to have multiple IDEA encryption/decryption pipelines in parallel to cover the entire size of the cache block. Example: if the cache block size is 32 bytes, there would need to be four IDEA pipelines in parallel.

6.2 Rijndael

6.2.1 Rijndael Algorithm Description

The Rijndael was designed by Joan Daemen and Vincent Rijmen as a candidate for the Advance Encryption Standard (AES) [7]. It is a block cipher with a variable key and block length, adjustable in multiples of 32 bits with 128bits the minimum specified; although, only the 128bit block size was included in the AES standard. It has a variable number of rounds depending on how large the block and key sizes are. There are 9 rounds if both are 128 bits long, 11 if either are 192 bits long or 13 rounds if either are 256 bits long

Rijndael was designed to use only simple whole byte operations. To encipher of block of data, you first perform the Add Round Key step (XOR a subkey with the block). Next the variable number of rounds are performed. [27]

Each round consists of four steps. The first is the Byte Sub, where each byte of the block is replaced by its correlary substitute in an S-box. The S-box is calculated by:

  1. Replace each byte with its reciprocal in the same Galois Field of 2^8, with the exception that 0 is replaced by itself.
  2. A bitwise modulo-two multiply
  3. XOR the result with 63h
The second step is the Shift Row step where each byte in the block is shuffled to a different location. See the Rijndael specification for the shuffling pattern since it is dependant on the blocksize.

The third step is the Mix Column step. Each column is multiplied by the matrix over the Galois Field(2^8):

2 3 1 1
1 2 3 1
1 1 2 3
3 1 1 2
But if the result byte has more than 8 bits, they are cancelled out by XORing the generating polynomial of the particular version of the Galois Field(2^8) used (the binary 9bit string 100011011) with the result. The final Rijndael round omits the Mix Column step.

The final step is the Add Round Key which simply XORs the current round's subkey.

6.2.2 Performance of Rijndael

The implementations of Rijndael are fairly new since the algorithm itself is fairly new. Vicent Rijmen, one of the co-authors of Rijndael, has written a paper on efficient implementation of Rijndael's S-box in [25] and several hardware implementations based off of it. AJ Elbirt et al. were able to achieve a throughput of 1937.9 Mbit/s at 31.8 MHz with a sub pipelined implementation in an FPGA [12]. The elaborate nature of the S-boxes makes fully pipelining Rijndael difficult at best and impossible with the limited resources of an FPGA. The National Security Agency (NSA) implemented all six of the AES candidates in ASICs in both iterative and pipelined versions. The NSA was able achieve a throughput of 5163 Mbit/s with their pipelined version of Rijndael and 520 Mbit/s with their iterative version [3] The time to encrypt one block was approximately 250ns.

6.2.3 Advantages of Rijndael

+ Fast.
+ Free.
+ Fixed 128bit block size.
+ Minimal encryption key setup time.
+ Variable block size (single pipeline can accommodate an entire cache block).
+ There has been alot of cryptanalysis of 128bit block sized algorithm.

6.2.4 Disadvantages of Rijndael

- Non-trivial key setup time for decrypt.
- Minimal cryptanalysis of other than 128bit block sized algorithm.
- Very difficult to fully pipeline due to the complex nature if its S-boxes.

6.2.5 Special Notes

none

6.3 Algorithm Choice

Once again we are left with a difficult choice. Rijndael's ability use variable block sizes makes it ideal to use to encrypt cache block sized chunks. Unfortunately Rijndael has two shortcomings we cannot forgive. First, it is very difficult to fully pipeline; difficult enough that we would worry about the correctness of a fully pipelined implementation. But the clincher is that Rijndael's variable block modes have not been cryptanalyzed nearly as well as IDEA or Rijndael's 128bit block algorithm. We will undoubtably be criticized for this, but we must conclude IDEA provides more security than Rijndael in any mode. This conclusion will not be valid in the near future when Rijndael's variable block modes have been better scrutinized and its ability for a single pipeline to encrypt a cache size block outweighs its difficulty in pipelining. Since the entire purpose of this research is to provide the foundation of a physically secure processor through encryption, sacrificing the security of the encryption in the least would be hypocritical.



7 Tailoring the machine

Several modifications must be made to the processor and operating system other than the encryption layer to support our model of a secure processor.

7.1 Key Generation

Prior to commencing any encrypting operations, the processor must first generate a random key. It cannot be stressed enough that the key MUST be random. Intel processors since the P6 core was introduced have come with a hardware RNG (Random Number Generator) [17]. It is composed of two free running oscillators which are given their randomness by the electromagnetic noise inside the processor. Choosing when and how to key the encryption algorithm is implementation dependant but appears to have minimal hardware cost.

7.2 Key Muddling

By choosing an encryption algorithm in Electronic Codebook Mode (ECB), we enabled a cryptanalytic attack that all identical plaintext blocks will have the same ciphertext. We propose to muddle the encryption input based on the physical address of the the block being encrypted. The address can either muddle the key used to encrypt/decrypt the block, it can be used to muddle the plaintext block or can muddle the ciphertext block. Choice of muddling technique based on the address is beyond the cryptographic scope of this research and may have dire implications to the security of the system if the technique is used improperly.

7.3 MTRRs

The proposal so far has not addressed issues pertaining to memory mapped IO devices. If the processor communicates with a peripheral via the memory bus, the peripheral will most likely not understand the encrypted garbage the processor throws at it. To solve this problem, we propose that the architecture will need a form of Memory Type Range Registers (MTRRs in the Intel vernacular). The operating system will have to mark ranges of memory as un-encryptable similar to how it must mark ranges of memory uncacheable. It may be most simple to make the uncacheable marking also imply that the memory range cannot be encrypted. The will give the added benefit that encryption layer may be incorporated into one of the cache layers.

7.4 Page usage

To help strengthen the cryptographic aspects of the algorithm, we suggest that the operating system randomize its allocation of free pages. Every time a new free page is allocated, take it from a random location in the free pool. Due to the deterministic nature of computers, pages will often be allocated to the same data structures every time the computer is booted. If a cryptanalyst is given a secure computer that has been running for a span of time, he can make assumptions about what is in a given page. It makes some sense to randomize the location of data structures a cryptanalyst might find useful to breaking the security of the system.

7.5 Multiprocessor Systems

We must also point out that encrypting a computer's memory precludes the use of multiple processors without a secure method of transfering the key. For the method to be secure and to be able to trust the other processors on the bus, all the processors would have to be incorporated into the same tamperproof packaging.



8 Placing the Cryptographic Memory Layer

The final decision in the cryptographically secure processor is where to place the encryption layer. The obvious choices are to place it between levels in the memory hierarchy. The layer closest to the processor that it may be placed is between the processor and L1 cache. Before or after the store buffers will not influence the results. The farthest from the processor that the encryption layer can be placed is between the most distant cache and main memory. Or a compromise may be sought by placing the encryption layer between levels of caches. For brevity's sake, we will pretend only two levels of cache are realistic.

8.1 Simulations Methodology

Todd Austin's et al SimpleScalar/PISA 3.0 tool set was used for simulations. Specifically, sim-outorder; the out of order instruction simulator was used. An out of order simulator was used instead of a cache hierarchy simulator to more accurately model real world conditions. Parts of the SPECint95 was used to give benchmarks. The entire suite could not be run due to time constraints (and people repeatedly rebooting machines on us at the most inopportune moments, grrrr). All of the benchmark were not run for the components of SPEC95 which were run, also due to time constraints. The benchmarks are intended to examine the worst case performance scenario if the processor architects can spare enough hardware for full encryption and decryption pipelines. Better performance is possible if the pipeline stages are able to be minimized. And worse performance may be acceptable if the encryption layer is to be placed distant from the processor -- a semi pipelined implementation may be ideal here.


*** GRAPHS GO HERE ***

8.2 Simulation of IDEA between Processor and L1

The secure processor will have abysmal performance if there is a 59 cycle decryption penalty when accessing L1 cache. These simulations are only included for completeness sake and comical value.

8.2.1 IJPEG Benchmark

The IJPEG benchmark took 8.45 times as many cyles to complete with an encryption layer before the L1 cache than it did without any encryption layer. Without the encryption layer, the IPC was 1.30; with the layer it was 0.15 -- 860% slower.

8.2.2 PERL Benchmark

The PERL benchmark took 6.477 times as many cyles to run with the encryption than it did without any encryption layer. The IPC was 1.255 without the encryption and .193 with the encryption -- 648% slower.

8.2.3 Processor-L1 Conclusion

Placing the encryption layer between the processor and the L1 cache should be avoided at all costs.

8.3 Simulation of IDEA between L1 and L2

Placing the encryption component between the L1 and L2 caches is a compromise between the speed of a distant encryption layer and the security of placing the encryption layer near the processor.

8.3.1 IJPEG Benchmark

The IJPEG Benchmark took 8.045 times as many cycles to complete execution with the encryption layer between the L1 cache and the L2 cache than no encryption layer. The IPC was 5% faster with the encryption between L1 and L2 than it was before the L1 cache. It was still 804% slower than with no encryption layer though.

8.3.2 PERL Benchmark

The PERL benchmark took 4.6 times as many cycles with the encryption layer than without. The IPC was 1.4 times higher with the encryption between L1 and L2 than it was before L1 but it was still 4.6 times lower than without any encryption.

8.3.3 L1-L2 Conclusion

The encryption layer between L1 and L2 caches does not impose as drastic a performance penalty as does the layer before the L1 cache but it still brings the performance down to unacceptable levels

8.4 Simulation of IDEA between L2 and main memory

Placing the encryption layer this distant from the processor should yield the most minimal performance penalty. If the L2 cache is located in the same physical package as the processor and L1 cache, this should not jeopardize the security afforded by the encryption. But if the L2 is on a different package than the processor and L1 cache, placing the encryption layer so distant may increase the size of the tamperproof casing which may have adverse effects if care is not taken.

8.4.1 IJPEG Benchmark

The encryption layer between the L2 cache and main Memory causes a factor of 7.27 increase in the cycle count compared to running the benchmark without any encryption. The IPC is 10.5% better than with the encryption between L1 and L2 but is still 728% worse than without any encryption.

8.4.2 PERL Benchmark

The PERL benchmark took 8% more cycles to complete with the encryption layer than without. The IPC was also 8% slower with the encryption layer than without any encryption.

8.4.3 LISP Benchmark

We also performed the LISP benchmark with an encryption layer between the L2 cache and main memory. With the crypto layer, the LISP benchmark only took .06% more cycles to complete than without a crypto layer. The CPI was .0002 smaller with the crypto than without

8.4.4 L2-Memory Conclusion

We believe the IJPEG benchmark showed fluke of short running programs not getting many cache hits which showed the performance degradation of a cryptographic memory layer. But the cryptograhic layer before memory gave acceptable performance results on longer running applications like LISP and PERL. In the case of LISP, there was almost no performance penalty.

8.5 Choosing the encryption layer placement

Hurray! Our first easy choice. To get the performance degradation of an encryption layer down to acceptable levels, the encryption layer must be placed between the L2 cache and main memory.


9 Conclusion

We have found that adding an IDEA cryptographic layer between an architectures L2 cache and main memory can provide a high level of artificial physical security and can often keep the performance at acceptible levels.



10 References

[1]
Anderson, Ross; Kuhn, Markus. "Tamper Resistance - a Cautionary Note." The Second USENIX Workshop on Electronic Commerce Proceedings", Nov 18-21, 1996

[2]
Ascom Corporate Web Site

[3]
Bean, Bryan; Ficke, Chris; Rozylowicz, Tom; and Weeks, Bryan. "Hardware Performance Simulations of Round 2 Advanced Encryption Standard Algorithms." National Security Agency, 2000

[4]
Blaze, M. "A Cryptographic File System for Unix." Proceedings of the First ACM Conference on Computer and Communications Security, Fairfax, VA, November 1993.

[5]
Blaze, M. "Key Management in an Encrypting File System." USENIX Summer 1994 Technical Conference, Boston, MA, June 1994.

[6]
Curinger, A; Bonnenberg, R; Zimmermann, N.; Kaeslin, H. and W. Fichtner. "VINCI: VLSI implementation of the new block cipher IDEA." Proceedings of the IEEE CICC'93.

[7]
Daemen, Joan; Rijmen, Vincent. "AES Proposa: Rijndael." 2000

[8]
de Raadt, Theo; Hallqvist, Niklas; Grabowski, Artur; Keromytis, Angelos D; and Niels Provos. "Cryptography in OpenBSD: An Overview."

[9]
Diffie W. and M.E. Hellman, "Multiuser Cryptographic Techniques," Proceedings of AFIPS National Computer Conference, 1976

[10]
Diffie W. and M.E. Hellman, "New Directions in Cryptography," IEEE Transactions on Information Theory, Nov 1976.

[11]
"Digital Millenium Copyright Act (DMCA)" 105th Congress 2d Session H.R 2281, Aug. 4, 1998

[12]
Elbirt, AJ; Yip, W; Chetwynd, B; Paar, C. "An FPGA Implementation and Performance Evaluation of the AES Block Cipher Candidate Algorithm Finalists." The Third Advanced Encryption Standard (AES3) Candidate Conference, April 13-14, 2000.

[13]
Goldberg, Ian, and David Wagner. "Architectural considerations for cryptanalytic hardware." 1996

[14]
Goldwasser S. and Micali S, "Probabilistic Encryption and How to Play Mental Poker Keeping Secret All Partial Information," Proceedings of the 14th ACM Symposium on the Theory of Computing, 1984.

[15]
http://www.chrysalis-its.com (Luna 340 secure coprocessor)

[16]
IBM Cryptocard Website http://www.ibm.com/security/cryptocards/

[17]
Intel Corporation. "The Intel Random Number Generator" 1999

[18]
Itoi, N. "Secure Coprocessor Integration with Kerberos V5," Proc. USENIX Security'2000, September, 2000.

[19]
Kadow, Kevin. Personal Communication via email.

[20]
Koch, Lewis Z. "Online Trader Vaults Left Ajar." Interactive Week, Oct 26, 2000.

[21]
Mosanya, Emeka; Teuscher, Christof; Restrepo, Hector; Galley, Patrick; and Eduardo Sanchez. "Cryptoboster: A Reconfigurable and Modular Cryptographic Coprocessor"

[22]
NIST, FIPS140-1 "Security Requirements for Cryptographic Modules".

[23]
Palmer, E.R. "An Introduction to Citadel---A Secure Crypto Coprocessor for Workstations." Research Report RC 18373, IBM T.J. Watson Research Center, 1992.

[24]
Provos, Niels. "Encrypting Virtual Memory."

[25]
Rijmen, Vincent. "Efficient Implementation of the Rijndael S-box".

[26]
Rueppel, Ranier. "Stream Ciphers," Contemporary Cryptology: The Science of Information Integrity, IEEE Press, 1992.

[27]
Savard, John. "A Cryptographic Compendium." 2000.

[28]
Schneier, Bruce. "Applied Cryptography." Second Edition. 1996

[29]
Smith, Sean W; Weingart, Steve. "Building a High-Performance, Programmable Secure Coprocessor." October 16, 1998 DRAFT

[30]
Smith S.W. and V. Austel. "Trusting Trusted Hardware: Towards a Formal Model for Programmable Secure Coprocessors." 3rd USENIX Workshop on Electronic Commerce. August 1998.

[31]
Smith, Sean; Perez, Ron; Seingart, Steve; and Vernon Austel. "Validating a High-Performance, Programmable Secure Coprocessor." Proceedings, 22nd National Information Systems Security Conference, Oct 1999.

[32]
Taylor, L. "Standard & Poor's Exposes Customers' Security." TechnologyEvaluation.com, June 21, 2000.

[33]
Weingart S.H., "Physical Security for the microABYSS System", Proceedings of the 1987 IEEE Symposium on Security and Privacy, Oakland, CA, pp. 52-58, (April 1987).

[34]
White S.R. and L. Comerford, "ABYSS: A Trusted Architecture for Software Protection," Proceedings of the 1987 IEEE Symposium on Security and Privacy, Oakland, CA, pp. 38-51, (April 1987).

[35]
"WIPO Performances and Phonograms Treaty" World Intellectual Property Organization, Dec 20, 1996.

[36]
Yee, Bennet. "Using Secure Coprocessors." PhD Thesis, Carnegie Mellon University, May 1994.




(Copyright (c) 2000 Michael Frantzen and Brian Koehl)