Earlier this week I heard about a new startup called Bitcasa which is offering “infinite” secure cloud storage for a low monthly fee. Now, I’m not particularly interested in relying on a brand-new startup for all of my off-site storage needs, but one of Bitcasa’s technical claims seems to have raised a few eyebrows on the Internet. In particular, I learned from an episode of the podcast Security Now that Bitcasa claims to use exclusively client-side encryption and also to be able to de-duplicate files server-side.
Think about that for a moment, and it may at first seem impossible. How can you de-duplicate plaintext that the server never has access to? But it’s not impossible, and I wouldn’t doubt that many ways of doing this are widely known, but I did find it to be a really interesting computer science brainteaser. Here’s the problem, in my words:
Design a service that allows users to store blocks of data and retrieve them later. The service must have the following properties:
- No block is stored more than once.
- No more than O(1) additional space is used per user who “owns” a particular block.
- No user is able to decrypt a block that he does not own.
- The service could not be compelled by any authority to decrypt the blocks it stores.
- No communication between users, directly or through the service, is required.
- The service could not be compelled by any authority to divulge which users own which blocks.
In the Security Now episode in question, the host gave a solution which satisfied 1-4, but not 5 or 6, so I have labelled them as extra credit. If you want to try solving this problem yourself, stop reading now — my solution follows.
First, some notation and definitions. Below I will be making use of two well-known algorithms: AES, a symmetric key cipher, and SHA, a cryptographic hash. Both of these come in various flavors, but the only requirement is that the output of the SHA flavor used must be the same length as the key for the AES flavor used. So, for the sake of argument, let’s say we use AES-256 and SHA-256. I will denote an application of
SHA to some data
SHA(X). I will denote an application of
AES to some data
X with key
And, before beginning, I should point out that I have no idea if this is even close to what Bitcasa actually does. I don’t even know if this method is well-known and is named after some professor somewhere. This is just the way I came up with to solve the problem as presented after I thought about it for a while.
Storing a New Block
Consider User 1. This user generates a private key
PK1 which is kept secret and never sent to the service. When the user logs into the service for the first time, he creates a block list
BL1, initially empty. This list will store information about the blocks he owns (details later), and its size will be O(n) in the number of owned blocks. This block list file is encrypted to
CL1 = AES(BL1, PK1) and the encrypted list
CL1 is uploaded to the service. This file is directly associated with the user’s account and is not de-duplicated. Since the file is linear in the number of owned blocks, the user is contributing only O(1) additional data per owned block, satisfying requirement 2, even though the block list itself is not de-duplicated. The service is unable to decrypt the block list because it is encrypted using
PK1, which is never sent to the server.
Now, User 1 has a block
B which he wishes to store. He computes
SHA(B), the hash of
B, and then uses the hash as the key to encrypt the block:
C = AES(B, SHA(B)). In other words, the encrypted block
C is the original block
B encrypted using its own hash as the key. The user then sends
C to the server, and the server uses
SHA(C) as the unique identifier for the block B. Note that the server cannot decrypt
C since the key
SHA(B) was never sent to the server. This satisfies requirement 4.
Ownership and De-Duplication
After the block is stored, User 1 must claim ownership of the block. He does this by adding a record to the block list consisting of the tuple:
<filename, block number, SHA(B), SHA(C)>. The first two elements are for the user’s benefit only; they don’t have anything to do with how the service works. The filename and block number would allow the client to reassemble the blocks into complete files with names so that they can be presented in a user-friendly way. But what’s important is
SHA(C). Recall that
SHA(B) is the key used to encrypt
SHA(C) is the unique identifier that the service uses for identifying encrypted block
C. So what this record says is “I know that using
SHA(B), I can decrypt the block identified by
SHA(C)“. This is an association that is known only by owners of the file, satisfying requirement 3. The user then saves the new block list file, re-encrypts it using
PK1, and re-uploads it to the server. Again, the server is unable to read the contents of this file.
If User 1 later wants to retrieve a block, he requests
CL1, the encrypted block list, from the server. He decrypts
PK1, and then looks up the filename and block number he wants to retrieve. He then requests the block with identifier
SHA(C) from the server, and the server responds with
C. Finally, the user decrypts
SHA(B) and the original plaintext data is available.
Now suppose User 2 with private key
PK2 comes along with the same block
B and wants to store. He can compute the same encrypted block
C = AES(B, SHA(B)) using only his local knowledge. When he tries to send
C to the server, the server will notice that it already has a block with identifier
SHA(C), and will not store
C again, satisfying requirement 1. User 2 can then add
<filename, n, SHA(B), SHA(C)> to his block list
BL2, encrypt it to
CL2 = AES(BL2, PK2), and upload
CL2. User 2 can retrieve the original block
B in the same way that User 1 did, except by retrieving
CL2 instead of
CL1. Note that User 1 and User 2 now share an encrypted block
C on the server that they can both read, but they never had to communicate with each other, satisfying extra credit requirement 5.
Finally, consider requirement 6. There’s a potential weakness to this sort of system. Say that block
B contained some sort of illicit materials (e.g. pirated media). You could imagine an authority taking a copy of
SHA(C) and then subpoenaing the service for a list of users who own that block. Could the service comply with such a subpoena? Assuming the service is trying to protect the privacy of its users and not doing anything stupid like logging which IPs addresses sent or received which blocks, it would be technically impossible for the service to associate the block identified by
SHA(C) with either User 1 or User 2. The only place in which that association exists is within
CL2, which are encrypted with
PK2 respectively, and the service does not possess these keys. Therefore, the service is not only secure, but anonymous with respect to the association between users and their data, satisfying extra credit requirement 6.
One final note: Observe that it is not a security flaw if an encrypted block
C or an encrypted block list
CL were to fall into the hands of someone other than their owner. These are both strongly encrypted, and nothing meaningful can be extracted from either without the appropriate keys, which are posessesed (or attainable) only by the true owners. So the service can happily respond to a request for
C without having to know or care whether or not the requestor is one of the owners of
C. The only further security consideration woud be to ensure that a user be authenticated when uploading his encrypted block list, otherwise the service is vulnerable to a denial of service attack by means of a malicious user overwriting another user’s block list with garbage.