Transparent Encryption Of Offsite Backups With Puppet And Git

I’ll be going into some detail as to how our source control setup works at a later date, but I wanted to address a hot topic beforehand: secure storage of configuration data in the cloud.

All of our source code commits are automatically backed up in the cloud. For us this is GitHub, but the same should hold for other SaaS platforms such as those offered by Atlassian. As such, all of our configuration data ends up on untrusted systems, be it network address ranges or passwords stored in our Hiera configuration files. This also goes for any certificates that need to be part of our Puppet environment.

First Steps

Our initial solutions were based on Puppet modules, as that was what Google initially hinted at. hiera_yamlgpg seemed to fit the requirements. This module replaces the default YAML backend provided by Puppet and transparently decrypts the Hiera data files on the fly when the puppet master compiles a catalog. The plus point of this approach was that the majority of the data file could be left in plain text, with only the pertinent fields encrypted with gpg like so:

echo -n Passw0rd | gpg -a -e -r recipient

I ran into issues by forgetting to strip the newline at times (the -n flag above), and the Hiera data file soon became a mess of GPG blocks. Reading the encrypted values back was a pain, and code review was tedious as there was no way of seeing the actual changes without some legwork.

At this point it was decided we should just opt for full file encryption. This led me to wonder whether git supported some form of hook whereby encryption and decryption could be performed while transferring sensitive data on and off site. As it turns out, something better exists: git supports filters, which can be run on individual files as they are checked in and out of working branches.

Transparent Encryption With Git

Files can be tagged with attributes, either individually or with wildcards, in either .git/info/attributes or .gitattributes. The former applies to a single repository only; the latter is under version control and propagated to all my peers, which seems like the right thing to do.

/hieradata/common.yaml      filter=private

The specified file is tagged with a filter called private; the name is arbitrary. Now when we check out a file with the private filter (smudge) or check it back in (clean), we can run arbitrary commands on the contents. Input arrives on stdin and output is written to stdout.

git config --global filter.private.smudge decrypt
git config --global filter.private.clean  encrypt
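
To make the stdin/stdout contract concrete, here is a minimal sketch of a filter pair in Python 3. The names run_filter, clean and smudge are hypothetical, and base64 is only a reversible placeholder, not encryption; the point is just that each filter reads the whole file from one stream and writes the transformed bytes to another.

```python
import base64
import io


def run_filter(transform, instream, outstream):
    """Read the whole file from instream, transform it, write it to outstream."""
    outstream.write(transform(instream.read()))


# Placeholder transforms (assumption): base64 is NOT encryption, it merely
# stands in for a reversible clean/smudge pair.
def clean(data):
    return base64.b64encode(data)


def smudge(data):
    return base64.b64decode(data)


# Demo with in-memory streams instead of the real stdin and stdout
original = b'password: Passw0rd\n'
stored = io.BytesIO()
run_filter(clean, io.BytesIO(original), stored)      # what git would store
restored = io.BytesIO()
run_filter(smudge, io.BytesIO(stored.getvalue()), restored)  # working copy
```

In a real script the streams would be sys.stdin.buffer and sys.stdout.buffer, so the bytes pass through without any text decoding.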

The encrypt and decrypt scripts were initially based on GPG, as the keys were already installed from our prior dabble with hiera_yamlgpg. And it worked, but not great. The issue was that GPG doesn’t encrypt deterministically: every run uses a fresh random session key, and the output most likely also carries data such as time stamps. The same plain text therefore produces different cipher text each time, which led to git thinking that the private files were always modified. Not a problem in itself, as the files can be ignored in the index and manually committed when a change actually occurs, but this is hardly the transparent workflow we desired. The real deal breaker came when trying to pull from a remote origin, which git refused to do as it would destroy locally modified files. Back to the drawing board then.

Turns out things work perfectly when you remove the non-determinism.
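
A rough illustration of why this matters, in Python 3. As far as I understand it, git decides whether a filtered file has changed by hashing the clean output and comparing it with the blob ID in the index. The toy_encrypt function below is a made-up stand-in (an XOR against a SHA-256 keystream) and must not be used for real secrets; it only shows that a random per-run value changes the blob ID while a fixed one does not.

```python
import hashlib
import os


def git_blob_id(content):
    """Compute the object ID git assigns to a blob with this content."""
    header = 'blob {0}\0'.format(len(content)).encode('ascii')
    return hashlib.sha1(header + content).hexdigest()


def toy_encrypt(data, key, nonce):
    """Toy stand-in for a cipher, for illustration only. NOT secure."""
    stream = b''
    counter = 0
    while len(stream) < len(data):
        block = key + nonce + counter.to_bytes(4, 'big')
        stream += hashlib.sha256(block).digest()
        counter += 1
    return nonce + bytes(a ^ b for a, b in zip(data, stream))


key = b'k' * 32
plaintext = b'password: Passw0rd\n'

# Fixed nonce: the clean output, and so the blob ID, is stable between runs
fixed = b'\x00' * 16
stable_a = git_blob_id(toy_encrypt(plaintext, key, fixed))
stable_b = git_blob_id(toy_encrypt(plaintext, key, fixed))

# Random nonce per run: the blob ID differs even though nothing changed
random_a = git_blob_id(toy_encrypt(plaintext, key, os.urandom(16)))
random_b = git_blob_id(toy_encrypt(plaintext, key, os.urandom(16)))
```

The trade-off is that deterministic output also tells an observer when two versions of a file are identical, which is the price paid here for a clean git status.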

Encryption & Decryption In Python

There are a couple of solutions out there that use openssl, but they required compilation, which made me steer clear. We’re a Python shop, and I’m a geek, so I put together a solution using AES-256 from python-crypto, with the output encoded in base 64.

The important bits are key and initialisation vector generation

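import sys

from Crypto import Random
from Crypto.Cipher import AES

# KEY_PATH and KEY_SIZE are module-level constants defined elsewhere in the script

def gen_key():
    """
    Generate a new key file holding the key and initialisation vector material
    """
    try:
        keyf = open(KEY_PATH, 'wb')
    except IOError:
        sys.stderr.write('Err: Open {0} for writing\n'.format(KEY_PATH))
        sys.exit(1)
    # KEY_SIZE random bytes for the key plus one AES block for the IV
    with keyf:
        keyf.write(Random.new().read(KEY_SIZE + AES.block_size))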
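
The get_key and get_ivec helpers used below aren't shown. One plausible reading, sketched here in Python 3, is that they slice the key file into its two parts. The layout (key first, IV second), the KEY_SIZE value of 32 and the name read_key_material are all assumptions made for illustration.

```python
import os
import tempfile

KEY_SIZE = 32  # assumed: AES-256 key length in bytes
IV_SIZE = 16   # AES block size in bytes, which is also the CBC IV length


def read_key_material(path):
    """Return (key, ivec) from a key file, assuming the key comes first."""
    with open(path, 'rb') as keyf:
        material = keyf.read()
    if len(material) != KEY_SIZE + IV_SIZE:
        raise ValueError('unexpected key file length: {0}'.format(len(material)))
    return material[:KEY_SIZE], material[KEY_SIZE:]


# Demo against a throwaway key file filled with random bytes
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.key')
demo_material = os.urandom(KEY_SIZE + IV_SIZE)
with open(demo_path, 'wb') as keyf:
    keyf.write(demo_material)

key, ivec = read_key_material(demo_path)
```

Checking the length up front means a truncated or corrupt key file fails loudly rather than producing garbage on checkout.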

Encryption

def encipher():
    """
    Encipher data from stdin
    """
    key = get_key()
    ivec = get_ivec()
    data = sys.stdin.read()
    datalen = len(data)

    # Now for the fun bit: we prefix the data with a 32 bit integer
    # holding its actual length. The cipher input has to be rounded up
    # to the block size, so this allows recovery of the exact data
    # length upon deciphering. We also specify big-endian encoding to
    # support cross platform operation
    buflen = round_up(datalen + 4, AES.block_size)
    buf = bytearray(buflen)
    struct.pack_into('>i{0}s'.format(buflen - 4), buf, 0, datalen, data)

    # Encipher the data
    cipher = AES.new(key, AES.MODE_CBC, ivec)
    ciphertext = cipher.encrypt(str(buf))

    # And echo out the result
    sys.stdout.write(HEADER)
    sys.stdout.write(base64.b64encode(ciphertext))

And decryption

def decipher_common(filedesc):
    """
    Decipher data from a file object
    """
    key = get_key()
    ivec = get_ivec()
    ciphertext = base64.b64decode(filedesc.read())
    # Decipher the data
    cipher = AES.new(key, AES.MODE_CBC, ivec)
    buf = cipher.decrypt(ciphertext)

    # Unpack the buffer, first unpacking the big endian data length
    # then unpacking that length of data
    datalen, = struct.unpack_from('>i', buf)
    data, = struct.unpack_from('{0}s'.format(datalen), buf, 4)

    # And echo out the result
    sys.stdout.write(data)
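
The length-prefixed padding can be tried on its own, without any cipher involved. This is a Python 3 sketch: round_up is referenced above but not shown, so the version here is a guess at its behaviour, and pack_padded and unpack_padded are hypothetical names wrapping the same struct calls.

```python
import struct

BLOCK_SIZE = 16  # AES block size in bytes


def round_up(value, multiple):
    """Round value up to the nearest multiple (assumed behaviour)."""
    return (value + multiple - 1) // multiple * multiple


def pack_padded(data):
    """Prefix data with its big-endian 32 bit length and zero-pad to a block."""
    buflen = round_up(len(data) + 4, BLOCK_SIZE)
    buf = bytearray(buflen)
    # The 's' format pads short input with null bytes up to buflen - 4
    struct.pack_into('>i{0}s'.format(buflen - 4), buf, 0, len(data), data)
    return bytes(buf)


def unpack_padded(buf):
    """Recover the original data, dropping the length prefix and padding."""
    datalen, = struct.unpack_from('>i', buf)
    data, = struct.unpack_from('{0}s'.format(datalen), buf, 4)
    return data


padded = pack_padded(b'Passw0rd')
print(len(padded))            # 16
print(unpack_padded(padded))  # b'Passw0rd'
```

Storing the length explicitly means trailing null bytes that are part of the real data survive the round trip, which stripping the padding blindly would not guarantee.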

decipher_common takes a file object because, when used in diff mode, git provides a file name, and that file may or may not already be decrypted. This is the purpose of the HEADER string: it determines whether to perform the decryption or just echo out the file contents. You can enable the diff functionality by updating .gitattributes

/hieradata/common.yaml      filter=private diff=private

And updating your git configuration to act on the tag

git config --global filter.private.smudge 'dc_crypto decipher'
git config --global filter.private.clean  'dc_crypto encipher'
git config --global diff.private.textconv 'dc_crypto diff'
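
The HEADER check used in diff mode could look roughly like the following Python 3 sketch. The real HEADER value isn't shown, so EXAMPLE_HEADER is made up, and diff_text takes the deciphering step as a parameter (here a trivial stand-in) rather than calling the real AES code.

```python
import base64
import io

EXAMPLE_HEADER = b'DC_CRYPTO:'  # hypothetical marker, not the real HEADER


def diff_text(fileobj, decipher):
    """Return plain text for diffing, deciphering only if the header is present."""
    content = fileobj.read()
    if content.startswith(EXAMPLE_HEADER):
        payload = content[len(EXAMPLE_HEADER):]
        return decipher(base64.b64decode(payload))
    # No header: the file is already plain text, so pass it through
    return content


def fake_decipher(ciphertext):
    """Stand-in for the real deciphering step: just reverses the bytes."""
    return ciphertext[::-1]


plain = b'password: Passw0rd\n'
enciphered = EXAMPLE_HEADER + base64.b64encode(plain[::-1])

from_cipher = diff_text(io.BytesIO(enciphered), fake_decipher)
from_plain = diff_text(io.BytesIO(plain), fake_decipher)
```

Both calls return the same plain text, so the textconv command can be pointed at a file in either state.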

Obviously you need to keep your symmetric key and initialisation vector well protected, but I hope I’ve given enough information for you to avoid the same mistakes I made and keep your data secure in the SaaS world.