XMPP is at serious risk of being permanently replaced by proprietary alternatives that have already solved the mobile push messaging problem. Remember in the 90s how many closed IM networks there were? Today it is much worse.
I’ve been taking Coursera’s Stanford Cryptography I class and last week’s homework had an interesting extra credit problem: given eleven ciphertexts all encrypted with the same one-time pad key, find the plaintext of the last ciphertext. It included a hint to XOR the ciphertexts together and to take a note of what happens when you XOR a space with lowercase and uppercase characters.
When I initially went to solve the problem, I thought that it would make the most sense to use frequency analysis but hit a dead-end after a few hours and felt pretty frustrated. However, after studying the hint more carefully, I realized:
1
2
'ABC' XOR ' ' = 'abc'
'abc' XOR ' ' = 'ABC'
Eureka! If you combine that with these properties of XOR:
Pay close attention to the last two lines. You’ll notice that the XOR of two ciphertexts encoded with the same OTP key is identical to the XOR of the two plaintexts. Now, let’s look closer at the contents of
You’ll notice the uppercase ‘N’ and ‘T’ characters definitely stand out. From what we learned earlier, this means that the 5th position of the two original plaintexts has a 50% chance of being either a space or a lowercase ‘n’, and the 6th position is either a ‘t’ or a space.
** Spoiler Alert: If you’re currently taking this class and haven’t completed this homework, please do not read any further until you have solved it on your own. Your solution will likely be better than mine. **
I used this knowledge, along with the XOR permutation of the 11 given ciphertexts with each other to assemble the probability of each character at a given index. Basically, I had a parallel ‘probability’ array that kept track of the chance of each character occurring at each index.
The
1
scan_for_letter
function checks if a given character is within [a-zA-Z] and returns the swapped case of that letter if found. The
1
set_letter_at_index
function helps us keep track of the occurance of each letter by recording both a space and the letter in two probability arrays, each array corresponding to the one of the two currently XORed ciphertext pairs.
I figured that if for a given index, a non-space character with a 0.5 probability was probably that character. For a short-cut, basically all 2-element letter probability dictionaries could be assumed to be whatever non-space character was present. Otherwise, if there is no clear letter match, it is probably a space character. Sometimes there are no letter matches at all so I output a ‘
1
*
’ instead, with the final output looking something like this:
1
1. The *...
The
1
most_probable
function probably isn’t the most efficient or beautiful way of calculating this, but it is very useful and I end up reusing it later when refining the key guesses. You could have stopped right here and had a reasonable guess for each original plaintext, missing some characters here and there.
However, I thought that I could improve the accuracy of my plaintexts by guessing the key that the most generated plaintexts had in common with each other. I re-used the
1
most_probable
function to find the most common key by keeping track of the key bytes at each character index and finding the most commonly shared key. This allowed me to find the non-letter characters that were left, like punctuation and even the sneaky single-character ‘
1
fi
’ present in one of the plaintexts.
Unfortunately my solution still has bugs (can you find them?) and I wasn’t able to completely automate the key-finding process. There were still a bunch of garbage characters that I needed to guess based on context clues. I ended up repeatedly running the
1
guess_key
function on manually refined plaintexts until I found the original OTP key that decrypts all of the ciphertexts in their entirety.
I could have automated the process a bit more by using an English dictionary and guessing the missing characters within words using the Levenshtein distance to known words, but I don’t know how well that would work. I’m still interested in a completely automated solution that doesn’t require any manual intervention but, for now, this will have to do.
Remember, never use the same OTP key more than once or your ciphertexts will leak information about the original plaintexts and eventually reveal your key.
If you’re interested in the rest of the source code, it’s available as redacted Gist on GitHub.
One of the main motivations for the redesign was to include information about the recently re-branded Android version of ChatSecure (previously it was called Gibberbot). A large number of support requests for Android started showing up in the iOS GitHub Issues and UserVoice pages, so hopefully this new design should help guide people in the right direction.
Overall I have been very pleased with the zen of writing blog posts and other content in Markdown, and keeping it version controlled with git. Storing text in a database (especially without version control) just seems so broken now.