One-Byte XOR

- 8 mins read

Series: cryptopals

Single-byte XOR

Things start to get a little bit interesting, but still nothing tricky, yet. Just like in the last challenge, we have to produce output from some hex input and some key. However, this time, the key is only one byte long.

Hex string:

1b37373331363f78151b7f2b783431333d78397828372d363c78373e783a393b3736

The xor key:

 

(there is no xor key) Did I mention we have to figure it out?

Now that we have some ready code from the previous challenge, which allows us to xor hex data, we’ll just modify it slightly, so that it will bruteforce the key. The simplest I could think of is just add a new loop. The new loop will iterate through the ASCII range [0,128), while the second nested loop will execute xor on each byte with the same one byte key. Every decrypted string is simply printed.

Bruteforce

use std::{
    env,
    error::Error,
};
fn main() -> Result<(), Box<dyn Error>> {
    let args: Vec<String> = env::args().collect();
    if args.len() != 2 {
        println!("Usage:\n\tsinglexor <hex string>");
        return Ok(());
    }
    let hexstr = &args[1];
    if !hexstr.len() % 2 == 0 {
        return Err("not a valid hex encoded string".into());
    }
    let mut bytes: Vec<u8> = Vec::new();    
    for i in (0..hexstr.len()).step_by(2) {
        let byte = u8::from_str_radix(&hexstr[i..][..2], 16).unwrap();
        bytes.push(byte);
    }
    for i in 0..128{
        let mut xored: Vec<u8> = Vec::new();
        for j in 0..bytes.len() {
            xored.push(bytes[j] ^ i);
        }
        println!("{}", std::str::from_utf8(&xored).unwrap());
    }
    Ok(())
}
> ./singlexor 1b37373331363f78151b7f2b783431333d78397828372d363c78373e783a393b3736

7316?x§413=x9x(7-6<x7>x:9;76
→66207>y¶→~*y502<y8y)6,7=y6?y;8:67

		...

eIIMOHA♠ke☺U♠JOMC♠G♠VISHB♠I@♠DGEIH
dHHLNI@jdTKNLBFWHRICHAEFDHI

The solution should be somewhere in this output (to keep it clean, only a few lines are printed above). There are more than 100 lines, not daunting, so we could find the answer by looking at each and every one of them and see which one is readable. Though, this approach gets impractical very quickly as the number of lines gets bigger, and is also quite boring. This brings us to a new technique.

Frequency analysis

In cryptanalysis, frequency analysis (also known as counting letters) is the study of the frequency of letters or groups of letters in a ciphertext. The method is used as an aid to breaking classical ciphers. -Wikipedia

Important: For accurate analysis, a larger chunk of reference text is preferable. Also, this method may not be foolproof, especially for short sentences like the one we are working with.

Count letters

You will see exactly why we need to count letters, but for now, here is the function that we’ll use to store information about how many times each letter occurs in a string. It returns a hashmap. For instance c: 4, for a string where the letter c occurs 4 times cccc. This allows for convenient comparison between the observed values and expected values, because the expected values will be stored in a hashmap too.

fn count_characters(input: &str) -> HashMap<char, u32> {
    let mut char_count: HashMap<char, u32> = HashMap::new();

    for c in input.chars() {
        // only work with letters when doing frequency analysis
        if c.is_alphabetic() {
            *char_count.entry(c.to_ascii_lowercase()).or_insert(0) += 1;
        }
    }
    char_count
}

You probably asked yourself what are observed values and expected values. Well there’s quite a bit of reading to do, but it is fairly simple and has to do with this next method.

Chi-Squared method

To make it practical, fast and fun, I will try and use the chi-squared method to determine if any of those lines are a valid sentence. This is why we need to be able to count letters. Then, compare these counted letters (observed frequencies) to the real frequency of letters in the English language (expected frequencies) based on a reference text. The reference text should ideally be a large corpus of English text in our case. However, it is easy to find this data online (example: wikipedia-letter-frequency).

To better understand it, here’s the formula:

$$ \chi^2 = \sum \frac{{(O_i - E_i)^2}}{{E_i}}$$

Where $O_i$ is the observed frequency, and $E_i$ is the expected frequency of letter $i$. For instance, if we found that in English text, E occurs 12% of the time, and T occurs 9% of the time, we would use these percentages as the expected frequencies for each letter. Next, we would sum up the chi-squared values for all letters and compare the total chi-squared value to a critical value(later on this one, though in our case we can clearly see if there is a corelation between the observed values and expected values by simply reading the output) from the chi-squared distribution table with degrees of freedom equal to the number of letters minus one. The degrees of freedom in a statistical calculation represent the number of variables that can vary in a calculation. So, if you are analyzing the frequency of letters in English sentences, the degrees of freedom would be the total number of letters in the English alphabet minus one. If the total chi-squared value exceeds the critical value, it indicates that the observed frequencies significantly differ from the expected frequencies. According to the table, in our case, every value above 36 is considered to indicate a non valid English sentence.

Calculate chi-squared

fn calculate_chi_squared(observed: f64, expected: f64) -> f64 {
    ((observed - expected).powi(2)) / expected
}
fn calculate_total_chi_squared(observed_frequencies: &HashMap<char, u32>, expected_frequencies: &HashMap<char, f64>) -> f64 {
    let total_letters: u32 = observed_frequencies.values().sum();

    let mut total_chi_squared = 0.0;
    for (letter, &observed) in observed_frequencies {
        if let Some(&expected) = expected_frequencies.get(&letter) {
            total_chi_squared += calculate_chi_squared(observed as f64, total_letters as f64 * expected);
        }
    }
    total_chi_squared
}

Expected Frequencies

Now I’m sure there’s better ways to do this, and you can even calculate them yourself by counting the letters from some big chunks of text, as mentioned before, but I got the expected frequencies from the internet and embedded them into a hashmap.

    let expected_frequencies: HashMap<char, f64> = [
        ('e', 0.127),
        ('a', 0.0817),
        //    ...
        ('z', 0.0007),
        ('k', 0.0077),
    ].iter().cloned().collect();

Full code

All things above considered, the full working code should look like this:

use std::{
    env,
    error::Error,
    collections::HashMap,
};

fn count_characters(input: &str) -> HashMap<char, u32> {
    let mut char_count: HashMap<char, u32> = HashMap::new();

    for c in input.chars() {
        if c.is_alphabetic() {
            *char_count.entry(c.to_ascii_lowercase()).or_insert(0) += 1;
        }
    }
    char_count
}

fn calculate_chi_squared(observed: f64, expected: f64) -> f64 {
    ((observed - expected).powi(2)) / expected
}

fn calculate_total_chi_squared(observed_frequencies: &HashMap<char, u32>, expected_frequencies: &HashMap<char, f64>) -> f64 {
    let total_letters: u32 = observed_frequencies.values().sum();

    let mut total_chi_squared = 0.0;
    for (letter, &observed) in observed_frequencies {
        if let Some(&expected) = expected_frequencies.get(&letter) {
            total_chi_squared += calculate_chi_squared(observed as f64, total_letters as f64 * expected);
        }
    }
    total_chi_squared
}

fn main() -> Result<(), Box<dyn Error>> {
    let args: Vec<String> = env::args().collect();
    if args.len() != 2 {
        println!("Usage:\n\tsinglexor <hex string>");
        return Ok(());
    }
    let hexstr = &args[1];
    if !hexstr.len() % 2 == 0 {
        return Err("not a valid hex encoded string".into());
    }
    let expected_frequencies: HashMap<char, f64> = [
        ('a', 0.0817),
        ('b', 0.0149),
        ('c', 0.0278),
        ('d', 0.0425),
        ('e', 0.127),
        ('f', 0.0223),
        ('g', 0.0202),
        ('h', 0.0609),
        ('i', 0.0697),
        ('j', 0.0015),
        ('k', 0.0077),
        ('l', 0.0403),
        ('m', 0.0241),
        ('n', 0.0675),
        ('t', 0.0906),
        ('o', 0.0751),
        ('s', 0.0633),
        ('r', 0.0599),
        ('u', 0.0276),
        ('w', 0.0236),
        ('y', 0.0197),
        ('p', 0.0193),
        ('v', 0.0098),
        ('x', 0.0015),
        ('q', 0.001),
        ('z', 0.0007),
    ].iter().cloned().collect();

    let mut chis: Vec<f64> = Vec::new();
    let mut bytes: Vec<u8> = Vec::new();    
    for i in (0..hexstr.len()).step_by(2) {
        let byte = u8::from_str_radix(&hexstr[i..][..2], 16).unwrap();
        bytes.push(byte);
    }
    for i in 0..128{
        let mut xored: Vec<u8> = Vec::new();    // This will hold the current decrypted string
        for j in 0..bytes.len() {
            xored.push(bytes[j] ^ i);
        }
        // pretty self-explanatory, creates a hashmap of chars and their occurence as u32 from an ascii string
        let observed_frequencies = count_characters(std::str::from_utf8(&xored).unwrap());
        // chi-squared value of the current xored string 
        let total_chi_squared = calculate_total_chi_squared(&observed_frequencies, &expected_frequencies);
        //ignore 0 chi-squared values as this means the string is empty
        if total_chi_squared != 0.0 {
            chis.push(total_chi_squared);
            println!("{} {} {}", i, total_chi_squared, std::str::from_utf8(&xored).unwrap());
        }
    }
    println!("Smallest chi squared value: {}", chis.iter().copied().fold(f64::INFINITY, f64::min));
    Ok(())
}

Final

Let’s run the program with the string we are provided and save the output to a file:

./singlexor 1b37373331363f78151b7f2b783431333d78397828372d363c78373e783a393b3736 > out.txt

After opening the file, you can see all the attempts at decrypting a string, at least one of them is correct.

0 3988.008999999999 77316?x+x413=x9x(7-6<x7>x:9;76
1 292.6867279187817 66207>y~*y502<y8y)6,7=y6?y;8:67
2 8559.432771428572 55134=z})z631?z;z*5/4>z5<z8;954
5 1426.5721285714285 22643:}z.}1468}<}-2(39}2;}?<>23

        ...

124 395.13435098956717 gKKOMJCigWHMOAETKQJ@KBFEGKJ
125 687.4101105337077 fJJNLKBhfVILN@DUJPKAJCGDFJK
126 45.154514962914675 eIIMOHAkeUJOMCGVISHBI@DGEIH
127 47.151804944267106 dHHLNI@jd
Smallest chi value: 31.618818336524786

As I mentioned before, you could find the decrypted string by visually going through all the 100 something lines, but notice the last line in the output - the smallest chi value. The smallest chi value represents the highest probability that a line contains English text. Here it is 31.6, smaller than 36, the value in the chi-squared distribution table.

Result

After searching for 31.61… in my out.txt file, I found this:

> 88 31.618818336524786 Cooking MC's like a pound of bacon
> 120 31.618818336524786 cOOKINGmcSLIKEAPOUNOFBACON

Where 88 and 120 are the keys, which are X(uppercase) and x(lowercase).


Get Involved

I think knowledge should be shared and discussions encouraged. So, don’t hesitate to ask questions, or suggest topics you’d like me to cover in future posts.

Stay Connected

You can contact me on J6597@tutanota.com


comments powered by Disqus