Advent of Code Day 9–Decompression Length

Today’s Advent of Code challenge was decompressing a sort-of run length encoded string, for which a recursive solution was the obvious choice.

The puzzle itself only asked for the length of the decompressed string to be calculated, so it was quite possible and permissible to solve the challenge without actually decompressing the string at all. I decided that I wanted my solution to be able to decompress the string, even though technically it wasn’t necessary.

So I created a recursive function that works its way through the string decoding as it goes, and yielding small strings which can be concatenated to produce the decompressed result. The benefits are that I can use this to check the decoding of test inputs, and that the entire decoded string doesn’t need to be stored in memory – we could sum the lengths of each element in the sequence as they are produced. It uses the convenient yield! operator to yield every element from a sequence:

let input = System.IO.File.ReadAllText (__SOURCE_DIRECTORY__ + "\\input.txt")
open System.Text.RegularExpressions

let rec expand isV2 parsePos (input:string)  = seq {
    let sub = input.Substring(parsePos)
    let m = Regex.Match(sub, @"\((\d+)x(\d+)\)")
    if m.Success then
        let chars = int m.Groups.[1].Value
        let repeats = int m.Groups.[2].Value
        let repeat = input.Substring(parsePos + m.Index + m.Length, chars)
        yield sub.Substring(0,m.Index)
        if isV2 then
            for i in 1..repeats do
                yield! expand isV2 0 repeat 
        else
            yield [for i in 1..repeats -> repeat] |> List.fold (+) ""
        yield! expand isV2 (parsePos+m.Index+m.Length+chars) input
    else
        yield sub
}
    
let decompressV1 input = expand false 0 input |> System.String.Concat

decompressV1 "ADVENT" |> printfn "%s" // ADVENT
decompressV1 "A(1x5)BC"  |> printfn "%s" // ABBBBBC
decompressV1 "(3x3)XYZ"  |> printfn "%s" // XYZXYZXYZ
decompressV1 "(6x1)(1x3)A"  |> printfn "%s" // (1x3)A
decompressV1 "X(8x2)(3x3)ABCY"  |> printfn "%s" // X(3x3)ABC(3x3)ABCY

(decompressV1 input).Length |> printfn "Part a: %d"

So this approach works fine and can be used to solve part b as well,

let decompressV2 input = expand true 0 input |> Seq.sumBy (fun f -> int64 f.Length)
decompressV2 "(27x12)(20x12)(13x14)(7x10)(1x12)A" |> printfn "Test: %d" // 241920
decompressV2 "(25x3)(3x3)ABC(2x3)XY(5x2)PQRSTX(18x9)(3x2)TWO(5x7)SEVEN" |> printfn "Test: %d"  // 445
decompressV2 input |> printfn "Part b: %d"

But this is very slow (20 minutes on my machine) since the decompressed string is extremely long, meaning we have to iterate through a very long and repetitive sequence of expansions. A more sane way is to write an algorithm just keeping track of the length of the decoded string, which greatly reduces the work required since we can just multiply the length of an expanded substring by its number of repetitions.

So here’s the solution I used for part b (which was actually what I created first):

let rec decompressLen (input:string) =
    let rec expand (parsePos, decodedLen) =
        let m = Regex.Match(input.Substring(parsePos), @"\((\d+)x(\d+)\)")
        if m.Success then
            let chars = int m.Groups.[1].Value
            let repeats = int64 m.Groups.[2].Value
            let repeat = input.Substring(parsePos + m.Index + m.Length, chars)
            let expandedLen = (decompressLen repeat) * repeats 
            expand (parsePos + m.Index + m.Length + chars,
                    decodedLen + int64 m.Index + expandedLen)
        else
            decodedLen + int64(input.Length-parsePos)
    
    expand (0,0L)

decompressLen "(27x12)(20x12)(13x14)(7x10)(1x12)A" |> printfn "Test: %d" // 241920
decompressLen "(25x3)(3x3)ABC(2x3)XY(5x2)PQRSTX(18x9)(3x2)TWO(5x7)SEVEN" |> printfn "Test: %d"  // 445
decompressLen input |> printfn "Part b: %d"

These two solutions show the classic trade-off between single and general purpose code. The optimised solution runs far quicker to get the length of decompressed data, but if the part b question had been “what is the 10 millionth character in the decompressed string”, our general purpose function would have been able to give us the answer very easily whilst the optimised solution would be of no use.

So in the real world, choosing between two implementation options requires that we have some idea of what the future requirements might be, and that we have an idea of what sort of input we might expect (e.g. do we need to cope with very large input data?). Usually we end up guessing at both.