If you'd like to create a zip file in Azure blob storage using the Azure Storage SDK, you can of course just create the zip locally and upload it, but in this post I want to show how you can create the zip on the fly, along with a potential gotcha and how to work around it.

Getting set up

For this example I'll use a connection string, but you can use DefaultAzureCredential if you prefer. We'll get a container client and create a block blob client that we're going to upload to. For the purposes of this demo, I'll also assume we have a local folder of files we want to zip.

using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Specialized; // GetBlockBlobClient is an extension method in this namespace

var blobServiceClient = new BlobServiceClient(connectionString);
var containerName = "my-container";
var containerClient = blobServiceClient.GetBlobContainerClient(containerName);
var blobName = "zipExperiment.zip";
var zipBlockClient = containerClient.GetBlockBlobClient(blobName);
// location of the files to zip
var files = Directory.GetFiles(@"C:\My source files");

Writing to the zip

To write the zip, we first open a writable stream to the blob using OpenWriteAsync, and then use ZipArchive from System.IO.Compression to create the zip archive. For each file we want to add, we call CreateEntry to create a new entry in the zip, Open a writable stream for that entry, and copy the local file's contents in.

var writeOptions = new BlockBlobOpenWriteOptions(); // can customize if you like with tags etc
var index = 0; // for this demo using an incrementing number to ensure each file in the zip has a unique name
using (var blobWriteStream = await zipBlockClient.OpenWriteAsync(true, writeOptions, CancellationToken.None))
using (var zipArchive = new ZipArchive(blobWriteStream, ZipArchiveMode.Create))
{
    foreach (var file in files)
    {
        var fileName = Path.GetFileName(file);
        var entry = zipArchive.CreateEntry($"{++index}{fileName}");
        using var zipEntryStream = entry.Open();
        using var localFileStream = File.OpenRead(file);
        await localFileStream.CopyToAsync(zipEntryStream);
    }
}

As you can see, this is all pretty straightforward, and it works just fine for a typical zip file. However, there is a gotcha that I ran into when this code was used to create a zip containing a very large number of files.

The block size issue

In Azure, a block blob can be made up of many "blocks". There is a limit of 50,000 blocks per blob, and the default block size is 4MB (although larger block sizes are supported). This means that in theory you could create a roughly 200GB zip file without adjusting the defaults.
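As an aside, if you do need larger blocks, I believe you can request them when opening the write stream via the BufferSize property on the open-write options. A hedged sketch:

// sketch: buffer more data before each block is committed
// (to my knowledge BufferSize defaults to 4MB on BlockBlobOpenWriteOptions)
var biggerBlockOptions = new BlockBlobOpenWriteOptions
{
    BufferSize = 32 * 1024 * 1024 // 32MB blocks
};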

However, if you use the zip-writing code above, you'll end up with another block every time you finish adding an entry to the zip. This means that if you try to create a zip containing 50,001 fairly small files, you won't be able to create it, even if the resulting zip would only be a few tens of megabytes in size.
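You can see what's happening for yourself by inspecting the blob's committed block list after a test run. A minimal sketch, using GetBlockListAsync on the block blob client we already have (Count() needs System.Linq):

var blockList = await zipBlockClient.GetBlockListAsync(BlockListTypes.Committed);
// with the zip-writing code above, expect roughly one committed block per zip entry
Console.WriteLine($"Committed blocks: {blockList.Value.CommittedBlocks.Count()}");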

The culprit is that Flush is called on the writable blob stream each time we finish writing a zip entry, and the blob stream interprets a flush as a request to commit the current block, resulting in blocks that are much smaller than 4MB. The trouble is that those Flush calls come from inside ZipArchive as each entry is closed, so we don't have direct control over when they happen.

My solution is fairly simple: a decorator that wraps the writable blob stream and simply doesn't pass on calls to Flush. Here's the code:

class AvoidFlushStream : Stream
{
    private readonly Stream source;
    private bool disposed;
    public AvoidFlushStream(Stream source)
    {
        this.source = source;
    }
    
    public override bool CanRead => source.CanRead;

    public override bool CanSeek => source.CanSeek;

    public override bool CanWrite => source.CanWrite;

    public override long Length => source.Length;

    public override long Position { get => source.Position; set => source.Position = value; }

    public override void Flush()
    {
        // deliberately do nothing: flushing the blob stream would commit a block
    }

    public override Task FlushAsync(CancellationToken cancellationToken)
    {
        // likewise swallow async flushes
        return Task.CompletedTask;
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        return source.Read(buffer,offset,count);
    }

    public override long Seek(long offset, SeekOrigin origin)
    {
        return source.Seek(offset, origin);
    }

    public override void SetLength(long value)
    {
        source.SetLength(value);
    }

    public override void Write(byte[] buffer, int offset, int count)
    {
        source.Write(buffer, offset, count);
    }

    // delegate async I/O straight to the wrapped stream, rather than falling
    // back to Stream's sync-over-async base implementations
    public override Task WriteAsync(byte[] buffer, int offset, int count, CancellationToken cancellationToken)
    {
        return source.WriteAsync(buffer, offset, count, cancellationToken);
    }

    public override ValueTask WriteAsync(ReadOnlyMemory<byte> buffer, CancellationToken cancellationToken = default)
    {
        return source.WriteAsync(buffer, cancellationToken);
    }

    public override async ValueTask DisposeAsync()
    {
        if (!disposed)
        {
            await source.DisposeAsync();
            disposed = true;
        }
        await base.DisposeAsync();
    }

    public override Task<int> ReadAsync(byte[] buffer, int offset, int count, CancellationToken cancellationToken)
    {
        return source.ReadAsync(buffer, offset, count, cancellationToken);
    }

    public override ValueTask<int> ReadAsync(Memory<byte> buffer, CancellationToken cancellationToken = default)
    {
        return source.ReadAsync(buffer, cancellationToken);
    }

    protected override void Dispose(bool disposing)
    {
        if (!disposed)
        {
            if (disposing)
            {
                source.Dispose();
            }
            disposed = true;
        }
        base.Dispose(disposing);
    }
}

And now we just need to make one small adjustment to wrap the writable stream with AvoidFlushStream:

using (var blobWriteStream = new AvoidFlushStream(await zipBlockClient.OpenWriteAsync(true, writeOptions, CancellationToken.None)))
// ...

And with that change in place, creating an 80MB zip file with 50,001 entries took me about 1 minute 30 seconds. Without the wrapper, the same operation ran for many hours before eventually failing because it had exceeded the 50,000 block limit.
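If you want to sanity-check the uploaded zip, one simple approach is to stream it back and open it with ZipArchive in read mode (the blob read stream is seekable, which ZipArchive needs). A quick sketch:

// open a seekable read stream over the blob and count the zip entries
using var readStream = await zipBlockClient.OpenReadAsync();
using var archive = new ZipArchive(readStream, ZipArchiveMode.Read);
Console.WriteLine($"Zip contains {archive.Entries.Count} entries");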

Hope this is helpful to someone, and of course, if you know a better way to resolve this issue, I'd love to hear about it in the comments!