ZIP file from scratch

2020-11-27

Let's try creating a ZIP file from scratch. Since I'm mostly interested in the layout and meta-data we'll be using the option to add the files uncompressed. That means I don't have to learn the [deflate] compression algorithm at the same time. The resulting knowledge is still useful in that it will allow us to bundle a set of files into a single, ubiquitous format.

Unix [cpio] (or even [tar]) would be easier, but neither is ubiquitous... I'm looking at you, DOS.

I'll be using C for this, since it's ubiquitous... since I like it. Also, C is good for the kind of low-level bit manipulation we'll be doing here. Even if you don't like C there is still some useful information here about the zip file format. (I was going to use Common Lisp, but it's not ubiquitous.)

Useful resources:

simplest zip file

Zip files are read from the end. This allows the zip file to be built in a single pass. (It also allows some other tricks, like prefixing data at the start of a zip file, e.g. an executable to unpack the remainder.)

A zip reader finds the final record by searching backwards for the record's magic number, 0x06054b50 (a record is just a block of bytes in the file). Every zip file must contain an End of Central Directory Record (EOCD), and the simplest zip file consists of only an End of Central Directory Record. That would be an empty zip file, but a zip file nevertheless.
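The backwards search can be sketched from the reader's side. This is only an illustration, assuming the whole archive is already in memory; find_eocd and EOCD_MIN_SIZE are names made up for this example, and a careful reader would also check that the comment length stored in the record accounts for the bytes that follow it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { EOCD_MIN_SIZE = 22 };  /* fixed part of the EOCD, without the comment */

/* Hypothetical helper: return the offset of the last EOCD signature
   (0x06054b50, stored as bytes 50 4b 05 06) in buf, or -1 if none. */
static long find_eocd(uint8_t const *buf, size_t len){
  if (len < EOCD_MIN_SIZE) return -1;
  /* Start at the last position where a complete 22-byte record still fits
     and walk towards the start of the file. */
  for (size_t i = len - EOCD_MIN_SIZE + 1; i-- > 0; ) {
    if (buf[i] == 0x50 && buf[i+1] == 0x4b &&
        buf[i+2] == 0x05 && buf[i+3] == 0x06) {
      return (long)i;
    }
  }
  return -1;
}
```

For the empty archive described above the signature is found at offset 0. If extra data is prefixed to the file, the returned offset simply moves by the size of the prefix.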

Note the mention of disk numbers in the EOCD format below: this file format was invented back in the days of archives spanning multiple floppy disks.

The EOCD format is given here. https://en.wikipedia.org/wiki/ZIP_(file_format)#End_of_central_directory_record_(EOCD)

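Remember that all integers (including the magic) must be written little-endian (i.e. least significant byte first). I will assume your platform does that automatically since most people will be using an Intel/AMD processor. (If your platform is big endian you will have to swap the bytes yourself. Note that ntohs() and ntohl() will not help here: they convert from big-endian network order and do nothing on a big-endian host. Look at htole16() and htole32() from <endian.h> where available, or write the bytes out one at a time.)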
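The functions in this article call putu16(), putu32() and putbytes() and read a global curoffset, but their bodies are not shown here. As a rough sketch of what such helpers might look like, the version below writes one byte at a time using shifts, so the output is little-endian whatever the host byte order. The bodies and the type of curoffset are assumptions made for illustration; the real ones are in the linked mkzip.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed global: running count of bytes written so far. */
static size_t curoffset = 0;

/* Write n raw bytes and advance the running offset. */
static void putbytes(uint8_t const *p, size_t n, FILE *f){
  size_t const written = fwrite(p, 1, n, f);
  assert(written == n);
  curoffset += written;
}

/* Write a 16-bit integer, least significant byte first. */
static void putu16(uint16_t v, FILE *f){
  uint8_t const b[2] = { (uint8_t)(v & 0xff), (uint8_t)(v >> 8) };
  putbytes(b, sizeof b, f);
}

/* Write a 32-bit integer, least significant byte first. */
static void putu32(uint32_t v, FILE *f){
  uint8_t const b[4] = {
    (uint8_t)(v & 0xff), (uint8_t)((v >> 8) & 0xff),
    (uint8_t)((v >> 16) & 0xff), (uint8_t)(v >> 24),
  };
  putbytes(b, sizeof b, f);
}
```

With these, putu32(0x06054b50, f) emits the bytes 50 4b 05 06, which is the "PK" prefix followed by 05 06 that marks an EOCD record.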

This is the C function which will write the EOCD record. The full code is linked at the bottom of the article.

/*
  EOCD record [End of Central Directory] (all integers little-endian)
  https://en.wikipedia.org/wiki/ZIP_(file_format)#End_of_central_directory_record_(EOCD)
  == offset len description
  0   4   End of central directory signature = 0x06054b50
  4   2   Number of this disk
  6   2   Disk where central directory starts
  8   2   Number of central directory records on this disk
  10  2   Total number of central directory records
  12  4   Size of central directory (bytes)
  16  4   Offset of start of central directory, relative to start of archive
  20  2   Comment length (n)
  22  n   Comment
*/
static void eocd(unsigned nentries, FILE *f){
  enum{magic=0x06054b50, disknum=0, no_comment=0, };
  size_t debug_size = curoffset;
  uint32_t const cdsz = curoffset - cdoffset;
  putu32(magic, f);
  putu16(disknum, f);
  putu16(disknum, f);
  putu16(nentries, f);
  putu16(nentries, f);
  putu32(cdsz, f);
  putu32(cdoffset, f);
  putu16(no_comment, f);
  // no comment to put
  debug_size = curoffset-debug_size;
  assert(22==debug_size);
}

Build the code and call the program to produce our zip file (on stdout):

$ make mkzip
$ ./mkzip >ex.zip
$ file ex.zip
ex.zip: Zip archive data (empty)

Congratulations! We have created a zip file from scratch. That will be all for today. Thank you for watching...

adding some files

$ unzip -l ex.zip
Archive:  ex.zip
warning [ex.zip]:  zipfile is empty

Ok, so we'd actually like to add some files to our archive. Otherwise what's the point?

Every file we add must be prefixed with a local file header and, after all the files are added, we add a Central Directory record which contains the locations (offsets) of all those files.

Firstly let's add the local file header. It needs the length of the file and the length of the file name. No problem. More difficult is that the header needs the CRC32 checksum of the file (to allow the zip reader to check the integrity of the archive). https://en.wikipedia.org/wiki/ZIP_(file_format)#Local_file_header

There is also a trailing data descriptor record, which can be used to supply the CRC32 checksum and the compressed/uncompressed sizes after we've written the file. This can help simplify the zip writer's job. We will not use it here.

I have a feeling that CRC32 checksums will be difficult. So let's just write a zero crc32 and see what happens...

https://en.wikipedia.org/wiki/Cyclic_redundancy_check#CRC-32_algorithm

Call local_file_header for each file we want to add to the zip. I've defined a struct zfile_t to hold the file name and some other information like its size and crc32 (set to zero for now). The full code is linked at the end.

/*
  Local File Record
  https://en.wikipedia.org/wiki/ZIP_(file_format)
  == offset len description
  0   4   Local file header signature = 0x04034b50 (read as a little-endian number)
  4   2   Version needed to extract (minimum)
  6   2   General purpose bit flag
  8   2   Compression method
  10  2   File last modification time
  12  2   File last modification date
  14  4   CRC-32 of uncompressed data
  18  4   Compressed size
  22  4   Uncompressed size
  26  2   File name length (n)
  28  2   Extra field length (m)
  30  n   File name
  30+n  m   Extra field
*/
static void local_file_header(zfile_t *zf, FILE *f){
  enum{magic=0x04034b50, bit_flags=0, no_compression_method=0,
    mod_time=0, mod_date=0,
    no_extra=0,
    };
  size_t debug_size = curoffset;
  unsigned const fnamelen = strlen(zf->fname);
  putu32(magic, f);
  putu16(EXTRACTOR_MIN_VERSION, f);
  putu16(bit_flags, f);
  putu16(no_compression_method, f);
  putu16(mod_time, f);
  putu16(mod_date, f);
  putu32(zf->crc32, f);
  putu32(zf->sz, f); // compressed size == uncompressed size cos not compressing
  putu32(zf->sz, f);
  putu16(fnamelen, f);
  putu16(no_extra, f);
  putbytes((uint8_t const *)zf->fname, fnamelen, f);
  debug_size = curoffset-debug_size;
  assert((30+fnamelen)==debug_size);
}

If we try to unzip the file now we get this error. We don't have the Central Directory record yet.

$ unzip -l ex.zip
Archive:  ex.zip
warning [ex.zip]:  25784 extra bytes at beginning or within zipfile
(attempting to process anyway)
warning [ex.zip]:  zipfile is empty

the central directory

This record/block contains a list of all the files in the archive (it need not contain all the files we added, nor list them in the same order; this allows deleting files without rewriting the entire archive, important if you have 10 floppy disks...).

The central directory doesn't have a separate header. It's not really a record, more a sequence of records, one per file of interest. These records are similar to the local file header records but have a few extra fields.

Finally we must remember to update our, now non-empty, EOCD with a link back to our central directory. Until now we were just writing zero into the EOCD fields [size of central directory] and [offset of central directory].

We can easily capture the file offset before we start writing the Central Directory and get the size by subtracting it from the file offset after we have written the Central Directory.

/*
  Central Directory Entry
  https://en.wikipedia.org/wiki/ZIP_(file_format)
  == offset len description
  0   4   Central directory file header signature = 0x02014b50
  4   2   Version made by
  6   2   Version needed to extract (minimum)
  8   2   General purpose bit flag
  10  2   Compression method
  12  2   File last modification time
  14  2   File last modification date
  16  4   CRC-32 of uncompressed data
  20  4   Compressed size
  24  4   Uncompressed size
  28  2   File name length (n)
  30  2   Extra field length (m)
  32  2   File comment length (k)
  34  2   Disk number where file starts
  36  2   Internal file attributes
  38  4   External file attributes
  42  4   Relative offset of local file header. This is the number of bytes
    between the start of the first disk on which the file occurs, and the
    start of the local file header. This allows software reading the central
    directory to locate the position of the file inside the ZIP file.
  46  n   File name
  46+n  m   Extra field
  46+n+m  k   File comment
*/
static unsigned cdir(zfile_t files[], FILE *f){
  unsigned nfiles=0;
  enum{magic=0x02014b50,
    bit_flags=0, no_compression_method=0,
    mod_time=0, mod_date=0,
    no_extra=0, no_comment=0,
    disknum=0,
    fileattr_internal=0, fileattr_external=0,
    };
  cdoffset = curoffset;
  for(zfile_t *zf=files; zf->fname; zf++,nfiles++){
    size_t debug_size = curoffset;
    unsigned const fnamelen = strlen(zf->fname);
    putu32(magic, f);
    putu16(CREATOR_VERSION, f);
    putu16(EXTRACTOR_MIN_VERSION, f);
    putu16(bit_flags, f);
    putu16(no_compression_method, f);
    putu16(mod_time, f);
    putu16(mod_date, f);
    putu32(zf->crc32, f);
    putu32(zf->sz, f); // compressed size == uncompressed size cos not compressing
    putu32(zf->sz, f);
    putu16(fnamelen, f);
    putu16(no_extra, f);
    putu16(no_comment, f);
    putu16(disknum, f);
    putu16(fileattr_internal, f);
    putu32(fileattr_external, f);
    putu32(zf->offset, f);
    putbytes((uint8_t const *)zf->fname, fnamelen, f);
    // no extra field
    // no comment
    debug_size = curoffset-debug_size;
    assert((46+fnamelen)==debug_size);
  }
  return nfiles;
}

Let's try unpacking our putative archive again. We learned that zip stores timestamps in the MS-DOS format, with years counted from 1980, so our all-zero fields show up as 1980-00-00 (which is similar to the unix epoch 1970-01-01, except with bigger shoulder pads and more hair gel). At the moment I'm not too interested in the timestamps.

$ unzip -l ex.zip
Archive:  ex.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
     4596  1980-00-00 00:00   readme.zipformat
     6694  1980-00-00 00:00   mkzip.c
    18040  1980-00-00 00:00   mkzip
error:  expected central file header signature not found (file #4).
  (please check that you have transferred or created the zipfile in the
  appropriate BINARY mode and that you have compiled UnZip properly)

We do see the correct filenames and the sizes match the (current) sizes of the files. Good news!

$ wc -c readme.zipformat mkzip.c mkzip
 4596 readme.zipformat
 6694 mkzip.c
18040 mkzip

However, something is wrong with my central file directory. Maybe the contents, maybe my offset to it in EOCD is wrong?

The problem is in the EOCD. There are 2 fields, [number of central directory records on this disk] and [total number of central directory records]. I had initially set both of these to 1 because of a misunderstanding on my part that there is a single Central Directory record, rather than a sequence of them. After fixing that (already correct in the code above) I finally reach my crc32 checksum problem (i.e. that I've just written a zero for the checksum).

$ unzip -t ex.zip
Archive:  ex.zip
  testing: readme.zipformat         bad CRC 7261ec57  (should be 00000000)
  testing: mkzip.c                  bad CRC e7d510f6  (should be 00000000)
  testing: mkzip                    bad CRC 7858c036  (should be 00000000)

I could leave things like this since the zip archive will actually unpack in this state (with GNU/Linux [unzip] at least).

crc32

Finding the correct variant of the crc32 function for zip proved tricky. There is not a lot of information out there and what there is, is contradictory.

The best description, and the one which finally worked, was this one at OSDev. https://wiki.osdev.org/CRC32

Most of the code online uses a pre-generated look-up table with 256 entries, but this page actually showed the code to generate the table.

The algorithm given assumes a single byte string to checksum but I load files in blocks. So I must be careful to do the init and complete parts separately and run only the kernel on each block. (I wasn't careful, I repeatedly 1's-complemented the crc and got the wrong answer...)

Building the lookup-table. Normally this table would be generated to a C file once and included in the final zip tool. I just build it each time.

static uint32_t poly8_lookup[256];
static void crc32_mktab(void){
  uint32_t const crc_magic = 0xedb88320;
  uint32_t *table = poly8_lookup;
  uint8_t index=0,z;
  do{ // index is a uint8_t: it wraps from 255 to 0, which ends the loop after 256 entries
    table[index]=index;
    for(z=8;z;z--) table[index]=(table[index]&1)?(table[index]>>1)^crc_magic:table[index]>>1;
  }while(++index);
}

The kernel of the crc32 calculation, called on blocks of file bytes as I read them in: Trivial No?

static uint32_t crc32update(uint32_t crc, uint8_t const *p, size_t n){
  while (n-- !=0) crc = poly8_lookup[((uint8_t) crc ^ *(p++))] ^ (crc >> 8);
  return crc;
}

The main file crc32 function. It reads the entire file and rewinds it. It would be trivial to incorporate this into the file copying function and avoid reading files twice. That's why I was too lazy to do it...

static uint32_t filecrc32(FILE *f){
  uint32_t crc=0xffffffff;
  char buf[8192];
  size_t rsz;
  while((rsz=fread(buf, 1, sizeof(buf), f))){
    crc = crc32update(crc, (uint8_t const *)buf, rsz);
  }
  int const rc = fseek(f, 0, SEEK_SET); // rewind outside assert() so it still happens with NDEBUG
  assert(0==rc);
  (void)rc;
  return ~crc;
}
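One way to catch the repeated-complement mistake early is to test against the standard CRC-32 check value: the nine ASCII bytes "123456789" should give 0xcbf43926, however the input is split into blocks. The sketch below is a self-contained bit-at-a-time variant (no lookup table) with the same init / kernel / finish split; the names crc32_bitwise_update and crc32_chunked are made up for this example and are not part of mkzip.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Kernel only: no initial value, no final complement. Bit-at-a-time
   version using the reflected polynomial 0xedb88320. */
static uint32_t crc32_bitwise_update(uint32_t crc, uint8_t const *p, size_t n){
  while (n-- != 0) {
    crc ^= *p++;
    for (int bit = 0; bit < 8; bit++)
      crc = (crc & 1) ? (crc >> 1) ^ 0xedb88320u : crc >> 1;
  }
  return crc;
}

/* Checksum a buffer fed in pieces of at most `chunk` bytes: initialise once,
   run the kernel per piece, complement once at the very end. */
static uint32_t crc32_chunked(uint8_t const *p, size_t n, size_t chunk){
  uint32_t crc = 0xffffffffu;               /* init */
  assert(chunk > 0);
  while (n != 0) {
    size_t const step = n < chunk ? n : chunk;
    crc = crc32_bitwise_update(crc, p, step);
    p += step;
    n -= step;
  }
  return ~crc;                              /* finish */
}
```

Calling crc32_chunked on "123456789" with chunk sizes 9, 4 and 1 should return 0xcbf43926 each time. If the complement were applied inside the loop instead, only the single-chunk case would still match.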

After setting the correct crc32 in both the local file header and the central directory entry my zip archive now tests as good:

$ make mkzip && ./mkzip >ex.zip
$ unzip -t ex.zip
Archive:  ex.zip
  testing: readme                   OK
  testing: mkzip.c                  OK
  testing: mkzip                    OK
No errors detected in compressed data of ex.zip.

missing things

Of course I've ignored the dates, the internal/external file attributes and the proper version numbers for creator and extractor, but the zip archive appears healthy enough as it is. I'd be interested to hear how well it does on other unzip tools. I've tried GNU/Linux unzip and 7z and both worked without complaint.
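For anyone who does want real timestamps, the two 16-bit fields use the MS-DOS packed layout. Below is a minimal sketch of the packing; dos_date and dos_time are hypothetical helpers, not functions from mkzip.c, and their results would replace the zero mod_date and mod_time constants in the local file header and central directory entry.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a calendar date: bits 15-9 year-1980, bits 8-5 month, bits 4-0 day. */
static uint16_t dos_date(unsigned year, unsigned month, unsigned day){
  assert(year >= 1980 && year <= 2107);   /* 7 bits of year offset */
  assert(month >= 1 && month <= 12 && day >= 1 && day <= 31);
  return (uint16_t)(((year - 1980) << 9) | (month << 5) | day);
}

/* Pack a time of day: bits 15-11 hour, bits 10-5 minute, bits 4-0 seconds/2. */
static uint16_t dos_time(unsigned hour, unsigned min, unsigned sec){
  assert(hour < 24 && min < 60 && sec < 60);
  return (uint16_t)((hour << 11) | (min << 5) | (sec / 2));
}
```

As a check, dos_date(2020, 11, 26) gives 0x517a and dos_time(9, 58, 14) gives 0x4f47, which are the bytes 47 4f 7a 51 at offset 10 of the out.zip dump in the next section. Seconds are stored with 2-second resolution, so odd seconds are rounded down.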

reverse engineering zip (aka cheating)

One might wonder how our zip file differs from a 'real' zip file. Conveniently the 'real' zip tool has a [-Z store] option which allows us to produce a zip without compression, just like ours.

$ zip -Z store out.zip readme.zipformat mkzip.c mkzip
  adding: readme.zipformat (stored 0%)
  adding: mkzip.c (stored 0%)
  adding: mkzip (stored 0%)

Gasp, they are not identical! (No real surprise actually, considering all the liberties we've been taking with the format.)

$ cmp out.zip ex.zip
out.zip ex.zip differ: byte 5, line 1

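Looking in more detail we see lots of similarities. The 'real' zip file has a 'real' crc32 checksum, we just have zero (written before I fixed crc32). The 'real' zip file also sets the version needed to extract to 10 (that is byte 5, where cmp reports the first difference) and fills in the modification time and date. Finally, the 'real' zip file has an extra field after the filename, we have none.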

$ hd out.zip | head
00000000  50 4b 03 04 0a 00 00 00  00 00 47 4f 7a 51 a2 bb  |PK........GOzQ..|
00000010  b0 9b 04 16 00 00 04 16  00 00 10 00 1c 00 72 65  |..............re|
00000020  61 64 6d 65 2e 7a 69 70  66 6f 72 6d 61 74 55 54  |adme.zipformatUT| <--- the extra field we don't have
00000030  09 00 03 36 7c bf 5f 36  7c bf 5f 75 78 0b 00 01  |...6|._6|._ux...|
00000040  04 e8 03 00 00 04 e8 03  00 00 23 20 5a 49 50 20  |..........# ZIP |
$ hd ex.zip | head
00000000  50 4b 03 04 00 00 00 00  00 00 00 00 00 00 00 00  |PK..............|
00000010  00 00 04 16 00 00 04 16  00 00 10 00 00 00 72 65  |..............re|
00000020  61 64 6d 65 2e 7a 69 70  66 6f 72 6d 61 74 23 20  |adme.zipformat# | <--- file contents immediately after name
00000030  5a 49 50 20 66 69 6c 65  20 66 72 6f 6d 20 73 63  |ZIP file from sc|
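To read the dump without counting bytes by hand, the first 30 bytes of out.zip can be decoded with the local file header table from earlier. This is just a sketch: getu16, getu32, lfh_t and parse_lfh are names invented for this example, and the byte array is copied from the hd output above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read little-endian integers from a byte buffer. */
static uint16_t getu16(uint8_t const *p){ return (uint16_t)(p[0] | (p[1] << 8)); }
static uint32_t getu32(uint8_t const *p){
  return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
         ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical struct for the fixed 30-byte part of a local file header. */
typedef struct {
  uint16_t version, flags, method, mod_time, mod_date;
  uint32_t crc32, csize, usize;
  uint16_t fnamelen, extralen;
} lfh_t;

/* Returns 0 on success, -1 if the buffer is too short or the magic is wrong. */
static int parse_lfh(uint8_t const *p, size_t len, lfh_t *h){
  if (len < 30 || getu32(p) != 0x04034b50u) return -1;
  h->version  = getu16(p + 4);
  h->flags    = getu16(p + 6);
  h->method   = getu16(p + 8);
  h->mod_time = getu16(p + 10);
  h->mod_date = getu16(p + 12);
  h->crc32    = getu32(p + 14);
  h->csize    = getu32(p + 18);
  h->usize    = getu32(p + 22);
  h->fnamelen = getu16(p + 26);
  h->extralen = getu16(p + 28);
  return 0;
}

/* First 30 bytes of out.zip, copied from the hd output above. */
static uint8_t const out_zip_head[30] = {
  0x50,0x4b,0x03,0x04, 0x0a,0x00, 0x00,0x00, 0x00,0x00, 0x47,0x4f, 0x7a,0x51,
  0xa2,0xbb,0xb0,0x9b, 0x04,0x16,0x00,0x00, 0x04,0x16,0x00,0x00,
  0x10,0x00, 0x1c,0x00,
};
```

Parsing out_zip_head gives version 10, method 0 (stored), crc32 0x9bb0bba2, both sizes 5636, a 16-byte file name and a 28-byte extra field. Those 28 bytes are the "UT" and "ux" blocks visible in the dump, which look like the Info-ZIP timestamp and Unix uid/gid extra fields.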

mistakes

1. misunderstanding CD 'record' in the spec. I should have put nentries in the EOCD.

2. crc32 - expected this to be hard. Difficult and confusing information on the web. When unsure which of your 3 possible solutions is correct it's difficult to focus on one. I made a small mistake (complementing many times, not just once at the end). This cost me the most time.

concepts

These reverse-engineering concepts are covered in this article.

the promised code

The code for producing an uncompressed zip file. It takes a list of file names on the command line and writes the zip file to stdout. Use it like this:

$ ./mkzip file1 file2 .... fileN >ex.zip
/src/mkzip.c

Tags: reverse-engineering bit-twiddling C compression file-formats zip