Let's try creating a ZIP file from scratch. Since I'm mostly interested in the layout and meta-data we'll be using the option to add the files uncompressed. That means I don't have to learn the [deflate] compression algorithm at the same time. The resulting knowledge is still useful in that it will allow us to bundle a set of files into a single, ubiquitous format.
Unix [cpio] (or even [tar]) would be easier, but it's not ubiquitous... I'm looking at you, DOS.
I'll be using C for this, since it's ubiquitous... since I like it. Also, C is good for the kind of low-level bit manipulation we'll be doing here.
Even if you don't like C there is still some useful information here about the zip file format.
(I was going to use Common Lisp, but it's not ubiquitous.)
Useful resources:
Zip files are read from the end. This allows the zip file to be built in a single pass. (It also allows some other tricks, like prefixing data at the start of a zip file, e.g. an executable to unpack the remainder.)
A zip reader finds the final record by searching backwards for the record's magic number, 0x06054b50 (a record is just a block of bytes in the file).
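The backwards search can be sketched roughly as follows. This is not from the article's code: find_eocd is a hypothetical helper, and it assumes the whole archive (or at least its tail) is already in memory. A robust reader would also check that the comment length field is consistent, since the same four bytes could turn up inside a comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return the offset of the last EOCD signature
   in buf, or -1 if there is none. */
static long find_eocd(uint8_t const *buf, size_t len){
    enum{ eocd_min_size=22 };  /* size of an EOCD with an empty comment */
    /* 0x06054b50 written little-endian is the byte sequence 50 4b 05 06 */
    static uint8_t const sig[4] = {0x50, 0x4b, 0x05, 0x06};
    assert(buf != NULL);
    if(len < eocd_min_size) return -1;
    /* start at the last position where a complete 22-byte record still fits
       and walk towards the start of the buffer */
    for(size_t i = len - eocd_min_size + 1; i-- > 0; ){
        if(buf[i]==sig[0] && buf[i+1]==sig[1] && buf[i+2]==sig[2] && buf[i+3]==sig[3])
            return (long)i;
    }
    return -1;
}
```

Starting the scan 22 bytes from the end means an archive with no comment is found on the first comparison.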
Every zip file must contain an End of Central Directory Record, and the simplest zip file consists of only an End of Central Directory Record. That would be an empty zip file, but a zip file nevertheless.
Note the mention of disk numbers: this file format was invented back in the days of archives spanning multiple floppy disks.
The EOCD format is given here. https://en.wikipedia.org/wiki/ZIP_(file_format)#End_of_central_directory_record_(EOCD)
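Remember that all integers (including the magic) must be written little-endian (i.e. byte reversed, least significant byte first). I will assume your platform does that automatically since most people will be using an Intel/AMD processor. (If your platform is big endian, note that the C functions ntohs() and ntohl() will not help: they convert from big-endian network order and do nothing on a big-endian host. You would need to swap the bytes yourself or write them out one at a time.)
This is the C function which will write the EOCD record. The full code will be linked at the bottom of the article.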
    /* EOCD record [End of Central Directory] (all integers little-endian)
       https://en.wikipedia.org/wiki/ZIP_(file_format)#End_of_central_directory_record_(EOCD)
       ==
       offset len description
       0      4   End of central directory signature = 0x06054b50
       4      2   Number of this disk
       6      2   Disk where central directory starts
       8      2   Number of central directory records on this disk
       10     2   Total number of central directory records
       12     4   Size of central directory (bytes)
       16     4   Offset of start of central directory, relative to start of archive
       20     2   Comment length (n)
       22     n   Comment
    */
    static void eocd(unsigned nentries, FILE *f){
        enum{magic=0x06054b50, disknum=0, no_comment=0, };
        size_t debug_size = curoffset;
        uint32_t const cdsz = curoffset - cdoffset;
        putu32(magic, f);
        putu16(disknum, f);
        putu16(disknum, f);
        putu16(nentries, f);
        putu16(nentries, f);
        putu32(cdsz, f);
        putu32(cdoffset, f);
        putu16(no_comment, f);
        // no comment to put
        debug_size = curoffset-debug_size;
        assert(22==debug_size);
    }
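The putu16/putu32 helpers used above are not shown in the article. If you would rather not depend on the host's byte order, one possible approach is to store each byte explicitly. The sketch below is an assumption about how that could look, not the article's implementation; store_le16 and store_le32 are hypothetical names, and they fill a small buffer which could then be handed to fwrite().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers: store v in out[] least significant byte first,
   regardless of the byte order of the machine running the code. */
static void store_le16(uint8_t out[2], uint16_t v){
    out[0] = (uint8_t)(v & 0xff);
    out[1] = (uint8_t)(v >> 8);
}

static void store_le32(uint8_t out[4], uint32_t v){
    out[0] = (uint8_t)(v & 0xff);
    out[1] = (uint8_t)((v >> 8) & 0xff);
    out[2] = (uint8_t)((v >> 16) & 0xff);
    out[3] = (uint8_t)(v >> 24);
}

/* Usage: the EOCD signature 0x06054b50 must appear in the file as
   the bytes 50 4b 05 06 ("PK" followed by 05 06). */
static void store_le_example(void){
    uint8_t sig[4];
    store_le32(sig, 0x06054b50u);
    assert(sig[0]==0x50 && sig[1]==0x4b && sig[2]==0x05 && sig[3]==0x06);
}
```

Because the shifts operate on values rather than memory, the same code produces the same bytes on little-endian and big-endian hosts.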
Build the code and call the program to produce our zip file (on stdout):
    $ make mkzip
    $ ./mkzip >ex.zip
    $ file ex.zip
    ex.zip: Zip archive data (empty)
Congratulations! We have created a zip file from scratch. That will be all for today. Thank you for watching...
    $ unzip -l ex.zip
    Archive: ex.zip
    warning [ex.zip]: zipfile is empty
Ok, so we'd actually like to add some files to our archive. Otherwise what's the point?
Every file we add must be prefixed with a local file header and, after all the files are added, we add a Central Directory record which contains the locations (offsets) of all those files.
Firstly let's add the local file header. It needs the length of the file and the length of the file name. No problem. More difficult is that the header needs the CRC32 checksum of the file (to allow the zip reader to check the integrity of the archive). https://en.wikipedia.org/wiki/ZIP_(file_format)#Local_file_header
There is also a trailing data descriptor record, which can be used to add the CRC32 checksum and file size/compressed size after we've written the file. This can help simplify the zip writer's job. We will not use it here.
I have a feeling that CRC32 checksums will be difficult. So let's just write a zero crc32 and see what happens...
https://en.wikipedia.org/wiki/Cyclic_redundancy_check#CRC-32_algorithm
Call local_file_header for each file we want to add to the zip. I've defined a struct zfile_t to hold the file name and some other information like its size and crc32 (set to zero for now). Full code linked at the end.
    /* Local File Record
       https://en.wikipedia.org/wiki/ZIP_(file_format)
       ==
       offset len description
       0      4   Local file header signature = 0x04034b50 (read as a little-endian number)
       4      2   Version needed to extract (minimum)
       6      2   General purpose bit flag
       8      2   Compression method
       10     2   File last modification time
       12     2   File last modification date
       14     4   CRC-32 of uncompressed data
       18     4   Compressed size
       22     4   Uncompressed size
       26     2   File name length (n)
       28     2   Extra field length (m)
       30     n   File name
       30+n   m   Extra field
    */
    static void local_file_header(zfile_t *zf, FILE *f){
        enum{magic=0x04034b50, bit_flags=0, no_compression_method=0, mod_time=0, mod_date=0, no_extra=0, };
        size_t debug_size = curoffset;
        unsigned const fnamelen = strlen(zf->fname);
        putu32(magic, f);
        putu16(EXTRACTOR_MIN_VERSION, f);
        putu16(bit_flags, f);
        putu16(no_compression_method, f);
        putu16(mod_time, f);
        putu16(mod_date, f);
        putu32(zf->crc32, f);
        putu32(zf->sz, f); // compressed size == uncompressed size cos not compressing
        putu32(zf->sz, f);
        putu16(fnamelen, f);
        putu16(no_extra, f);
        putbytes((uint8_t const *)zf->fname, fnamelen, f);
        debug_size = curoffset-debug_size;
        assert((30+fnamelen)==debug_size);
    }
If we try to unzip the file now we get this error. We don't have the Central Directory record yet.
    $ unzip -l ex.zip
    Archive: ex.zip
    warning [ex.zip]: 25784 extra bytes at beginning or within zipfile
      (attempting to process anyway)
    warning [ex.zip]: zipfile is empty
The Central Directory is the record/block that lists the files in the archive (it need not contain all the files we added, nor list them in the same order; this allows deleting files without rewriting the entire archive, important if you have 10 floppy disks...).
The central directory doesn't have a separate header. It's not really a record, more a sequence of records, one per file of interest. These records are similar to the local file header records but have a few extra fields.
Finally we must remember to update our, now non-empty, EOCD with a link back to our central directory record. Until now we were just writing zero into the EOCD fields [size of central directory] and [offset of central directory].
We can easily capture the file offset before we start writing the Central Directory and get the size by subtracting it from the file offset after we have written the Central Directory.
    /* Central Directory Entry
       https://en.wikipedia.org/wiki/ZIP_(file_format)
       ==
       offset len description
       0      4   Central directory file header signature = 0x02014b50
       4      2   Version made by
       6      2   Version needed to extract (minimum)
       8      2   General purpose bit flag
       10     2   Compression method
       12     2   File last modification time
       14     2   File last modification date
       16     4   CRC-32 of uncompressed data
       20     4   Compressed size
       24     4   Uncompressed size
       28     2   File name length (n)
       30     2   Extra field length (m)
       32     2   File comment length (k)
       34     2   Disk number where file starts
       36     2   Internal file attributes
       38     4   External file attributes
       42     4   Relative offset of local file header. This is the number of
                  bytes between the start of the first disk on which the file
                  occurs, and the start of the local file header. This allows
                  software reading the central directory to locate the position
                  of the file inside the ZIP file.
       46     n   File name
       46+n   m   Extra field
       46+n+m k   File comment
    */
    static unsigned cdir(zfile_t files[], FILE *f){
        unsigned nfiles=0;
        enum{magic=0x02014b50, bit_flags=0, no_compression_method=0, mod_time=0, mod_date=0,
             no_extra=0, no_comment=0, disknum=0, fileattr_internal=0, fileattr_external=0, };
        cdoffset = curoffset;
        for(zfile_t *zf=files; zf->fname; zf++,nfiles++){
            size_t debug_size = curoffset;
            unsigned const fnamelen = strlen(zf->fname);
            putu32(magic, f);
            putu16(CREATOR_VERSION, f);
            putu16(EXTRACTOR_MIN_VERSION, f);
            putu16(bit_flags, f);
            putu16(no_compression_method, f);
            putu16(mod_time, f);
            putu16(mod_date, f);
            putu32(zf->crc32, f);
            putu32(zf->sz, f); // compressed size == uncompressed size cos not compressing
            putu32(zf->sz, f);
            putu16(fnamelen, f);
            putu16(no_extra, f);
            putu16(no_comment, f);
            putu16(disknum, f);
            putu16(fileattr_internal, f);
            putu32(fileattr_external, f);
            putu32(zf->offset, f);
            putbytes((uint8_t const *)zf->fname, fnamelen, f);
            // no extra field
            // no comment
            debug_size = curoffset-debug_size;
            assert((46+fnamelen)==debug_size);
        }
        return nfiles;
    }
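The curoffset and cdoffset bookkeeping is not shown above. Since the program writes to stdout, which may be a pipe that cannot report a position, it seems reasonable to count the bytes as they are written. A minimal sketch of that idea follows; zwriter, zw_put and demo_cd_size are hypothetical names invented for this illustration, and the NULL "count only" mode is an assumption added so the arithmetic can be checked without producing output.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical byte-counting writer. */
typedef struct {
    FILE    *f;       /* destination; NULL means "count only" */
    uint32_t offset;  /* number of bytes emitted so far */
} zwriter;

/* Write n bytes and advance the offset. Returns 0 on success, -1 on a short write. */
static int zw_put(zwriter *w, void const *p, size_t n){
    assert(w != NULL);
    if(w->f && n && fwrite(p, 1, n, w->f) != n) return -1;
    w->offset += (uint32_t)n;
    return 0;
}

/* Emit two placeholder 46-byte directory entries (no names) and return
   the directory size computed as "offset after" minus "offset before". */
static uint32_t demo_cd_size(zwriter *w){
    uint32_t const cdoffset = w->offset;   /* capture before the directory */
    uint8_t const entry[46] = {0};
    zw_put(w, entry, sizeof entry);
    zw_put(w, entry, sizeof entry);
    return w->offset - cdoffset;           /* size of the central directory */
}
```

The two values the EOCD needs (offset and size of the central directory) both fall out of the same counter, so no seeking is required.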
Let's try unpacking our putative archive again. We learned that zip uses a date format based on 1980, so all-zero fields show up as 1980-00-00 (which is similar to the unix epoch 1970-01-01, except with bigger shoulder pads and more hair gel). At the moment I'm not too interested in the timestamps.
    $ unzip -l ex.zip
    Archive: ex.zip
      Length      Date    Time    Name
    ---------  ---------- -----   ----
         4596  1980-00-00 00:00   readme.zipformat
         6694  1980-00-00 00:00   mkzip.c
        18040  1980-00-00 00:00   mkzip
    error: expected central file header signature not found (file #4).
      (please check that you have transferred or created the zipfile in the
      appropriate BINARY mode and that you have compiled UnZip properly)
We do see the correct filenames and the sizes match (the current) sizes of the files. Good news!
    $ wc -c readme.zipformat mkzip.c mkzip
     4596 readme.zipformat
     6694 mkzip.c
    18040 mkzip
However, something is wrong with my central file directory. Maybe the contents, maybe my offset to it in EOCD is wrong?
The problem is in the EOCD. There are 2 fields, [number of central directory records] and [total number of central directory records]. I had initially set both of these to 1 because of a misunderstanding on my part: I thought there was a single Central Directory record, rather than a sequence of them. After fixing that (already correct in the code above) I finally reach my crc32 checksum problem (i.e. that I've just written a zero for the checksum).
    $ unzip -t ex.zip
    Archive: ex.zip
        testing: readme.zipformat   bad CRC 7261ec57 (should be 00000000)
        testing: mkzip.c            bad CRC e7d510f6 (should be 00000000)
        testing: mkzip              bad CRC 7858c036 (should be 00000000)
I could leave things like this since the zip archive will actually unpack in this state (with GNU/linux [unzip] at least).
Finding the correct variant of the crc32 function for zip proved tricky. There is not a lot of information out there and what there is, is contradictory.
The best description, and the one which finally worked, was this one at OSDev. https://wiki.osdev.org/CRC32
Most of the code online uses a pre-generated look-up table with 256 entries, but this page actually showed the code to generate the table.
The algorithm given assumes a single byte string to checksum, but I load files in blocks. So I must be careful to do the init and complete parts separately and run only the kernel on each block. (I wasn't careful: I repeatedly 1's-complemented the crc and got the wrong answer...)
Building the lookup-table. Normally this table would be generated to a C file once and included in the final zip tool. I just build it each time.
    static uint32_t poly8_lookup[256];

    static void crc32_mktab(void){
        uint32_t const crc_magic = 0xedb88320;
        uint32_t *table = poly8_lookup;
        uint8_t index=0,z;
        do{
            table[index]=index;
            for(z=8;z;z--)
                table[index]=(table[index]&1)?(table[index]>>1)^crc_magic:table[index]>>1;
        }while(++index);
    }

The kernel of the crc32 calculation, called on blocks of file bytes as I read them in. Trivial, no?

    static uint32_t crc32update(uint32_t crc, uint8_t const *p, size_t n){
        while (n-- !=0)
            crc = poly8_lookup[((uint8_t) crc ^ *(p++))] ^ (crc >> 8);
        return crc;
    }
The main file crc32 function. It reads the entire file and rewinds it. It would be trivial to incorporate this into the file copying function and avoid reading files twice. That's why I was too lazy to do it...
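    static uint32_t filecrc32(FILE *f){
        uint32_t crc=0xffffffff;
        char buf[8192];
        size_t rsz;
        while((rsz=fread(buf, 1, sizeof(buf), f))){
            crc = crc32update(crc, (uint8_t const *)buf, rsz);
        }
        // keep the fseek() call outside assert() so the file is still
        // rewound when asserts are compiled out with NDEBUG
        int const rc = fseek(f, 0, SEEK_SET);
        assert(0==rc);
        (void)rc;
        return ~crc;
    }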
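One way to sanity-check the init / update / complete split is the widely published check value for this CRC variant: the CRC-32 of the ASCII string "123456789" is 0xcbf43926. The sketch below is only an illustration with hypothetical names. It swaps the lookup table for a bit-at-a-time update (same polynomial, 0xedb88320) so that it stands alone, and it shows that feeding the data in two blocks gives the same answer as one block, provided the complement happens exactly once at the end.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit-at-a-time kernel; should be equivalent to a table-driven update
   using the same reflected polynomial. */
static uint32_t crc32_update_bitwise(uint32_t crc, uint8_t const *p, size_t n){
    assert(p != NULL || n == 0);
    while(n--){
        crc ^= *p++;
        for(int k=0; k<8; k++)
            crc = (crc & 1) ? (crc >> 1) ^ 0xedb88320u : crc >> 1;
    }
    return crc;
}

/* Whole buffer in one go: init, update, complement once. */
static uint32_t crc32_whole(uint8_t const *p, size_t n){
    uint32_t crc = 0xffffffffu;
    crc = crc32_update_bitwise(crc, p, n);
    return ~crc;
}

/* Same data in two blocks: init once, update twice, complement once. */
static uint32_t crc32_two_blocks(uint8_t const *p, size_t n, size_t split){
    uint32_t crc = 0xffffffffu;
    crc = crc32_update_bitwise(crc, p, split);
    crc = crc32_update_bitwise(crc, p + split, n - split);
    return ~crc;
}
```

If a block-wise implementation disagrees with the single-buffer one on a short known string, the init or complement step is probably being repeated per block.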
After setting the correct crc32 in both the local file header and the central directory entry my zip archive now tests as good:
    $ make mkzip && ./mkzip >ex.zip
    $ unzip -t ex.zip
    Archive: ex.zip
        testing: readme    OK
        testing: mkzip.c   OK
        testing: mkzip     OK
    No errors detected in compressed data of ex.zip.
Of course I've ignored the dates, the internal/external file attributes and the proper version numbers for creator and extractor, but the zip archive appears healthy enough as it is. I'd be interested to hear how well it does on other unzip tools. I've tried GNU/linux unzip and 7z and both worked without complaint.
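For anyone who does want real timestamps: the two 16-bit fields use the MS-DOS packed format, with the year stored as an offset from 1980 and seconds stored in 2-second units. A sketch of the packing, using hypothetical helper names that are not part of mkzip.c and doing no validation beyond the year range:

```c
#include <assert.h>
#include <stdint.h>

/* Date: bits 15-9 = year - 1980, bits 8-5 = month (1-12), bits 4-0 = day (1-31). */
static uint16_t dos_date(unsigned year, unsigned month, unsigned day){
    assert(year >= 1980 && year <= 2107);  /* 7 bits of year offset */
    return (uint16_t)(((year - 1980) << 9) | (month << 5) | day);
}

/* Time: bits 15-11 = hour, bits 10-5 = minute, bits 4-0 = seconds / 2. */
static uint16_t dos_time(unsigned hour, unsigned min, unsigned sec){
    return (uint16_t)((hour << 11) | (min << 5) | (sec / 2));
}
```

For example, dos_time(9, 58, 14) gives 0x4f47 and dos_date(2020, 11, 26) gives 0x517a, which are the bytes 47 4f 7a 51 at offset 10 in the hex dump of out.zip in the comparison with the 'real' zip tool. Writing zero in both fields, as we do, is what makes unzip print 1980-00-00 00:00.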
One might wonder how our zip file differs from a 'real' zip file. Conveniently the 'real' zip tool has a [-Z store] option which allows us to produce a zip without compression, just like ours.
    $ zip -Z store out.zip readme.zipformat mkzip.c mkzip
      adding: readme.zipformat (stored 0%)
      adding: mkzip.c (stored 0%)
      adding: mkzip (stored 0%)

Gasp, they are not identical! (No real surprise actually, considering all the liberties we've been taking with the format.)

    $ cmp out.zip ex.zip
    out.zip ex.zip differ: byte 5, line 1
Looking in more detail we see lots of similarities. The 'real' zip file has a 'real' crc32 checksum, we just have zero (written before I fixed crc32). The 'real' zip file has an extra field after the filename, we have none.
    $ hd out.zip | head
    00000000 50 4b 03 04 0a 00 00 00 00 00 47 4f 7a 51 a2 bb |PK........GOzQ..|
    00000010 b0 9b 04 16 00 00 04 16 00 00 10 00 1c 00 72 65 |..............re|
    00000020 61 64 6d 65 2e 7a 69 70 66 6f 72 6d 61 74 55 54 |adme.zipformatUT| <--- the extra field we don't have
    00000030 09 00 03 36 7c bf 5f 36 7c bf 5f 75 78 0b 00 01 |...6|._6|._ux...|
    00000040 04 e8 03 00 00 04 e8 03 00 00 23 20 5a 49 50 20 |..........# ZIP |

    $ hd ex.zip | head
    00000000 50 4b 03 04 00 00 00 00 00 00 00 00 00 00 00 00 |PK..............|
    00000010 00 00 04 16 00 00 04 16 00 00 10 00 00 00 72 65 |..............re|
    00000020 61 64 6d 65 2e 7a 69 70 66 6f 72 6d 61 74 23 20 |adme.zipformat# | <--- file contents immediately after name
    00000030 5a 49 50 20 66 69 6c 65 20 66 72 6f 6d 20 73 63 |ZIP file from sc|
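Reading the dump by eye gets tedious, so here is a rough sketch of decoding a local file header from raw bytes. It is not part of mkzip.c: lfh_t, parse_lfh and the other names are hypothetical, and the sample bytes are simply the first 30 bytes of out.zip copied from the dump above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint16_t get_le16(uint8_t const *p){ return (uint16_t)(p[0] | (p[1] << 8)); }
static uint32_t get_le32(uint8_t const *p){
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical in-memory view of the fixed 30-byte part of a local file header. */
typedef struct {
    uint16_t version, flags, method, mod_time, mod_date;
    uint32_t crc32, csize, usize;
    uint16_t namelen, extralen;
} lfh_t;

/* Returns 0 on success, -1 if the buffer is too short or the signature is wrong. */
static int parse_lfh(uint8_t const *p, size_t len, lfh_t *h){
    assert(p != NULL && h != NULL);
    if(len < 30 || get_le32(p) != 0x04034b50u) return -1;
    h->version  = get_le16(p + 4);
    h->flags    = get_le16(p + 6);
    h->method   = get_le16(p + 8);
    h->mod_time = get_le16(p + 10);
    h->mod_date = get_le16(p + 12);
    h->crc32    = get_le32(p + 14);
    h->csize    = get_le32(p + 18);
    h->usize    = get_le32(p + 22);
    h->namelen  = get_le16(p + 26);
    h->extralen = get_le16(p + 28);
    return 0;
}

/* File contents start after the fixed header, the name and the extra field. */
static size_t lfh_data_offset(lfh_t const *h){
    return (size_t)30 + h->namelen + h->extralen;
}

/* First 30 bytes of out.zip, copied from the hex dump. */
static uint8_t const out_zip_head[30] = {
    0x50,0x4b,0x03,0x04, 0x0a,0x00, 0x00,0x00, 0x00,0x00, 0x47,0x4f, 0x7a,0x51,
    0xa2,0xbb,0xb0,0x9b, 0x04,0x16,0x00,0x00, 0x04,0x16,0x00,0x00, 0x10,0x00, 0x1c,0x00
};
```

Parsing out_zip_head gives a name length of 16 ("readme.zipformat") and an extra field of 28 bytes, so the file contents begin at offset 30+16+28 = 74 (0x4a), which is where "# ZIP " appears in the dump. In ex.zip the extra length is 0, so the contents start at offset 46.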
1. misunderstanding CD 'record' in the spec. I should have put nentries in the EOCD.
2. crc32 - expected this to be hard. Difficult and confusing information on the web. When unsure which of your 3 possible solutions is correct, it's difficult to focus on one. I made a small mistake (complementing many times, not just once at the end). This cost me the most time.
These reverse-engineering concepts are covered in this article.
The code for producing an uncompressed zip file. It takes a list of file names on the command line and writes the zip file to stdout. Use it like this:
    $ ./mkzip file1 file2 .... fileN >ex.zip

/src/mkzip.c