Yesterday (2010-11-06), when I took a break from studying, I started looking at the Mach-O format used by Apple’s Mac OS X. It is a format for executables, similar in purpose to the PE (Portable Executable) format used by Microsoft Windows. That said, neither format is limited to runnable executables; both can also describe a dynamic library (usually denoted by the .dll extension on Windows). I started working on a simple C program where you provide the path to a Mach-O binary and it prints out the information in it. I am aware there is objdump and a few other tools which will do this, but I was after learning the format, not just a way to decode it, and none of them had a nice modular/library interface for doing such a thing anyway.

The binary I was looking at was a “Mach-O fat file with 2 architectures”, but more on this later. I was able to use 7-Zip to extract one of the architectures contained in it. The format is documented on the Apple Developer site, and the main structure is as follows:

struct mach_header
{
    uint32_t magic;
    cpu_type_t cputype;
    cpu_subtype_t cpusubtype;
    uint32_t filetype;
    uint32_t ncmds;
    uint32_t sizeofcmds;
    uint32_t flags;
};

 

Ignoring the error checking, the actual workload looked something like this:

struct mach_header header;
/* Read the header from the very start of the (single-architecture) file. */
size_t bufferSize = fread(&header, 1, sizeof(header), file);
printf("mach object header\n");
printf("magic number\t%08x\n", header.magic);
/* cputype_to_string and filetype_to_string are helpers that map the numeric value to a name. */
printf("cputype\t\t%08x\t%s\n", header.cputype, cputype_to_string(false, header.cputype));
printf("subcputype\t%08x\n", header.cpusubtype);
printf("filetype\t%08x\t%s\n", header.filetype, filetype_to_string(header.filetype));
printf("ncmds\t\t%08x\n", header.ncmds);
printf("sizeofcmds\t%08x\n", header.sizeofcmds);
printf("flags\t\t%08x\n", header.flags);
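The filetype_to_string helper used above is not shown, so here is a minimal sketch of what such a lookup could look like. The name filetype_name_sketch is hypothetical and this is not necessarily how the real helper works. The numeric values are the MH_* file type constants from <mach-o/loader.h>, written as literals so the sketch compiles without Apple's headers, and only a handful of them are covered.

```c
#include <stdint.h>

/* Hypothetical helper (an assumption, not the filetype_to_string used above).
   Maps a mach_header filetype value to the name of its MH_* constant. */
static const char *filetype_name_sketch(uint32_t filetype)
{
    switch (filetype) {
    case 0x1: return "MH_OBJECT";    /* relocatable object file */
    case 0x2: return "MH_EXECUTE";   /* executable */
    case 0x4: return "MH_CORE";      /* core file */
    case 0x6: return "MH_DYLIB";     /* dynamic library */
    case 0x7: return "MH_DYLINKER";  /* dynamic linker */
    case 0x8: return "MH_BUNDLE";    /* bundle */
    default:  return "unknown";      /* not covered by this sketch */
    }
}
```

A switch keeps the mapping readable; a lookup table indexed by filetype would work just as well for such a small range.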
 

Once I had that working, the next step I chose to take was to look into the “fat” file again. With the above in place, I realised that a “fat” executable has a different magic number, and it matches the fat magic number defined in the documentation. Instead of a mach_header there is a fat_header, which holds the magic number and the number of architectures. Straight after that structure in the file is an array of fat_arch structures containing the information about where to find the mach_header for each architecture. It would seem the Apple marketing term for a “fat” executable is a Universal binary, as it can be used on more than one platform.
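To make the fat layout concrete, here is a rough sketch of decoding the fat_header and the fat_arch array from a buffer holding the start of the file. The names fat_arch_sketch, parse_fat_sketch and read_be32 are made up for this example, and the structure is a local stand-in for the one in <mach-o/fat.h> so it compiles without Apple's headers. One detail it relies on: the fat structures are stored big-endian on disk, so each field is assembled byte by byte instead of being read straight into a struct.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for struct fat_arch (assumption: the real cputype and
   cpusubtype fields are signed; unsigned is used here for simplicity). */
struct fat_arch_sketch {
    uint32_t cputype;
    uint32_t cpusubtype;
    uint32_t offset;  /* where this architecture's data starts in the file */
    uint32_t size;    /* how many bytes it occupies */
    uint32_t align;   /* alignment as a power of two */
};

#define FAT_MAGIC_SKETCH 0xcafebabeu  /* same value as FAT_MAGIC */
#define FAT_HEADER_BYTES 8u           /* magic + number of architectures */
#define FAT_ARCH_BYTES   20u          /* five 32-bit fields */

/* Assemble a 32-bit value from four big-endian bytes. */
static uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Decode the fat header and its fat_arch array from the first len bytes of
   a file. Returns the number of entries written to out, or -1 if the buffer
   is not a fat file, is truncated, or holds more than max_out entries. */
static int parse_fat_sketch(const unsigned char *buf, size_t len,
                            struct fat_arch_sketch *out, size_t max_out)
{
    assert(buf != NULL && out != NULL);
    if (len < FAT_HEADER_BYTES || read_be32(buf) != FAT_MAGIC_SKETCH)
        return -1;

    uint32_t n = read_be32(buf + 4);
    if (n > max_out || n > (len - FAT_HEADER_BYTES) / FAT_ARCH_BYTES)
        return -1;

    for (uint32_t i = 0; i < n; i++) {
        const unsigned char *p = buf + FAT_HEADER_BYTES + (size_t)i * FAT_ARCH_BYTES;
        out[i].cputype    = read_be32(p);
        out[i].cpusubtype = read_be32(p + 4);
        out[i].offset     = read_be32(p + 8);
        out[i].size       = read_be32(p + 12);
        out[i].align      = read_be32(p + 16);
    }
    return (int)n;
}
```

Working from a byte buffer keeps the parsing separate from the file I/O, which should make it easier to test with hand-built headers.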

This diagram shows the layout of data for a fat executable, including the fields in the fat_header and fat_arch structures. Something I found interesting was the offset field: it is measured from the start of the file, not from the location of the offset field itself or from the start of the fat_arch structure. After the fat_arch array should be the first mach_header structure. Something that may be interesting to follow up is whether the mach_header comes straight after the last fat_arch structure, or whether there is padding/zeroed space or some other undocumented metadata that I haven’t come across in my studies so far. Since the location is given by the offset, it would suggest there is no requirement for it to come straight after.
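Because the offset is absolute, getting to an architecture's data is a single seek from the beginning of the file. A small sketch of that, using a made-up helper called read_magic_at_sketch that returns the four raw bytes found at a given offset (for a fat_arch entry these would be the first bytes of that architecture's mach_header):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: read the 4 raw bytes at an absolute file offset,
   such as the offset field of a fat_arch entry.
   Returns 0 on success and -1 if the seek or the read fails. */
static int read_magic_at_sketch(FILE *file, uint32_t offset, unsigned char magic[4])
{
    assert(file != NULL);
    /* SEEK_SET measures from the start of the file, which is how the
       offset field is defined. */
    if (fseek(file, (long)offset, SEEK_SET) != 0)
        return -1;
    return fread(magic, 1, 4, file) == 4 ? 0 : -1;
}
```

The bytes are returned as they appear on disk, with no byte swapping, so the caller decides how to interpret them.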

[Diagram: macho_header]

I was then able to seek to the offset given by one of the fat_arch entries, and the code I had previously written for reading the mach_header and printing it out worked. The next step from here is to decode the load commands into the appropriate structures; a problem here is that there are 37 structures. I can see two ways of dealing with this. The first is to define each structure and read into it, which lets me print out information as I go, but requires doing it for all 37 structures. The second option is to allocate generically: the command structures all start with the same 2 fields, the type and the size, so the size can be used to allocate the amount of memory the command requires. The command can then be read in as raw bytes, delaying the choice of structure until later, when it can be cast on use.

Another aspect I will need to think about soon is an appropriate API and how to make the current code more modular. For example, for a fat binary it would give the array of fat_arch structures, which can then be used to get a mach_header, followed by a load_commands or maybe even just get_commands call. That is just some food for thought.
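The second, generic option could be sketched roughly as follows. This assumes the whole load command area (sizeofcmds bytes, read from straight after the mach_header) is already in memory and that the binary uses the host's byte order. The names load_command_sketch and index_load_commands_sketch are invented for this example. The function only records where each command starts, leaving the choice of structure until later.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The two fields every load command starts with (a local stand-in for
   struct load_command in <mach-o/loader.h>). */
struct load_command_sketch {
    uint32_t cmd;      /* type of load command */
    uint32_t cmdsize;  /* total size in bytes, including these two fields */
};

/* Walk a buffer of sizeofcmds bytes holding ncmds load commands and record
   the byte offset of each one in offsets_out.
   Returns the number of commands indexed, or -1 if a command is malformed,
   runs past the buffer, or there are more than max_out commands.
   Assumption: the fields are in host byte order. */
static int index_load_commands_sketch(const unsigned char *cmds,
                                      uint32_t sizeofcmds, uint32_t ncmds,
                                      uint32_t *offsets_out, size_t max_out)
{
    assert(cmds != NULL && offsets_out != NULL);
    if (ncmds > max_out)
        return -1;

    uint32_t pos = 0;  /* always <= sizeofcmds */
    for (uint32_t i = 0; i < ncmds; i++) {
        struct load_command_sketch lc;
        if (sizeofcmds - pos < sizeof lc)
            return -1;
        /* memcpy avoids any alignment assumptions about the buffer. */
        memcpy(&lc, cmds + pos, sizeof lc);
        if (lc.cmdsize < sizeof lc || lc.cmdsize > sizeofcmds - pos)
            return -1;
        offsets_out[i] = pos;
        pos += lc.cmdsize;
    }
    return (int)ncmds;
}
```

With the offsets in hand, a later pass can look at the cmd field at each offset and copy cmdsize bytes into whichever specific structure matches, so only the commands that are actually of interest need their own handling.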