Wednesday, December 31, 2008

What I'm working on at Juniper

I haven't had much time to update my blog for a while.  It's mostly been work, so I think I'll tell you all about it.

I'm loving my job at Juniper, although I must say it's keeping me very busy lately.  The project I'm on is coming pretty close to the wire.  In the weeks approaching the Christmas holidays, I was spending every waking minute working on my code to get it ready.  That left me able to go into Christmas and enjoy it without worrying about work (more about my vacation in another post, but the basic summary is, it was terrific!).  Even though my part is working great, there are some other parts of the project that are a bit tense, so I'm helping where I can.  Since I got back from vacation, though, things have been a bit more relaxed for me.

Anyway, I can't go into all the background on the project, but I can tell you about my part.  (This gets a bit technical from here on in, so you can skip it if you're not a computer geek.)  The goal is to make it possible to compile with source code on NFS servers and object code on local disk.  The source then can be edited remotely on workstations, and easily backed up; the objects don't take up expensive NetApp space, and the compile proceeds at local speeds.  This mechanism speeds up compiles by a factor of 2-3.

The core of this is a new filesystem layer that I wrote, lcachefs.  This is a type of loopback filesystem, similar to nullfs or unionfs.  lcachefs does some magic mirroring involving three directories.  As with any filesystem, one of these is the mountpoint: where lcachefs shows its results.  Unlike most filesystems, though, it involves two sources: one is known as the source directory (which holds the NFS-mounted sources), and the other is a storage directory (local disk).

Ok, let's pretend that you first copy your sources from a source directory to the storage directory.  Then you loopback-mount the storage directory over your original source directory.  Then, you compile there.  (You mount it over your original source directory so that the filenames in the debugging information point to the original sources, instead of to a temporary storage directory.)  This looks, to the compiler, just like a regular compile.  However, instead of doing the compile on NFS, you're doing it on local disk.

This has two advantages.  The one that's easy to explain is that you're storing your objects and products on local disk (which is cheap), instead of on NetApps (which, despite all their greatness, are expensive).  Objects can easily be recreated, and take up most of the space in a build tree.

The other advantage is that you're working entirely on local disk.  This is much, MUCH faster than using NFS.  The slowdown on NFS isn't because of the bulk data transfer (i.e., reads and writes), so higher bandwidth doesn't help.  The problem is with the per-request overhead.  There are a LOT of requests going across the wire in a compile.

Let's consider what happens when you build a file.  Specifically, suppose you're just building hello.o.  Make will look for Makefile, makefile, BSDmakefile, and .depend.  Then it will look at hello.c and hello.o.  (If it needs to, it may look for hello.s, hello.S, hello.f, and whatever other potential sources you may have.)  It'll look up hello.c.gch (if you have precompiled header support) and hello.gcda (if you have coverage support).  Then it has to look for stdio.h.  Well, let's suppose that you have two directories in your -I path; let's call them foo and bar.  Now it needs to check for foo/stdio.h.gch, foo/stdio.h, bar/stdio.h.gch, and bar/stdio.h, all of which are in your source repository (hence usually on NFS), before it can go on to look at /usr/include/stdio.h.  Well, the first thing that stdio.h does (on FreeBSD; your OS may vary) is to include sys/cdefs.h.  You got it... another four lookups in the source repository to find sys/cdefs.h!  Repeat for sys/_null.h, sys/_types.h, and machine/_types.h.  In the end, just to compile hello.o, you've done 39 file lookups into your source directory.  Of those lookups, 21 are for .h files... and hello.c only includes one file, stdio.h!  The average .c file (using the FreeBSD source base as a sample) includes 9 files directly.
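To make the header arithmetic concrete, here's a little Python sketch of the search pattern described above.  It's a toy model, not what the compiler actually runs: the function name header_probes is made up for illustration, and it assumes every header gets exactly one .gch probe and one plain probe per -I directory.  Under that assumption, the five headers named above come to 20 probes in the source repository; the 21 I counted came from watching the real thing, so the compiler evidently probes at least one path that this simple model doesn't capture.

```python
def header_probes(header, include_dirs, suffixes=(".gch", "")):
    """List the paths probed in the -I directories for one header, in search order.

    Assumption: each directory is tried first for a precompiled header
    (.gch) and then for the plain header, as described in the post.
    """
    return [f"{d}/{header}{s}" for d in include_dirs for s in suffixes]


# The two hypothetical -I directories from the example.
include_dirs = ["foo", "bar"]

# stdio.h plus the headers it pulls in on FreeBSD, per the post.
headers = [
    "stdio.h",
    "sys/cdefs.h",
    "sys/_null.h",
    "sys/_types.h",
    "machine/_types.h",
]

# Every probe lands in the source repository before /usr/include is tried.
probes = [p for h in headers for p in header_probes(h, include_dirs)]

for path in header_probes("stdio.h", include_dirs):
    print(path)
print(len(probes), "probes for", len(headers), "headers")
```

The point to notice is that the count grows as (number of headers) x (number of -I directories) x 2, and every one of those is a separate round trip when the directories live on NFS.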

That's a lot of work.  These lookups are all blocking I/O, so it's one round trip each; from watching network traces, I'm actually very impressed with how fast NetApps can respond, but it's still a non-zero time.  Beyond the latency, though, you've got to look at the CPU usage.  NFS uses RPC, and the overhead isn't cheap.  In my testing, using NFS to compile something requires four times as much kernel processing (which works out to twice as much CPU overall) as using local disk!  (That extra work is all in the kernel, too, and that means that preemption isn't as easy, so scheduling takes a little hit... and don't forget the new network interrupts!)

Finally, NFS is designed around concurrent use.  You can't assume that the contents of an NFS-mounted directory will be the same now as they were 90 seconds ago.  While local disk can use metadata caches very, very effectively (and FreeBSD's filesystems do), when you're using NFS, you have to expire caches a lot.  (There are also some things that FreeBSD could do better when it comes to its NFS client, but I fixed those and only got a 5% boost.)

Ok, now you should be convinced that the speed hit for NFS is bad, and that building on local disk is much, much faster.  Let's go back to talking about lcachefs.

Earlier, I asked you to think about copying your sources to local disk, and doing the compile there.  That's exactly what lcachefs does, but with a twist: it does this lazily, by which I mean it won't copy files until they're needed.

When you first mount an lcachefs directory, it scans just the top level of your source directory.  It then creates a stub file in the local storage for each entry.  Each stub is just an empty file with a special uid/gid pair that marks it as a stub.  If you ls -l that file, then the filesystem will report the correct uid/gid, size, link count, etc., but in reality the locally-stored file is empty.

When you actually try to read (or otherwise use) a file or directory that's a stub, then it will copy it over to local storage before the system call returns to userland.  In the case of a file, this means it copies the contents and sets the uid/gid to its real, non-stub values.  In the case of a directory, it scans the directory and populates the stubs just like I described in the previous paragraph.
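If it helps to see the lazy-copy idea in code, here's a userland toy model in Python.  To be clear, this is not the lcachefs code (that lives in the kernel and isn't released yet); it's a sketch under some loud assumptions.  The class name LazyCache and its methods are invented for illustration, and instead of marking stubs with a special uid/gid pair, it just remembers them in a Python set.  It also ignores hard links, locking, and file attributes entirely.

```python
import os
import shutil
import tempfile


class LazyCache:
    """Toy model of lazy copying from a source tree into local storage."""

    def __init__(self, source, storage):
        self.source = source
        self.storage = storage
        # Assumption: a set of relative paths stands in for the special
        # uid/gid pair that marks a stub in the real filesystem.
        self.stubs = set()
        os.makedirs(storage, exist_ok=True)
        self._populate("")  # "mount": scan only the top level

    def _populate(self, rel):
        """Create an empty stub in storage for each entry of one source directory."""
        for name in os.listdir(os.path.join(self.source, rel)):
            child = os.path.join(rel, name)
            local = os.path.join(self.storage, child)
            if os.path.isdir(os.path.join(self.source, child)):
                os.makedirs(local, exist_ok=True)  # empty directory stub
            else:
                open(local, "w").close()  # empty file stub
            self.stubs.add(child)

    def _materialize(self, rel):
        """Turn a stub into the real thing; do nothing if it already is."""
        if rel not in self.stubs:
            return
        self.stubs.discard(rel)
        src = os.path.join(self.source, rel)
        if os.path.isdir(src):
            self._populate(rel)  # directory: create stubs for its entries
        else:
            shutil.copyfile(src, os.path.join(self.storage, rel))  # file: copy contents

    def read(self, rel):
        """Read a file through the cache, copying only what is needed."""
        parts = rel.split(os.sep)
        # Each parent directory has to be populated before the file's stub exists.
        for i in range(1, len(parts) + 1):
            self._materialize(os.path.join(*parts[:i]))
        with open(os.path.join(self.storage, rel), "rb") as f:
            return f.read()


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src")
    sto = os.path.join(tmp, "storage")
    os.makedirs(os.path.join(src, "sub"))
    with open(os.path.join(src, "sub", "hello.c"), "w") as f:
        f.write("int main(void) { return 0; }\n")

    cache = LazyCache(src, sto)
    print("before:", os.listdir(os.path.join(sto, "sub")))
    print("data:", cache.read(os.path.join("sub", "hello.c")))
    print("after:", os.listdir(os.path.join(sto, "sub")))
```

Right after the "mount", the sub directory exists locally but is empty, because only the top level was scanned.  The first read of sub/hello.c populates sub with stubs and then copies just that one file; anything you never touch never crosses the wire.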

That's the basic overview.  The reality is quite complicated because of hard links, the need to watch out for vnode lock order reversal, and other icky stuff like that.  You can see all the gory details, though, once we open-source lcachefs.  I'm currently awaiting clearance from legal, but I hope to either contribute it to FreeBSD, or release it as an independent open-source project.
