The Source History of Cat

By Sinclair Target, on 12 November 2018

I once had a debate with members of my extended family about whether a computer science degree is a degree worth pursuing. I was in college at the time and trying to decide whether I should major in computer science. My aunt and a cousin of mine believed that I shouldn't. They conceded that knowing how to program is of course a useful and lucrative thing, but they argued that the field of computer science advances so quickly that everything I learned would almost immediately be outdated. Better to pick up programming on the side and instead major in a field like economics or physics where the basic principles would be applicable throughout my lifetime.

I knew that my aunt and cousin were wrong and decided to major in computer science. (Sorry, aunt and cousin!) It is easy to see why the average person might believe that a field like computer science, or a profession like software engineering, completely reinvents itself every few years. We had personal computers, then the web, then phones, then machine learning... technology is always changing, so surely all the underlying principles and techniques change too. Of course, the amazing thing is how little actually changes. Most people, I'm sure, would be stunned to know just how old some of the important software on their computer really is. I'm not talking about flashy application software, admittedly—my copy of Firefox, the program I probably use the most on my computer, is not even two weeks old. But, if you pull up the manual page for something like grep, you will see that it has not been updated since 2010 (at least on MacOS). And the original version of grep was written in 1974, which in the computing world was back when dinosaurs roamed Silicon Valley. People (and programs) still depend on grep every day.

My aunt and cousin thought of computer technology as a series of increasingly elaborate sand castles supplanting one another after each high tide clears the beach. The reality, at least in many areas, is that we steadily accumulate programs that have solved problems. We might have to occasionally modify these programs to avoid software rot, but otherwise they can be left alone. grep is a simple program that solves a still-relevant problem, so it survives. Most application programming is done at a very high level, atop a pyramid of much older code solving much older problems. The ideas and concepts of 30 or 40 years ago, far from being obsolete today, have in many cases been embodied in software that you can still find installed on your laptop.

I thought it would be interesting to take a look at one such old program and see how much it had changed since it was first written. cat is maybe the simplest of all the Unix utilities, so I'm going to use it as my example. Ken Thompson wrote the original implementation of cat in 1969. If I were to tell somebody that I have a program on my computer from 1969, would that be accurate? How much has cat really evolved over the decades? How old is the software on our computers?

Thanks to repositories like this one, we can see exactly how cat has evolved since 1969. I'm going to focus on implementations of cat that are ancestors of the implementation I have on my Macbook. You will see, as we trace cat from the first versions of Unix down to the cat in MacOS today, that the program has been rewritten more times than you might expect—but it ultimately works more or less the same way it did fifty years ago.

Research Unix

Ken Thompson and Dennis Ritchie began writing Unix on a PDP 7. This was in 1969, before C, so all of the early Unix software was written in PDP 7 assembly. The exact flavor of assembly they used was unique to Unix, since Ken Thompson wrote his own assembler that added some features on top of the assembler provided by DEC, the PDP 7's manufacturer. Thompson's changes are all documented in the original Unix Programmer's Manual under the entry for as, the assembler.

The first implementation of cat is thus in PDP 7 assembly. I've added comments that try to explain what each instruction is doing, but the program is still difficult to follow unless you understand some of the extensions Thompson made while writing his assembler. There are two important ones. First, the ; character can be used to separate multiple statements on the same line. It appears that this was used most often to put system call arguments on the same line as the sys instruction. Second, Thompson added support for "temporary labels" using the digits 0 through 9. These are labels that can be reused throughout a program, thus being, according to the Unix Programmer's Manual, "less taxing both on the imagination of the programmer and on the symbol space of the assembler." From any given instruction, you can refer to the next or most recent temporary label n using nf and nb respectively. For example, if you have some code in a block labeled 1:, you can jump back to that block from further down by using the instruction jmp 1b. (But you cannot jump forward to that block from above without using jmp 1f instead.)

The most interesting thing about this first version of cat is that it contains two names we should recognize. There is a block of instructions labeled getc and a block of instructions labeled putc, demonstrating that these names are older than the C standard library. The first version of cat actually contained implementations of both functions. The implementations buffered input so that reads and writes were not done a character at a time.

The first version of cat did not last long. Ken Thompson and Dennis Ritchie were able to persuade Bell Labs to buy them a PDP 11 so that they could continue to expand and improve Unix. The PDP 11 had a different instruction set, so cat had to be rewritten. I've marked up this second version of cat with comments as well. It uses new assembler mnemonics for the new instruction set and takes advantage of the PDP 11's various addressing modes. (If you are confused by the parentheses and dollar signs in the source code, those are used to indicate different addressing modes.) But it also leverages the ; character and temporary labels just like the first version of cat, meaning that these features must have been retained when as was adapted for the PDP 11.

The second version of cat is significantly simpler than the first. It is also more "Unix-y" in that it doesn't just expect a list of filename arguments—it will, when given no arguments, read from stdin, which is what cat still does today. You can also give this version of cat an argument of - to indicate that it should read from stdin.

In 1973, in preparation for the release of the Fourth Edition of Unix, much of Unix was rewritten in C. But cat does not seem to have been rewritten in C until a while after that. The first C implementation of cat only shows up in the Seventh Edition of Unix. This implementation is really fun to look through because it is so simple. Of all the implementations to follow, this one most resembles the idealized cat used as a pedagogic demonstration in K&R C. The heart of the program is the classic two-liner:

while ((c = getc(fi)) != EOF)
    putchar(c);

There is of course quite a bit more code than that, but the extra code is mostly there to ensure that you aren't reading and writing to the same file. The other interesting thing to note is that this implementation of cat only recognized one flag, -u. The -u flag could be used to avoid buffering input and output, which cat would otherwise do in blocks of 512 bytes.

BSD

After the Seventh Edition, Unix spawned all sorts of derivatives and offshoots. MacOS is built on top of Darwin, which in turn is derived from the Berkeley Software Distribution (BSD), so BSD is the Unix offshoot we are most interested in. BSD was originally just a collection of useful programs and add-ons for Unix, but it eventually became a complete operating system. BSD seems to have relied on the original cat implementation up until the fourth BSD release, known as 4BSD, when support was added for a whole slew of new flags. The 4BSD implementation of cat is clearly derived from the original implementation, though it adds a new function to implement the behavior triggered by the new flags. The naming conventions already used in the file were adhered to—the fflg variable, used to mark whether input was being read from stdin or a file, was joined by nflg, bflg, vflg, sflg, eflg, and tflg, all there to record whether or not each new flag was supplied in the invocation of the program. These were the last command-line flags added to cat; the man page for cat today lists these flags and no others, at least on Mac OS. 4BSD was released in 1980, so this set of flags is 38 years old.

cat would be entirely rewritten a final time for BSD Net/2, which was, among other things, an attempt to avoid licensing issues by replacing all AT&T Unix-derived code with new code. BSD Net/2 was released in 1991. This final rewrite of cat was done by Kevin Fall, who graduated from Berkeley in 1988 and spent the next year working as a staff member at the Computer Systems Research Group (CSRG). Fall told me that a list of Unix utilities still implemented using AT&T code was put up on a wall at CSRG and staff were told to pick the utilities they wanted to reimplement. Fall picked cat and mknod. The cat implementation bundled with MacOS today is built from a source file that still bears his name at the very top. His version of cat, even though it is a relatively trivial program, is today used by millions.

Fall's original implementation of cat is much longer than anything we have seen so far. Other than support for a -? help flag, it adds nothing in the way of new functionality. Conceptually, it is very similar to the 4BSD implementation. It is only longer because Fall separates the implementation into a "raw" mode and a "cooked" mode. The "raw" mode is cat classic; it prints a file character for character. The "cooked" mode is cat with all the 4BSD command-line options. The distinction makes sense but it also pads out the implementation so that it seems more complex at first glance than it actually is. There is also a fancy error handling function at the end of the file that further adds to its length.

MacOS

In 2001, Apple launched Mac OS X. The launch was an important one for Apple, because Apple had spent many years trying and failing to replace its existing operating system (classic Mac OS), which had long been showing its age. There were two previous attempts to create a new operating system internally, but both went nowhere; in the end, Apple bought NeXT, Steve Jobs' company, which had developed an operating system and object-oriented programming framework called NeXTSTEP. Apple took NeXTSTEP and used it as a basis for Mac OS X. NeXTSTEP was in part built on BSD, so using NeXTSTEP as a starting point for Mac OS X brought BSD-derived code right into the center of the Apple universe.

The very first release of Mac OS X thus includes an implementation of cat pulled from the NetBSD project. NetBSD, which remains in development today, began as a fork of 386BSD, which in turn was based directly on BSD Net/2. So the first Mac OS X implementation of cat is Kevin Fall's cat. The only thing that had changed over the intervening decade was that Fall's error-handling function err() was removed and the err() function made available by err.h was used in its place. err.h is a BSD extension to the C standard library.

The NetBSD implementation of cat was later swapped out for FreeBSD's implementation of cat. According to Wikipedia, Apple began using FreeBSD instead of NetBSD in Mac OS X 10.3 (Panther). But the Mac OS X implementation of cat, according to Apple's own open source releases, was not replaced until Mac OS X 10.5 (Leopard) was released in 2007. The FreeBSD implementation that Apple swapped in for the Leopard release is the same implementation on Apple computers today. As of 2018, the implementation has not been updated or changed at all since 2007.

So the Mac OS cat is old. As it happens, it is actually two years older than its 2007 appearance in MacOS X would suggest. This 2005 change, which is visible in FreeBSD's Github mirror, was the last change made to FreeBSD's cat before Apple pulled it into Mac OS X. So the Mac OS X cat implementation, which has not been kept in sync with FreeBSD's cat implementation, is officially 13 years old. There's a larger debate to be had about how much software can change before it really counts as the same software; in this case, the source file has not changed at all since 2005.

The cat implementation used by Mac OS today is not that different from the implementation that Fall wrote for the 1991 BSD Net/2 release. The biggest difference is that a whole new function was added to provide Unix domain socket support. At some point, a FreeBSD developer also seems to have decided that Fall's raw_args() function and cook_args() should be combined into a single function called scanfiles(). Otherwise, the heart of the program is still Fall's code.

I asked Fall how he felt about having written the cat implementation now used by millions of Apple users, either directly or indirectly through some program that relies on cat being present. Fall, who is now a consultant and a co-author of the most recent editions of TCP/IP Illustrated, says that he is surprised when people get such a thrill out of learning about his work on cat. Fall has had a long career in computing and has worked on many high-profile projects, but it seems that many people still get most excited about the six months of work he put into rewriting cat in 1989.

The Hundred-Year-Old Program

In the grand scheme of things, computers are not an old invention. We're used to hundred-year-old photographs or even hundred-year-old camera footage. But computer programs are in a different category—they're high-tech and new. At least, they are now. As the computing industry matures, will we someday find ourselves using programs that approach the hundred-year-old mark?

Computer hardware will presumably change enough that we won't be able to take an executable compiled today and run it on hardware a century from now. Perhaps advances in programming language design will also mean that nobody will understand C in the future and cat will have long since been rewritten in another language. (Though C has already been around for fifty years, and it doesn't look like it is about to be replaced any time soon.) But barring all that, why not just keep using the cat we have forever?

I think the history of cat shows that some ideas in computer science are in fact very durable. Indeed, with cat, both the idea and the program itself are old. It may not be accurate to say that the cat on my computer is from 1969. But I could make a case for saying that the cat on my computer is from 1989, when Fall wrote his implementation of cat. Lots of other software is just as ancient. So maybe we shouldn't think of computer science and software development primarily as fields that disrupt the status quo and invent new things. Our computer systems are built out of historical artifacts. At some point, we may all spend more time trying to understand and maintain those historical artifacts than we spend writing new code.


originally posted at two bit history under CC BY-SA 4.0 by Sinclair Target

Made with ♥ by a group of nerds on Earth!