Software Preservation at the Computer History Museum http://www.computerhistory.org/

Interview with Sellam Ismail, Curator of Software

This article is a synopsis of several email and phone conversations, occurring in May and June of 2005, between Sellam Ismail, Curator of Software for the Computer History Museum (CHM), and Ellen Walker, DACS Analyst. It has been organized into a series of questions and answers for your reading convenience. The questions are not in any particular order.

1 - What, if anything, does the CHM have to do with Software Archaeology?

Software archaeology involves digging into an existing (often perceived as ancient) code base to recover an understanding of its algorithms, but if the code is trapped on outmoded computer media for which the means of reading are no longer available, the digging must begin at a deeper level, where the code being investigated is buried on old disks, tapes, or even punched cards. These old forms of storage are like a tomb for the programs that lie within, and the first step in gaining access is solving the riddle of how to read them.

The CHM maintains a collection of vintage computers and related hardware. As a historical computer consultant specializing in vintage computing technology, I am often called upon to read programs and data from old computer media that run the gamut from punched cards to paper tape to strange and bizarre tape formats that many people today either have never heard of or barely remember. Each job I get is an interesting challenge, often requiring many hours of investigative work to determine the format of the media and to piece together a functional system capable of recovering the bits. For this, I draw upon my vast collection of vintage computers: over 2,000 machines, the oldest being an original PDP-8 from 1965.

UNIVAC Punched Tape (1960s)

Computer control instructions are contained on punched paper tape from an early UNIVAC plant in Utica, NY. Try to imagine what the debugging process would be for this code. How would one proceed to dig into it? How would the process differ from your current process? Image courtesy of the Department of Information Systems, London School of Economics (see http://is.lse.ac.uk/History/UNIVAC-PunchedTape.htm).

Most recently I was called upon to recover actual archaeological data from a set of VHS tapes at the Mel Fisher Museum in Key West, Florida. In the late 1970s, a company called Alpha Microsystems (one of the first microcomputer companies, and one still in business today) pioneered a system for storing data on standard VHS tapes using ordinary video cassette recorders. At the time, the system was deemed a practical and elegant solution to the problem of backing up entire hard drives, whose capacities were then counted in single megabytes but were growing by leaps and bounds each year (much like the current trend).

When Mel Fisher discovered the Atocha in 1986, a treasure ship that sank off the Florida Keys in 1622, one of his first priorities was to make a proper record of the finds pulled up from the wreck. The sheer magnitude of the motherlode (over 100,000 silver coins were recovered, for example) required a flexible and efficient solution for documenting and cataloguing each artifact. Fisher hired a computer consultant who designed a system to digitally photograph each coin so that a visual record could be made. The photos were taken hundreds at a time and stored on the hard disk of the digital camera station. Once the hard drive was full, the processed photos were backed up to VHS tapes, the hard drive was cleared, and the next batch was processed. The result was over 150 tapes holding tens of thousands of digital photographs. The tapes were then stored away for safekeeping and quietly forgotten.

The Mel Fisher Museum recently re-discovered the tapes (now fifteen years after their creation) and realized they had no way to read them. The equipment to process the tapes had long since vanished. Worse yet, the tapes held the only photographs of the silver coins that were pulled from beneath the ocean. My firm was hired to recover the data from the tapes. After 4 years and thousands of dollars in effort, we were able to track down and assemble the necessary hardware and software to read these tapes and convert the images to a modern graphics format (they were stored in a proprietary 16-level grayscale format). The project involved dozens of hours of searching for interface cards, special VCRs, old software, ancient versions of DOS, and properly antiquated (i.e., slow) PCs to make everything work.
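
The final conversion step of such a recovery, turning raw 16-level (4-bit) grayscale pixels into a modern image format, can be sketched even though the actual Alpha Micro layout is not documented here. In the hypothetical Python sketch below, the frame dimensions, the two-pixels-per-byte packing order, and the file names are illustrative assumptions, not the format the project actually used.

```python
# A hedged sketch: unpack 4-bit (16-level) grayscale pixels and save a PNG.
# The frame size and the "two pixels per byte, high nibble first" packing
# are assumptions for illustration; the real Alpha Micro layout may differ.
from PIL import Image

def convert_16_level_frame(raw: bytes, width: int, height: int) -> Image.Image:
    pixels = bytearray()
    for byte in raw[: (width * height + 1) // 2]:
        # Scale 0..15 up to 0..255 so standard viewers render it sensibly.
        pixels.append((byte >> 4) * 17)    # high nibble: first pixel
        pixels.append((byte & 0x0F) * 17)  # low nibble: second pixel
    return Image.frombytes("L", (width, height), bytes(pixels[: width * height]))

if __name__ == "__main__":
    with open("coin_frame.raw", "rb") as f:      # hypothetical input file
        image = convert_16_level_frame(f.read(), width=512, height=480)
    image.save("coin_frame.png")
```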

Unfortunately, this situation is quite common. Organizations have historically not considered the ramifications of obsolescence in computer media. Before anyone realizes it, media holding perhaps thousands of man-hours of computer code and data can be put at risk when the last unit of a particular Zip drive leaves the assembly line. The issue has plagued government and private sector entities for decades. Computer technology advances so quickly that computer media can become outmoded suddenly and without warning. An organization that does not have a proper plan for the obsolescence of its data stores will one day face the same problem.

Through its acquisition of outdated hardware and software, the CHM provides a link to computer technologies of the past. The scope of usage of our computer artifacts, including software, can be whatever we, the community of software archaeologists, want it to be.

2 - What is the Museum currently doing regarding the collection and cataloguing of software?

In the past, the Museum tended to collect software as an afterthought. Software would usually come in as part of a hardware donation. As such, not much discretion was used in determining what software artifacts the Museum should be accepting, and as a result we ended up with a lot of incredible artifacts (such as the operating system and programs for the MIT Whirlwind) as well as a lot of rubbish, such as entirely uninteresting driver disks and random media with unknown stuff on it. Some of this "unknown stuff" may yet prove to be very historically significant, but without the proper context to go along with it, we will have to do a lot of investigating to separate the wheat from the chaff.

Fortunately, the Museum recognized this deficit in properly collecting software and created the position of Software Curator. I was tapped to fill the position and have since been establishing collecting guidelines for software as well as building out the infrastructure for properly maintaining the software collection, including physical storage, cataloguing, and access.

To that end, I started by organizing the Museum's existing holdings. We have two rooms devoted exclusively to software. I've set up the necessary shelving and have arranged the various software artifacts in a structured manner to make the cataloguing process more efficient. I added the proper facilities for a Software Collection Catalog to the Museum's database, defining the fields and developing a data dictionary to use as a reference for populating each record. I also developed a Software Collection Taxonomy, which ultimately serves the purpose of assisting researchers in finding the type of software they are looking for. By using the taxonomy to categorize software, researchers can, for instance, request all titles in our collection that have something to do with spreadsheets, and from there they can perform a more refined search.
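
To illustrate how such a catalog and taxonomy could work together in practice, here is a rough sketch using an in-memory SQLite table; the field names, taxonomy codes, and sample record below are hypothetical and are not the Museum's actual schema.

```python
# A hedged sketch of a catalog record plus a taxonomy search. Field names,
# taxonomy codes, and the sample record are illustrative assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE software_artifact (
        id            INTEGER PRIMARY KEY,
        title         TEXT NOT NULL,
        publisher     TEXT,
        year          INTEGER,
        media_type    TEXT,     -- e.g. 'punched cards', '5.25-inch floppy'
        taxonomy_code TEXT,     -- category drawn from the collection taxonomy
        embargoed     INTEGER DEFAULT 0
    );
""")
conn.execute(
    "INSERT INTO software_artifact (title, publisher, year, media_type, taxonomy_code) "
    "VALUES (?, ?, ?, ?, ?)",
    ("VisiCalc", "Personal Software", 1979, "5.25-inch floppy", "application/spreadsheet"),
)

# A researcher asking for everything to do with spreadsheets:
for title, year in conn.execute(
    "SELECT title, year FROM software_artifact "
    "WHERE taxonomy_code LIKE 'application/spreadsheet%'"
):
    print(title, year)
```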

We've begun the cataloguing process and plan to have it completed by the end of the summer. We have over a thousand packaged software titles (e.g., commercial software), and thousands of other artifacts, including software on common media such as punched cards, paper tape, magnetic tape, and floppy disk, and on more exotic media such as magnetic film and program rods (steel rods with holes drilled into them to represent bits). Our software collection also includes source listings on paper. We have software going back to the early 1950s and as recent as the last several years.

3 - Can you talk about your plans for preserving the software, and the reality of where you are with it right now?

As for preservation of the software (i.e. the code itself), this is a major undertaking and we are currently studying the issues.

In the meantime, I am formulating a plan for the creation of a "transcoding" lab at the Museum. We've settled on the term "transcoding" to describe the process of extracting information from one medium and storing it on another, in this case a centralized server where all the bits can be conveniently managed. The lab will contain all the hardware needed to read the various media in our software collection, and we will begin methodically transcoding the media in our software archive as soon as the lab is ready for action. This cannot come soon enough, as we are in somewhat of a race against time with many of the artifacts (i.e., the physical media) in the software collection. Some are disks and tapes that are at or well beyond their theoretical lifespan (though I should add that, in practice, we find magnetic media to be more durable than once predicted), and some cannot currently be read by conventional means.
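
A minimal sketch of what one transcoding step might look like is shown below, assuming the bits have already been read off the original medium into a raw image file; the directory layout, sidecar fields, and function names are invented for illustration, not the lab's actual tooling.

```python
# Hedged sketch of a "transcoding" step: copy a raw media image onto a
# central store alongside enough metadata to manage it later. Paths, field
# names, and the JSON sidecar layout are illustrative assumptions.
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path

def transcode(image_path: Path, archive_root: Path, media_type: str) -> Path:
    digest = hashlib.sha256(image_path.read_bytes()).hexdigest()
    dest = archive_root / digest[:2] / f"{digest}.img"
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(image_path, dest)
    sidecar = {
        "original_file": image_path.name,
        "media_type": media_type,            # e.g. "8-inch floppy", "9-track tape"
        "sha256": digest,
        "captured": datetime.now(timezone.utc).isoformat(),
    }
    dest.with_suffix(".json").write_text(json.dumps(sidecar, indent=2))
    return dest

# Example (hypothetical file names):
# transcode(Path("whirlwind_tape_004.img"), Path("/archive"), "magnetic tape")
```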

4 - How does the storage media impact the preservation of your software?

We must keep in mind that no one truly knows how long magnetic media will really last, just as we don't truly know how long CD or DVD media will last, since the technology is still too new for real-world data to exist. CD media is thought to have a lifespan of 100 years, but those estimates are based on accelerated testing, and they certainly don't apply to the cheap commodity CD-R media you buy off the shelf today (which can last anywhere from years to seconds). Floppy disks were thought to have a lifespan of 15-20 years, but I am finding that disks even 30 years old still read just fine, while 3.5" disks manufactured in the late 1990s die in the time it takes me to copy a file onto one of them and walk over to the PC I'm trying to transfer that file to. A big factor is the quality of the manufacturing process of the media. Having studied this issue, I always recommend that people research the media they are buying to store data for the long term, as they may otherwise have a rude awakening in the not too distant future.

I have initiated a project (outside of the CHM) called FutureKeep to develop a universal media imaging format so that the data on media of any type can be described and encoded digitally in a manner that will allow the original data image to be reconstructed at a future date if the need or desire ever arises. It is also being designed to serve as a universal image format for simulators. The specification for this format takes into account the quickly advancing nature of computing and storage technology and the fact that the media of today will be outmoded and difficult to source in mere months or years, or at most decades. The intent is to create archives that will (hopefully) withstand the test of time and be readable and usable centuries from now.

At the CHM we will record the images and store them on hard drives on our server in an ad hoc format for now, but we will eventually want to encode everything in a uniform, structured image format.
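
To give a feel for what a uniform, structured image format might capture, here is a hedged sketch of a self-describing media image record in that spirit; every field name and value is an illustrative assumption and is not drawn from the FutureKeep specification.

```python
# Hedged sketch of a self-describing media image record: enough physical
# description to interpret or rebuild the original layout later. All field
# names here are assumptions, not the FutureKeep specification.
from dataclasses import dataclass, asdict
import json

@dataclass
class MediaImage:
    medium: str            # e.g. "5.25-inch floppy, double density"
    encoding: str          # e.g. "MFM", "FM", "NRZI"
    geometry: dict         # tracks, sides, sectors, sector size, etc.
    payload_sha256: str    # checksum of the raw bitstream stored alongside
    notes: str = ""

record = MediaImage(
    medium="5.25-inch floppy, double density",
    encoding="MFM",
    geometry={"tracks": 40, "sides": 2, "sectors_per_track": 9, "sector_bytes": 512},
    payload_sha256="<sha256 of the raw image>",
    notes="Ad hoc capture; to be migrated to a uniform format later.",
)
print(json.dumps(asdict(record), indent=2))
```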

5 - How are you addressing copyright issues and the proprietary nature of items in your collection?

We have taken the first steps towards developing an access policy for our potential audience of researchers and hobbyists. This policy will take into account the fact that some of our software artifacts are considered sensitive or are proprietary and could be used for competitive advantage. The issue of copyrights is a big can of worms wriggling and writhing about, waiting for someone to come along and open it up and uncover the icky sliminess contained within. We, of course, have to be very sensitive about these issues and, to that end, we have fields in our database that flag certain artifacts as being under embargo or not for general release.

The proprietary nature of some of our artifacts is a relatively easy issue to deal with, but the copyright issues are a far greater concern and, given the direction these issues are currently heading, we may well have to keep some of our digital software artifacts locked down for a good long time to come, which we feel would be a loss to society. In fact, the way copyright law in the United States currently stands, we may even be breaking the law if we circumvent copy protection mechanisms on old, obsolete, and no-longer-published software in order to archive it. This is an issue on which the Museum has commented to the US Copyright Office. We are hoping to secure a permanent exemption from the Digital Millennium Copyright Act (DMCA) for the Museum and similar institutions so that preservation efforts are not put at legal risk. This is an unfortunate example of where the provisions of the DMCA clearly fall short of what they were trying to accomplish, and it indicates a lack of foresight by the drafters of that law.

6 - How do you decide what software artifacts to keep?

One of the more important tasks I've completed since joining the staff of the Museum is developing a Software Selection Criteria document to guide the collections department in deciding which software donation offers to accept or decline. As I mentioned previously, most software typically comes in as part of a physical artifact donation and would automatically be moved to the software room and placed on a shelf. Now, all software is vetted against the Software Selection Criteria to ensure it has certain historic characteristics that make it a good candidate for long-term preservation. Proper historical preservation for artifacts of any kind is an intensive task that requires lots of resources, both human and otherwise, so the intent of the Software Selection Criteria is to make sure the software we are accepting into the collection is worthy of that expenditure. The criteria address the various kinds of purpose or value that a software artifact might offer the Museum, such as the obvious historical merit, or assisting in curatorial efforts (i.e., something useful for completing a collection or including in exhibits), and so on.

Software Selection Criteria

Must meet one or more, preferably two, of the following conditions:

1. Sold a significant number of copies or had a large install base.
2. Serves to demonstrate a significant and colossal failure.
3. Introduced a new paradigm or product family, or launched a new industry.
4. Developed using a new and significant software development methodology.
5. Supports existing Museum artifacts.
6. The underlying code itself has qualities of merit worth preserving.
7. The software was utilized in something of historical import.
8. Sufficiently antiquated, i.e. 1960s and earlier.

7 - How is hardware preservation different from software preservation? To what extent does software preservation depend on hardware preservation? Can the two be separated?

The main differences are the size of the artifacts and the resources required to manage them. Hardware, especially older computers from the Paleolithic era of computing (i.e., the 1950s), requires lots of space and therefore funding to acquire and store. Software, on the other hand, takes up much less space, and once the bits are safely rendered in a digital format, the original media could theoretically be tossed. For the time being, though, the Museum's policy is to retain the original media as the ultimate archival medium, at least until it proves to be utterly useless for holding data (e.g., magnetic tapes whose magnetic coating is flaking off the base substrate). Even then, of course, the original media may retain some value as cultural artifacts in their own right, a consideration that has been especially promoted by the Smithsonian Institution, for example.

As far as hardware and software as individual artifacts go, each cannot be properly understood without the other. True, you can successfully display a computer that is just sitting there powered down, being nothing other than a hulk of metal and plastic, but it is not really telling the whole story. On the other hand, how does one present software without actually executing it on its native hardware, or at the very least under simulation? It's an interesting conundrum, especially when one considers the effort required to resurrect a decades-old computer system. The Museum has so far restored an IBM 1620 and a DEC PDP-1 to working order, and an IBM 1401 is currently undergoing restoration. However, without software, the effort that went into the restoration of these machines would be pointless. It is only when you sprinkle in the magic of software that the machine does anything particularly interesting or useful. Software is indeed the soul of the computer.

With regard to preservation, hardware is needed to preserve software only to the extent that the bits must be recovered from the original media before the media lose their ability to retain information. And while it's nice to be able to run software on its native hardware, it is not always practical to do so. Machine restorations take lots of time and effort by highly skilled individuals, not to mention a significant amount of money (the IBM 1401 restoration has so far consumed several thousand dollars, the majority of which was kindly donated by Museum supporters). On the other hand, once we have the software in a modern digital format, we can always utilize simulators to recreate the effective feel of operating historic software. Some simulators try to recreate the full experience, with the recorded sounds of whirring disk or tape drives and teletypes pecking away, and while it's a far cry from the experience of running Spacewar! on an actual PDP-1 and watching the front panel lights blink away, it's better than having no experience at all. I imagine at some point in the future we'll be able to synthesize a more realistic experience, something like the imaginary Holodeck in the Star Trek series, but for now we have to rely mostly on simulators for historic computers from the 1950s and 1960s. Fortunately, we can still enjoy the wealth of minicomputer and microcomputer software on native platforms, since these classes of machines from the early 1970s onwards are relatively easy to set up and maintain in working condition. However, they too will someday be impractical to run, if only because their mechanical parts wear out, components die, or bits in ROM fade away.

8 - How is the software that you have preserved currently being used and by whom (what types of patrons)?

Currently, we don't get many requests from outside our organization to access the software, but of the few requests we do get, most are from attorneys seeking specific titles as prior art for patent infringement lawsuits. We do, however, utilize assets in our software archive internally to support exhibits we set up and to give our volunteer restoration teams something to run on the machines they resurrect. On request, we provide researchers access to the physical media itself and assist them in actually retrieving or executing the software on suitable hardware or, if that's not possible or practical, on simulators.

Have something to donate to CHM?
Complete the web form at
http://www.computerhistory.org/collections/donateArtifact/
Or
Call the Museum at 650/810-1010

9 - What is your plan for preventing your software holdings from becoming obsolete as technology whizzes by? What is the cost that you incur to record the history of computing?

We are establishing tools, procedures, and guidelines that we hope take into account all the germane parameters, so that our efforts remain relevant for some time to come. Of course, these tools, procedures, and guidelines will require periodic revamping to keep up with changing technology and evolving methods of software storage and preservation, but we believe we have started with a very good foundation that should carry us through at least the next five years and give us time to evaluate and plan for the next step.

October 2005
Vol. 8, Number 3

Software Archaeology