Reverse Engineering and Software Archaeology

By Ralph E. Johnson - University of Illinois at Urbana-Champaign

I often study software systems written by someone else. When I was an undergraduate, I learned compilers by reading some books and a couple of Pascal compilers. When I came to the University of Illinois as a young professor in 1985, I wanted to learn how object-oriented programming changed the way people developed software, especially how it enabled the development of reusable software, so I studied a variety of frameworks. I developed some with my students, but I learned just as much studying ones that other people developed. For the last ten years, my focus has been as much on software patterns as on frameworks, and I have continued to study other systems. I can learn several systems in the time it would take to build one. Since my job is to learn about software rather than to produce it, I spend as much time studying existing systems as developing new ones.

Although there are many differences between my life as an academic and the life of a typical software developer, one similarity is the need to understand someone else's software. Most programmers spend most of their time working with software written by someone else. Even when they think they are starting a new project, they usually end up spending a lot of time trying to figure out one of the packages that they are using. Documentation is rarely good enough, so they will usually have to learn the design by experimentation. If they are using open source software then they can learn the design by reading the code, though this is making a virtue of necessity, since open source software usually doesn't have much documentation.

Studying software is serious business and should be treated seriously. Unfortunately, many developers think of it as a necessary evil. They are paid to write new software, and they consider time spent reading old software as wasted. However, old software is valuable. It is valuable to users and it is valuable to software developers. Like anything of value, we need to understand it so we can take care of it and make use of it. As software lasts longer, the ability to learn an existing software system becomes more important. Learning the design of an existing system is much more than a necessary evil; it is crucial to modern software development. Learning a software system can be hard work, but sometimes it is rewarded with flashes of insight and moments of beauty.

One of the signs that old software has become important is that we have invented language to talk about it. We talk about a "legacy" project, which deals with software that we inherited from someone else, in contrast to a "green field" project, in which we get to choose a new architecture and make the basic design decisions. We have to "reverse engineer" software with inadequate documentation. We have to "reengineer" software when needed changes are too large for a simple patch. People use terms like "software archaeology" to emphasize either that there are no documents or experts who can explain the system, or that the system is very old. But software archaeology is different from extreme reverse engineering. This article will show the difference between reverse engineering and software archaeology and explain how both of them are important to software developers, but in different ways.

Reverse Engineering

Software archaeology is closely related to reverse engineering and reengineering, but it is not the same. Wikipedia gives a popular definition of reverse engineering when it says that reverse engineering is figuring out what a system does without source code. But the IEEE Technical Council on Software Engineering (http://tcse.org/revengr) gives a broader definition that is more often used by textbook authors, which is that reverse engineering is any process of analyzing a system and creating a representation at a higher level[1]. While this includes converting a binary to C, it also includes developing a UML class diagram for a Java program, making a flow chart for a Fortran program, or documenting the interfaces of a package.
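
As a small illustration of the last case, documenting the interfaces of a package, here is a minimal Python sketch. It assumes the package can be imported, and it uses only the standard inspect module. The function name summarize_interface is hypothetical and is not taken from any tool mentioned in this article. The standard json module stands in for "a package written by someone else".

```python
import inspect
import json


def summarize_interface(module):
    """Return {public_name: signature_string} for a module's functions and classes.

    Hypothetical helper for illustration only: it raises the level of
    abstraction from code to an interface summary, nothing more.
    """
    summary = {}
    for name, obj in inspect.getmembers(module):
        if name.startswith("_"):
            continue  # skip private and dunder names
        if inspect.isfunction(obj) or inspect.isclass(obj):
            try:
                summary[name] = str(inspect.signature(obj))
            except (TypeError, ValueError):
                # Some callables do not expose a signature.
                summary[name] = "(signature unavailable)"
    return summary


if __name__ == "__main__":
    for name, signature in sorted(summarize_interface(json).items()):
        print(f"{name}{signature}")
```

A listing like this is a higher-level representation than the code itself, so it fits the broader definition, but it records only what the interface is, not why it was designed that way.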

Reengineering

Reengineering is "the examination and alteration of a subject system to reconstitute it in a new form and the subsequent implementation of the new form" [1]. Reengineering almost always requires reverse engineering. This is partly because few systems are documented as well as they should be, and partly because a system is usually reengineered when something was not taken into account during its design; had that concern been documented, it would have been taken into account.

Reverse engineering does not imply that there is no documentation. It implies that there is not enough of the right kind of documentation. Object-Oriented Reengineering Patterns [2] by Demeyer, Ducasse, and Nierstrasz is an outstanding book on software reverse engineering and reengineering. One of its main points is to use interviews with users and maintainers when possible and to read available documentation. Reverse engineering is needed not only when you have a binary written in an unknown language; it is also important when you are trying to make a change to a large system that is different from any previous changes. Reverse engineering is sometimes needed even for systems with an active development community and thousands of pages of documentation.

Software Archaeology

"Software archaeology" conjures up images of people poring over documents in a forgotten language, looking at broken artifacts whose purpose is unknown. But. software archaeology is more than looking at old code. Most programmers add features to old code every day. Age is not always related to obscurity. Some systems built last year are more obscure and hard to understand than many old systems. Most current programmers were not even born when Unix was first released to the public, yet a lot of the code in modern Unix kernels dates to that time and Unix Relase 6 is easily understood by modern Unix kernel programmers.

Software archaeology is a point of view, not a set of technologies. The tools and techniques used for software archaeology are the same as those used for reverse engineering. But while reverse engineering has an immediate purpose, software archaeology has a much longer-range purpose. We reverse engineer a system because we want to improve or replace it. Maybe we just need to improve the documentation to improve maintenance costs. Maybe we want to change a part of the system and do not know it well enough to plan the change. Maybe we want to make a new system but want to make sure that it is as good as the old so we need a specification of the old system. All these are examples of reverse engineering. Reverse engineering is a means to an end. Eventually we will stop it and move on to "more important things".

Archaeologists unearth old civilizations because they are curious about how people lived and they want to better understand how we became the people we are. They are not looking to solve current problems like overpopulation, though books like Collapse [3] show that archaeology can tell us a lot about these problems. Archaeology is general and long-term, and is definitely not a means to an end.

Similarly, software archaeologists are trying to learn about software, not to solve an immediate problem. We believe that better understanding of software will lead to better software. We don't study existing systems to blindly copy them, but to learn from them. We study systems to learn patterns, to learn different architectural styles, to see principles and techniques in action.

One of the best examples of software archaeology is Grady Booch's project to write a handbook for software architects (http://www.booch.com/architecture). The core of this project is studying a hundred important software systems, such as Photoshop, the Google search engine, AWACS, and 5ESS. The book will have a few pages about each project, describing the key architectural features of each. It will also have a collection of architectural patterns that Grady Booch expects to discover along the way. This is a very challenging project, but one that will make a tremendous impact if it is successful.

A more mundane example is a recent project of mine to learn more about security. I decided to study a secure system to see why it was so secure. So, Munawar Hafiz and I looked at qmail, a mail transport agent that is a competitor to sendmail. Since its introduction in 1997, qmail has had a perfect security record. We compared its design to that of sendmail and found a number of differences that led to better security. The designs and our analysis of what makes qmail so secure are reported in Munawar Hafiz's MS thesis [5].

Since qmail is a system that is heavily used and that is probably understood by many people, you might argue that it was not a case of software archaeology. However, you'd be wrong. Like most open source software, qmail doesn't have much documentation. The only documentation that comes with the system is man pages, which describe the interfaces to the components of qmail. We learned a little from web pages written by others and very little directly from the author of qmail; we learned the most about the design by reading the code. It was a classic case of reverse engineering, but we didn't do it to add a feature or fix a bug. We reverse engineered qmail so we could learn why it was so secure and learn lessons we could apply to other systems.

Why Reverse Engineering Is Important

All programmers need to know how to reverse engineer a software system. They need those skills when they join an ongoing project, or when they want to learn to use a new component. Software engineering texts tend to treat reverse engineering as a minor issue, something that only a few people need to know. In reality, reverse engineering is a crucial skill, probably more important than being able to design a new system. Maintenance programming is all about figuring out what a system does and how to change it. New programmers are often assigned to maintenance until they prove themselves. So, if they can't figure out what an existing system does, they will never be given the chance to design a new system. Most new systems are built from reusable components, so even if a programmer is designing a new system, it is important to be able to study the components and to infer how to use them. On large projects, crucial design decisions are usually made by a small group, and most of the team spends more time figuring out the existing system than designing new features. Thus, many developers are more likely to use reverse engineering skills than design skills.

During the last few decades, maintenance and software reuse have both increased. However, the quality of software documentation has not improved much, so the need to reverse engineer software has increased. It is good that we have more software that is worth maintaining. The cost of building good software that is worth keeping is that we have to maintain it. The amount of maintenance will probably go up. Unless there is a breakthrough in the quality of documentation, this means that reverse engineering will be even more important in the future.

Why Software Archaeology Is Important

Software archaeology is less important than reverse engineering. Reverse engineering has immediate benefits while software archaeology has longer-term benefits. Nevertheless, the benefits of software archaeology are real and can be large.

One of the main reasons I study systems is to discover patterns. Patterns are always discovered by looking at existing systems. Often, these systems are not learned for the purpose of discovering patterns. Instead, the discovery of patterns is a side-effect of learning how to use a library or change a system. For example, the four of us who wrote Design Patterns [4] had all been interested in framework design and had studied many frameworks. We were not studying the frameworks to learn patterns, but we had learned them anyway. However, most software developers will not work on more than one or two systems per year, and many developers will work on a system for several years before moving to another. Moreover, these systems might only use well-known patterns. To find new patterns, you have to look in new places. This is why I studied qmail. None of the systems that I knew were very secure. If I wanted to learn about security then I would have to look for other systems.

Another reason to study a system is to confirm a pattern or to learn more about it. When I'm trying to write a pattern, people often tell me about other systems that have the same pattern, and I'll look at them to see whether it really is the same pattern and to learn variations and complications. Design is about trade-offs, and every pattern has its dark side. I try to find a system where a pattern is abused so that I can learn when not to use it.

There are many other reasons to study existing systems. You can learn about a class of programs by reading a few examples, as I did with compilers. You can learn an architectural style or how to achieve a particular software quality, like reliability or portability. The best way to fight a patent that you think is improper is to find prior art, and that often requires digging through old software. If you want to understand how software evolves, you not only have to look at old software, you have to look at several versions. Practicing software archaeology is a great way to learn new design ideas. It is not for everybody, since many software developers just want to focus on getting the job done. But it is important for software developers who want to get better at doing their job.

Lessons Learned

So far, I have been fairly philosophical and have focused on definitions rather than tools and techniques. I'd like to leave you with two concrete suggestions that follow from what I've said. The first is how people can learn to read software. The second is how to make software easier to read.

There are few courses on how to study software, and most schools do not teach it. The University of Illinois (where I teach) is no exception. Several of us try to teach code reading as a side-effect of our courses by giving students software packages that they are required to understand and to change. But we do not spend much time explaining how to approach a large system, and other professors might teach the course differently.

Some of our graduates are experts at learning large software systems. These are usually those students who have worked on several open source software projects. They have learned several large systems and know that they can do it again. But these students almost always picked up their skills outside of class.

Therefore, if you want to make sure that developers know how to learn software systems, you can't depend on their college education. I recommend two books on the subject of studying software: "Code Reading: The Open Source Perspective" by Diomidis Spinellis [6] and "Object-Oriented Reengineering Patterns" by Demeyer, Ducasse, and Nierstrasz [2]. The first book is more code-oriented and focuses on Unix and C. The second is focused on larger systems, object-oriented systems, and on reengineering a system once you have learned it. People with a lot of experience reading software will probably know most of the material in these books, but they will still be able to pick up some useful techniques. Books are only a start, of course. Learning to read software takes practice. However, the books are a guide that will make learning faster.

Is reverse engineering necessary? Perhaps software developers can provide documentation that is good enough that the people who come after them do not have to read the software, but only need to read the documentation. Although this is possible in theory, in practice good documentation is too expensive for all but the most popular software. Developers try to automate documentation with tools like JavaDoc. These tools are useful, but tend to produce detailed, low-level documentation. Developers still have to reverse engineer these documents to learn the purpose and the architecture of the system.

Reverse engineering will always be necessary. Even though more and more software is long-lived, most software dies young. It doesn't make sense to develop expensive documentation for software before we know whether it is needed. So, we will continue to have under-documented software and people will continue to add documentation to software long after it has been written.

The most important documentation is the high-level documentation. It is relatively easy to reverse engineer flow charts and UML diagrams. There are tools that help automate them. But the high-level documentation is more about the intent of the designers than the actual code, and so is hard to derive from the code. If a system is well-designed and there is good high-level documentation, then it is relatively easy to figure out the low-level details from the code. But the high-level documentation is hard to produce without close communication with the developers.
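
To make the contrast concrete, here is a minimal Python sketch of how mechanically the structural part of a class diagram can be recovered. It assumes Python 3.9 or later (for ast.unparse) and uses only the standard ast module. The sample classes Account and SavingsAccount and the function class_outline are invented for illustration and do not come from any system discussed in this article.

```python
import ast

# Hypothetical source code standing in for an undocumented system.
SAMPLE_SOURCE = '''
class Account:
    def deposit(self, amount): ...
    def balance(self): ...

class SavingsAccount(Account):
    def add_interest(self, rate): ...
'''


def class_outline(source):
    """Map each top-level class name to its base-class names and method names."""
    outline = {}
    for node in ast.parse(source).body:
        if isinstance(node, ast.ClassDef):
            bases = [ast.unparse(base) for base in node.bases]
            methods = [
                item.name
                for item in node.body
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef))
            ]
            outline[node.name] = {"bases": bases, "methods": methods}
    return outline


if __name__ == "__main__":
    for name, info in class_outline(SAMPLE_SOURCE).items():
        parents = ", ".join(info["bases"]) or "object"
        print(f"{name} ({parents}): {', '.join(info['methods'])}")
```

The outline gives class names, inheritance, and methods, which is roughly what a class diagram shows. Nothing in it says why the classes were divided this way or what assumptions the designers made, and that is the part that has to come from the developers.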

High-level documentation is more valuable than low-level documentation. Low-level documentation is easy to produce, but it quickly becomes out of date. High-level documentation requires more thinking, but it is less likely to become obsolete as new features are added. Low-level documentation tends to be large and so, often, it is not read. Newcomers are much more likely to read a 20-page high-level document than a 200- or 2000-page low-level document. So, high-level documentation is more likely to be correct and is more likely to be read.

Reverse engineering is always going to be an important skill. However, we can build our systems in such a way that they are easier to reverse engineer by documenting the goals and assumptions of the system and its architecture. This would not only make the systems more maintainable, it would help future generations of software archaeologists.

References

[1] Chikofsky, E.J. and Cross, J.H., "Reverse Engineering and Design Recovery: A Taxonomy", IEEE Software, pp. 13-17, 1990.

[2] Demeyer, S., Ducasse, S., and Nierstrasz, O., Object-Oriented Reengineering Patterns. Morgan Kaufmann, 2002.

[3] Diamond, J., Collapse: How Societies Choose to Fail or Succeed, Penguin Books Ltd, 2005.

[4] Gamma, E., Helm, R., Johnson, R., and Vlissides, J., Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, 1995.

[5] Hafiz, M., Security Architecture of Mail Transfer Agents. MS Thesis, University of Illinois at Urbana-Champaign, 2005. Available on the internet.

[6] Spinellis, D., Code Reading: The Open Source Perspective. Addison-Wesley, 2003.

About the Author

Professor Johnson is on the faculty of the Department of Computer Science at the University of Illinois. He is the leader of the UIUC patterns/Software Architecture Group and the coordinator of the senior projects program for the department. His professional interests cover nearly all things object-oriented, especially frameworks, patterns, business objects, Smalltalk, COM and refactoring. He received his PhD and MS from Cornell University and his BA from Knox College.

Department of Computer Science

1304 West Springfield Avenue

Urbana, IL 61801

(217) 244-0093

(217) 244-6869- fax

October 2005
Vol. 8, Number 3

Software Archaeology
