Reverse Engineering and Software Archaeology
By Ralph E. Johnson - University of Illinois at Urbana-Champaign
I often study software systems written by someone else. When I was an undergraduate, I learned
compilers by reading some books and a couple of Pascal compilers. When I came to the
University of Illinois as a young professor in 1985, I wanted to learn how object-oriented
programming changed the way people developed software, especially how it enabled the
development of reusable software, so I studied a variety of frameworks. I developed some with
my students, but I learned just as much studying ones that other people developed. For the last
ten years, my focus has been as much on software patterns as on frameworks, and I have
continued to study other systems. I can learn several systems in the time it would take to build
one. Since my job is to learn about software rather than to produce it, I spend as much time
studying existing systems as developing new ones.
Although there are many differences between my life as an academic and the life of a typical
software developer, one similarity is the need to understand someone else's software. Most
programmers spend most of their time working with software written by someone else. Even
when they think they are starting a new project, they usually end up spending a lot of time trying
to figure out one of the packages that they are using. Documentation is rarely good enough, so
they will usually have to learn the design by experimentation. If they are using open source
software then they can learn the design by reading the code, though this is making a virtue of
necessity, since open source software usually doesn't have much documentation.
Studying software is serious business and should be treated seriously. Unfortunately, many
developers think of it as a necessary evil. They are paid to write new software, and they consider
time spent reading old software as wasted. However, old software is valuable. It is valuable to
users and it is valuable to software developers. Like anything of value, we need to understand it
so we can take care of it and make use of it. As software lasts longer, the ability to learn an
existing software system becomes more important. Learning the design of an existing system is
much more than a necessary evil; it is crucial to modern software development. Learning a
software system can be hard work, but sometimes it is rewarded with flashes of insight and
moments of beauty.
One of the signs that old software has become important is that we have invented language to
talk about it. We talk about a "legacy" project, which deals with software that we inherited
from someone else, in contrast to a "green field" project, in which we get to choose a new
architecture and make the basic design decisions. We have to "reverse engineer" software with
inadequate documentation. We have to "reengineer" software when needed changes are too
large for a simple patch. People use terms like "software archaeology" to emphasize either that
there are no documents or experts who can explain the system, or that the system is very old.
But software archaeology is different from extreme reverse engineering. This article will show the
difference between reverse engineering and software archaeology and explain how both of them
are important to software developers, but in different ways.
Reverse Engineering
Software archaeology is closely related to reverse engineering and reengineering, but it is not the
same. Wikipedia gives a popular definition: reverse engineering is figuring out what a system
does without its source code. But the IEEE Technical
Council on Software Engineering (http://tcse.org/revengr) gives a broader definition that is more
often used by textbook authors, which is that reverse engineering is any process of analyzing a
system and creating a representation at a higher level[1]. While this includes converting a binary
to C, it also includes developing a UML class diagram for a Java program, making a flow chart for
a Fortran program, or documenting the interfaces of a package.
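As a minimal sketch of the last case, documenting the interfaces of a package, the following Python code lists the public classes and functions visible in a module's namespace, along with each function's signature and the first line of its docstring. The function name summarize_interface is hypothetical, and the standard json module is only a stand-in for whatever package is being studied.

```python
import inspect
import json  # stand-in for the package under study


def summarize_interface(module):
    """Return one-line descriptions of the public classes and functions in a module."""
    lines = []
    # getmembers returns (name, object) pairs sorted by name.
    for name, obj in inspect.getmembers(module):
        if name.startswith("_"):
            continue  # skip private and dunder names
        if inspect.isclass(obj):
            lines.append(f"class {name}")
        elif inspect.isfunction(obj):
            try:
                signature = str(inspect.signature(obj))
            except (TypeError, ValueError):
                signature = "(...)"  # signature not recoverable
            doc = inspect.getdoc(obj) or ""
            first_line = doc.splitlines()[0] if doc else ""
            lines.append(f"def {name}{signature}  # {first_line}")
    return lines


if __name__ == "__main__":
    for line in summarize_interface(json):
        print(line)
```

The result is a representation at a higher level than the code itself, which is all the broader definition requires. It says what the package offers, not why it was designed that way.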
Reengineering
Reengineering is "the examination and alteration of a subject system to reconstitute it in a new
form and the subsequent implementation of the new form" [1]. Reengineering almost always
requires reverse engineering. This is partly because few systems are documented as well as
they should be, and partly because a system is usually reengineered when something was not
taken into account during its design; had that concern been documented, it would have been
taken into account.
Reverse engineering does not imply that there is no documentation. It implies that there is not
enough of the right kind of documentation. Object Oriented Reengineering Patterns [2] by
Demeyer, Ducasse, and Nierstrasz is an outstanding book on software reverse engineering and
reengineering. One of its main points is to use interviews with users and maintainers when
possible and to read available documentation. Reverse engineering is not only needed when you
have a binary written in an unknown language; it is also important when you are trying to make a
change to a large system that is different from any previous changes. Reverse engineering is
sometimes needed even for systems with an active development community and thousands of
pages of documentation.
Software Archaeology
"Software archaeology" conjures up images of people poring over documents in a forgotten
language, looking at broken artifacts whose purpose is unknown. But software archaeology is
more than looking at old code. Most programmers add features to old code every day. Age is
not always related to obscurity. Some systems built last year are more obscure and hard to
understand than many old systems. Most current programmers were not even born when Unix
was first released to the public, yet a lot of the code in modern Unix kernels dates to that time and
Unix Release 6 is easily understood by modern Unix kernel programmers.
Software archaeology is a point of view, not a set of technologies. The tools and techniques
used for software archaeology are the same as those used for reverse engineering. But while
reverse engineering has an immediate purpose, software archaeology has a much longer-range
purpose. We reverse engineer a system because we want to improve or replace it. Maybe we
just need to improve the documentation to improve maintenance costs. Maybe we want to
change a part of the system and do not know it well enough to plan the change. Maybe we want
to make a new system but want to make sure that it is as good as the old so we need a
specification of the old system. All these are examples of reverse engineering. Reverse
engineering is a means to an end. Eventually we will stop it and move on to "more important
things".
Archaeologists unearth old civilizations because they are curious about how people lived and they
want to better understand how we became the people we are. They are not looking to solve
current problems like overpopulation, though books like Collapse [3] show that archaeology can
tell us a lot about these problems. Archaeology is general and long-term, and is definitely not a
means to an end.
Similarly, software archaeologists are trying to learn about software, not to solve an immediate
problem. We believe that better understanding of software will lead to better software. We don't
study existing systems to blindly copy them, but to learn from them. We study systems to learn
patterns, to learn different architectural styles, to see principles and techniques in action.
One of the best examples of software archaeology is Grady Booch's project to write a handbook
for software architects (http://www.booch.com/architecture). The core of this project is studying a
hundred important software systems, such as Photoshop, the Google search engine, AWACS,
and 5ESS. The book will have a few pages about each project, describing the key architectural
features of each. It will also have a collection of architectural patterns that Grady Booch expects
to discover along the way. This is a very challenging project, but one that will make a
tremendous impact if it is successful.
A more mundane example is a recent project of mine to learn more about security. I decided to
study a secure system to see why it was so secure. So, Munawar Hafiz and I looked at qmail, a
mail transport agent that is a competitor to sendmail. Since its introduction in 1997, qmail has had a
perfect security record. We compared its design to that of sendmail and found a number of
differences that led to better security. The designs and our analysis of what makes qmail so
secure are reported in Munawar Hafiz's MS thesis [5].
Since qmail is a system that is heavily used and that is probably understood by many people, you
might argue that it was not a case of software archaeology. However, you'd be wrong. Like most
open source software, qmail doesn't have much documentation. The only documentation that
comes with the system is man pages. They describe the interfaces to the components of qmail.
We learned a little from web pages written by others, very little directly from the author of qmail,
and learned the most about the design by reading the code. It was a classic case of reverse
engineering, but we didn't do it to add a feature or fix a bug. We reverse engineered qmail so we
could learn why it was so secure and learn lessons we could apply to other systems.
Why Reverse Engineering Is Important
All programmers need to know how to reverse engineer a software system. They need those
skills when they join an on-going project, or when they want to learn to use a new component.
Software engineering texts tend to treat reverse engineering as a minor issue, something that
only a few people need to know. In reality, reverse engineering is a crucial skill, probably more
important than being able to design a new system. Maintenance programming is all about
figuring out what a system does and how to change it. New programmers are often assigned to
maintenance until they prove themselves. So, if they can't figure out what an existing system
does, they will never be given the chance to design a new system. Most new systems are built
from reusable components, so even if a programmer is designing a new system, it is important to
be able to study the components and to infer how to use them. On large projects, crucial design
decisions are usually made by a small group, and most of the team spends more time figuring out
the existing system than designing new features. Thus, many developers are more likely to use
reverse engineering skills than design skills.
During the last few decades, maintenance and software reuse have both increased. However,
the quality of software documentation has not increased that much, so the need to reverse
engineer software has increased. It is good that we have more software that is worth
maintaining. The cost of building good software that is worth keeping is that we have to maintain
it. The amount of maintenance will probably go up. Unless there is a breakthrough in the quality
of documentation, this means that reverse engineering will be even more important in the future.
Why Software Archaeology Is Important
Software archaeology is less important than reverse engineering. Reverse engineering has
immediate benefits while software archaeology has longer-term benefits. Nevertheless, the
benefits of software archaeology are real and can be large.
One of the main reasons I study systems is to discover patterns. Patterns are always discovered
by looking at existing systems. Often, these systems are not learned for the purpose of
discovering patterns. Instead, the discovery of patterns is a side-effect of learning how to use a
library or change a system. For example, the four of us who wrote Design Patterns [4] had all been
interested in framework design and had studied many frameworks. We were not studying the
frameworks to learn patterns, but we had learned them anyway. However, most software
developers will not work on more than one or two systems per year, and many developers will
work on a system for several years before moving to another. Moreover, these systems might
only use well-known patterns. To find new patterns, you have to look in new places. This is why
I studied qmail. None of the systems that I knew were very secure. If I wanted to learn about
security then I would have to look for other systems.
Another reason to study a system is to confirm a pattern or to learn more about it. When I'm
trying to write a pattern, people often tell me about other systems that have the same pattern, and
I'll look at them to see whether it really is the same pattern and to learn variations and
complications. Design is about trade-offs, and every pattern has its dark side. I try to find a
system where a pattern is abused so that I can learn when not to use it.
There are many other reasons to study existing systems. You can learn about a class of
programs by reading a few examples, as I did with compilers. You can learn an architectural
style or how to achieve a particular software quality, like reliability or portability. The best way to
fight a patent that you think is improper is to find prior art, and that often requires digging through
old software. If you want to understand how software evolves, you not only have to look at old
software, you have to look at several versions. Practicing software archaeology is a great way to
learn new design ideas. It is not for everybody, since many software developers just want to
focus on getting the job done. But it is important for software developers who want to get better
at doing their job.
Lessons Learned
So far, I have been fairly philosophical and have focused on definitions rather than tools and
techniques. I'd like to leave you with two concrete suggestions that follow from what I've said.
The first is how people can learn to read software. The second is how to make software easier to
read.
There are few courses on how to study software, and most schools do not teach it. The
University of Illinois (where I teach) is not an exception. Several of us try to teach code reading
as a side-effect of our courses by giving students software packages that they are required to
understand and to change. But we do not spend much time explaining how to approach a large
system, and other professors might teach the course differently.
Some of our graduates are experts at learning large software systems. These are usually those
students who have worked on several open source software projects. They have learned several
large systems and know that they can do it again. But these students almost always picked up
their skills outside of class.
Therefore, if you want to make sure that developers know how to learn software systems, you
can't depend on their college education. I recommend two books on the subject of studying
software: "Code Reading: The Open Source Perspective" by Diomidis Spinellis [6] and "Object
Oriented Reengineering Patterns" by Demeyer, Ducasse, and Nierstrasz [2]. The first book is
more code-oriented and focuses on Unix and C. The second is focused on larger systems,
object-oriented systems, and on reengineering a system once you have learned it. People with a lot of
experience reading software will probably know most of the material in these books, but they will
still be able to pick up some useful techniques. Books are only a start, of course. Learning to
read software takes practice. However, the books are a guide that will make learning faster.
Is reverse engineering necessary? Perhaps software developers can provide documentation that
is good enough that the people who come after them do not have to read the software, but only
need to read the documentation. Although this is possible in theory, in practice good
documentation is too expensive for all but the most popular software. Developers try to automate
documentation with tools like JavaDoc. These tools are useful, but tend to produce detailed, low-
level documentation. Developers still have to reverse engineer these documents to learn the
purpose and the architecture of the system.
Reverse engineering will always be necessary. Even though more and more software is long-
lived, most software dies young. It doesn't make sense to develop expensive documentation for
software before we know whether it is needed. So, we will continue to have under-documented
software and people will continue to add documentation to software long after it has been written.
The most important documentation is the high-level documentation. It is relatively easy to
reverse engineer flow charts and UML diagrams. There are tools that help automate them. But
the high-level documentation is more about the intent of the designers than the actual code, and
so is hard to derive from the code. If a system is well-designed and there is good high-level
documentation, then it is relatively easy to figure out the low-level details from the code. But the
high-level documentation is hard to produce without close communication with the developers.
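To illustrate how mechanical the low-level part is, here is a minimal sketch that recovers a crude class outline (class names, base classes, and method names) from Python source text using the standard ast module. The SOURCE snippet and the function name class_outline are hypothetical examples, and ast.unparse assumes Python 3.9 or later.

```python
import ast

# Hypothetical source text standing in for an undocumented system.
SOURCE = '''
class Shape:
    def area(self): ...

class Circle(Shape):
    def area(self): ...
    def radius(self): ...
'''


def class_outline(source):
    """Map each class name to (base class names, method names) found in the source."""
    outline = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            bases = [ast.unparse(base) for base in node.bases]
            methods = [
                item.name
                for item in node.body
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef))
            ]
            outline[node.name] = (bases, methods)
    return outline


if __name__ == "__main__":
    for name, (bases, methods) in class_outline(SOURCE).items():
        parents = f" extends {', '.join(bases)}" if bases else ""
        print(f"{name}{parents}: {', '.join(methods)}")
```

A few dozen lines recover the structure, but nothing in the output says why Circle exists or what trade-offs shaped the hierarchy. That intent is the part that has to come from the designers.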
High-level documentation is more valuable than low-level documentation. Low-level
documentation is easy to produce, but it quickly becomes out of date. High-level documentation
requires more thinking, but it is less likely to become obsolete as new features are added. Low-
level documentation tends to be large and so, often, it is not read. Newcomers are much more
likely to read a 20 page high-level document than a 200 or 2000 page low-level document. So,
high-level documentation is more likely to be correct and is more likely to be read.
Reverse engineering is always going to be an important skill. However, we can build our systems
in such a way that they are easier to reverse engineer by documenting the goals and
assumptions of the system and its architecture. This would not only make the systems more
maintainable, it would also help future generations of software archaeologists.
References
[1] Chikofsky, E.J. and Cross, J.H., "Reverse Engineering and Design Recovery: A Taxonomy",
IEEE Software, pp.13-17, 1990.
[2] Demeyer, S., Ducasse, S., and Nierstrasz, O., Object Oriented Reengineering Patterns.
Morgan Kaufmann, 2002.
[3] Diamond, J., Collapse: How Societies Choose to Fail or Succeed, Penguin Books Ltd, 2005.
[4] Gamma, E., Helm, R., Johnson, R., and Vlissides, J., Design Patterns: Elements of Reusable
Object-Oriented Software. Addison-Wesley, 1995.
[5] Hafiz, M., Security Architecture of Mail Transfer Agents. MS Thesis, University of
Illinois at Urbana-Champaign, 2005. Available on the internet.
[6] Spinellis, D., Code Reading: The Open Source Perspective. Addison-Wesley, 2003.
About the Author
Professor Johnson is on the faculty of the Department of Computer Science at the University of
Illinois. He is the leader of the UIUC patterns/Software Architecture Group and the coordinator of
the senior projects program for the department. His professional interests cover nearly all things
object-oriented, especially frameworks, patterns, business objects, Smalltalk, COM and
refactoring. He received his PhD and MS from Cornell University and his BA from Knox College.
Department of Computer Science
1304 West Springfield Avenue
Urbana, IL 61801
(217) 244-0093
(217) 244-6869 (fax)