Software Tech News 4-4: Msg*Log: E-mail-based Agent Messaging to Improve Robustness in a Distributed Logistics Planner

Volume 4 Number 4 - Software Agents

Msg*Log: E-mail-based Agent Messaging to Improve Robustness in a Distributed Logistics Planner

Tom Bannon, Steve Ford, Craig Thompson, and David Wells, Object Services and Consulting Inc.

1. Introduction

The Department of Defense (DOD) Joint Vision 2020 [1] identified "focused logistics" as a key component for achieving full spectrum dominance in future conflicts. To that end, the Defense Advanced Research Projects Agency (DARPA) Advanced Logistics Project (ALP) has successfully demonstrated the use of an extensible collection of loosely coupled agents to generate, monitor, and dynamically modify complex, multilevel logistics plans. ALP agents reside on an agent infrastructure called Cougaar (Cognitive Agent Architecture). The current DARPA Ultra*Log program is improving the robustness, scalability, and security of the Cougaar agent infrastructure with the goal of allowing ALP agent-based planning to survive and perform in a highly chaotic environment in which unforeseen or uncontrollable events interfere with the ALP planners' operation, and with the ability of the military procurement, transport, and warehousing organizations to physically execute the plans generated by ALP.

Because the ALP planning process is highly distributed, reliable and timely message delivery to multiple planning agents is mandatory, even as the communications infrastructure degrades and as the planning agents move. The existing Cougaar communications mechanism is based on Java Remote Method Invocation (RMI) extended with message queues and retry policies. This performs well under stable conditions, but rapidly loses its ability to deliver messages as chaos increases due to inherent properties of the RMI protocol. This failure will ultimately make it impossible for Ultra*Log to achieve its stated robustness and performance goals of suffering "no more than 20% capability degradation and 30% performance degradation under conditions of 45% information infrastructure loss in an environment of 90% of maximal real-world chaos".

This article describes the ongoing Msg*Log (Messaging Logistics) effort to improve the robustness of Cougaar agent communications by transparently adding message transport mechanisms based on the robust and ubiquitous E-mail and Netnews protocols. Together with the existing RMI transport, the resultant three transport mechanisms have different operational properties that make each suitable under different conditions. To capitalize on this, Msg*Log provides a higher level Adaptive Message Transport capability to automatically switch among transport mechanisms to cope with degradation of the communications infrastructure and to adjust the level of redundancy.

2. The Logistics Problem

Logistics planning and execution is a huge and complex problem, with operational and financial implications for the DOD. The DOD states that:

If Desert Shield/Storm logistics had optimized lift scheduling, detailed coordination between planning and execution, and visibility into the logistics pipeline, significant improvements would have been possible in deployment surges and resource sequencing, planning and replanning driven by changing requirements, and improved control over the logistics pipeline. [11]

Estimates are that this would have allowed the campaign to conclude 100 days sooner, reduced the quantity of material transported by 1M tons, and reduced logistics costs by $800M. These represent improvements of approximately 45%, 33%, and 40%, respectively.

Technically, logistics planning and monitoring poses a challenging problem. It is clearly a large problem as evidenced by the Desert Storm example. The types of planning activities are varied, including inventory, packing, route planning, and scheduling, each subject to multitudinous and varied constraints. The problem is highly distributed, yet interconnected; the supported organizations are autonomous and can reside anywhere on the globe, but not only does an organization need to know its own part of the overall plan, it needs to know enough about other organizations and their plans to be able to collaborate in planning and execution. Complicating matters further, these organizations can move during the course of an operation, sometimes in an unexpected manner. Additional dynamism enters the system in the form of changed logistics requests and the potential loss of logistics resources as the military operation progresses, any of which may require plans to be modified on-the-fly. The tempo and scope of operations can change rapidly as the situation progresses from peace to deployment to combat to peacekeeping and back to peace. Finally, all this must be accomplished in an environment in which opponents are attempting to prevent logistics activities from being planned and executed by both kinetic and information attacks.

3. Agent-based Logistics Planning

A problem of this nature calls for a highly flexible, scalable, and robust computing solution that can be incrementally improved and expanded. The DARPA Advanced Logistics Project (ALP) and its successor Ultra*Log are developing an agent-based computing solution to distributed logistics under stress.

The following presents sufficient information about ALP and Cougaar to understand the role of Msg*Log.

More information about ALP and Ultra*Log can be found at [5] and [6]. Complete details of the Cougaar agent infrastructure on which ALP planning agents reside can be found at [7].

The Cougaar architecture provides the technical basis for a loose confederation of ALP planners that interoperate via message passing. ALP planners, corresponding to Cougaar agents, can be grouped into nodes and communities that perform more complex planning or monitoring. Agents generally perform a very specific task (e.g., job-shop scheduling). A node is Cougaar software that supports one or more agents within a single Java Virtual Machine (JVM); typically several related agents will be placed into a single node to optimize communications (e.g., a scheduler, packing planner, and associated data access supporting a warehousing organization could reside in a single node). A community is a logical grouping of agents, typically associated with a DOD organization (e.g., the US Transport Command). Communities can be hierarchical. Communities have no physical interpretation in Cougaar, but since they correspond to an organization, there is often at least some physical "closeness" between members of a community (e.g., on a LAN or within a single base). A society is a collection of agents and communities working on a common problem.

All but the simplest plans will involve interaction of multiple agents, nodes, or communities and will involve considerable remote interaction. Thus, robust ALP planning requires "survivable" internode and intercommunity communication that will continue to operate as networks partition or become swamped, and as planning agents and nodes migrate along with their owning organization or as part of an overall survivability plan. Figure 1 features a graphic by Dr. Todd Carrico of DARPA, which illustrates some of these concepts.

Figure 1: Technical Vision - Applying Agent Technology to the Logistics Domain

The current Cougaar implementation uses Java RMI (Java's RPC mechanism) for internode communication. The strength of RMI is its tight integration with Java and its speed. Its weakness is its dependence on point-to-point direct connectivity between sender and receiver and on accurate knowledge of the recipient's IP address and port ID. Under low chaos conditions, these requirements are generally met and when violated are easily remedied, making RMI a good choice, especially when combined with message queues and retry policies (which Cougaar currently supports). However, as chaos increases, these requirements are less likely to be met: partitions become more frequent and take longer to repair, while at the same time, nodes are more likely to migrate. In the worst case, end-to-end connectivity between critical pairs of nodes may never exist, at least not often enough to produce plans in an acceptable time. Thus, even though an RMI-based approach can tolerate highly chaotic conditions, it will not be able to accomplish much during them, which will prevent Ultra*Log from reaching its goal of "no more than 20% capability degradation and 30% performance degradation under conditions of 45% information infrastructure loss in an environment of 90% of maximal real-world chaos".
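The queue-and-retry pattern mentioned above can be sketched roughly as follows. This is a minimal illustration and not Cougaar's code: `Link`, `RetryingSender`, and the per-flush attempt limit are hypothetical names, and `Link` merely stands in for a point-to-point call such as an RMI invocation.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in for a point-to-point send such as an RMI call.
interface Link {
    boolean trySend(String message); // false when the receiver is unreachable
}

// Queues outgoing messages and retries the head of the queue up to a limit.
class RetryingSender {
    private final Deque<String> queue = new ArrayDeque<>();
    private final Link link;
    private final int maxAttemptsPerFlush;

    RetryingSender(Link link, int maxAttemptsPerFlush) {
        this.link = link;
        this.maxAttemptsPerFlush = maxAttemptsPerFlush;
    }

    void enqueue(String message) {
        queue.addLast(message);
    }

    // Drains the queue in order and stops as soon as one message exhausts
    // its attempts. Returns the number of messages delivered in this call.
    int flush() {
        int delivered = 0;
        while (!queue.isEmpty()) {
            boolean sent = false;
            for (int attempt = 0; attempt < maxAttemptsPerFlush && !sent; attempt++) {
                sent = link.trySend(queue.peekFirst());
            }
            if (!sent) {
                break; // receiver still unreachable; the message stays queued
            }
            queue.removeFirst();
            delivered++;
        }
        return delivered;
    }

    int pending() {
        return queue.size();
    }
}
```

While the link is down, `flush()` delivers nothing and the queue simply grows, which is the sense in which such a transport tolerates a partition without making progress during it.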

4. Msg*Log

Clearly, an important requirement for overall ALP/Cougaar survivability [5][6][7] is for inter-agent messaging to be reliable and timely even when the communications infrastructure is stressed. Since Cougaar's RMI-based message transport is not reliable under those conditions, Cougaar needs a better way to perform remote messaging when chaos increases.

Two ways to address this suggest themselves. RMI can be further extended with intermediate servers and naming and redirection mechanisms to provide a store-and-forward capability and support for mobile agents, or alternate transport mechanisms that already provide these capabilities can be adapted for Cougaar use. RMI extensions would require the ability to install software on intermediate network nodes, a degree of access not generally available in third-party networks. Since use of such networks is required for many applications, including logistics, RMI extensions alone will be insufficient. Fortunately, robust protocols that allow message delivery across frequently partitioning networks and that support mobile recipients while not requiring control over intermediate nodes already exist, namely Simple Mail Transfer Protocol (SMTP) and Network News Transfer Protocol (NNTP), the E-mail and Usenet newsgroup protocols.

SMTP and NNTP-based internode communications mechanisms have two characteristics that recommend them. Like RMI, SMTP and NNTP are both well defined, mature protocols with many implementations, so any system built on top of them will have industrial-strength underpinnings. The many SMTP and NNTP servers already in place ensure wide "reach" for the ALP system, in fact allowing delivery of messages to places unreachable by RMI. More importantly from a survivability perspective, SMTP and NNTP have different operational characteristics that, while making them slower than RMI, allow them to continue to function with reasonable performance at levels of chaos that would preclude RMI from delivering any messages at all.

First, unlike RMI, which requires end-to-end connectivity between sender and receiver, SMTP and NNTP use store-and-forward mechanisms that only require individual hops along a delivery path to be functioning. A consequence is that SMTP and NNTP can deliver messages even if sender and receiver are never connected at any point in time. In a frequently partitioning network environment, this is clearly an advantage.
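A toy simulation makes the difference concrete. The sketch below is illustrative only: the class and method names are hypothetical, time is modeled as discrete steps, and a message is assumed to advance at most one hop per step. It assumes a non-empty schedule.

```java
// Compares store-and-forward delivery with end-to-end delivery over a chain of hops.
class StoreAndForwardSketch {
    // linkUp[t][i] says whether hop i (from holder i to holder i + 1) works at step t.
    // Returns the step at which the message reaches the receiver, or -1 if it never does.
    static int storeAndForwardArrival(boolean[][] linkUp) {
        int hops = linkUp[0].length;
        int position = 0; // index of the relay currently holding the message
        for (int t = 0; t < linkUp.length; t++) {
            // Each relay stores the message until its outgoing hop is usable.
            if (linkUp[t][position]) {
                position++;
                if (position == hops) {
                    return t;
                }
            }
        }
        return -1;
    }

    // End-to-end delivery needs every hop to work during the same step.
    static int endToEndArrival(boolean[][] linkUp) {
        for (int t = 0; t < linkUp.length; t++) {
            boolean allUp = true;
            for (boolean up : linkUp[t]) {
                allUp &= up;
            }
            if (allUp) {
                return t;
            }
        }
        return -1;
    }
}
```

With a three-hop path on which only one hop works at a time (hop 0 at step 0, hop 1 at step 1, hop 2 at step 2), the store-and-forward message arrives at step 2, while end-to-end delivery never becomes possible.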

Second, RMI requires that a sender have knowledge of its recipient's IP address and port ID, which is difficult to ensure if ALP agents/nodes and logistics organizations are moving around in cyberspace or the real world. Since in both SMTP and NNTP, messages are "picked up" rather than delivered, all they require is that a sender know how to contact some "mail drop" where a recipient can look for its messages. This allows message delivery in cases where the sender does not know where a mobile recipient resides, but the recipient knows where it can pick up its mail. Since there can be many mail drops and these can be relatively stationary, this provides a much more robust solution. This can be exploited to further improve robustness by sending the same message to multiple servers, making it more likely that a recipient will be able to access its messages from somewhere, even in the face of many failures. NNTP in particular is efficient at distributing messages with high fan out through multiple layers and large numbers of servers, making it particularly useful under highly chaotic conditions.
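The benefit of sending the same message to several mail drops can be estimated with a simple model. The sketch below assumes, purely for illustration, that each drop is reachable by the recipient independently with the same probability p; real failures (for example, loss of geographically adjacent links) are likely to be correlated, so the result is best read as an optimistic bound. The class and method names are hypothetical.

```java
// Back-of-the-envelope model of redundant mail drops.
class MailDropRedundancy {
    // Probability that at least one of k mail drops is reachable, assuming each
    // is reachable independently with probability p (a simplifying assumption).
    static double atLeastOneReachable(double p, int k) {
        if (p < 0.0 || p > 1.0 || k < 0) {
            throw new IllegalArgumentException("need 0 <= p <= 1 and k >= 0");
        }
        return 1.0 - Math.pow(1.0 - p, k);
    }
}
```

Under this model, if each drop is reachable only half the time, a single drop gives a 0.5 chance of pickup and three drops give 0.875.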

The new Msg*Log SMTP and NNTP-based transport mechanisms convert Cougaar inter-agent messages into E-mail messages or news postings, transmit them via the existing E-mail or news infrastructure, convert them back to Cougaar messages on the receiving end, and hand them off to the existing Cougaar mechanisms for delivery to the appropriate local agents. For compatibility with the existing Cougaar RMI-based transport, Msg*Log uses Java serialization as an efficient external representation for both SMTP and NNTP message transport. Other encodings are certainly possible; a previous OBJS E-mail-based agent communications system called eGents used XML as an external encoding in the CoABS 24x7 Grid [15], where inter-language portability and human readability were more crucial than performance.
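The conversion round trip can be sketched as below. This is an assumption-laden illustration rather than the Msg*Log implementation: `AgentMessage` and `MailCodec` are hypothetical names, and wrapping the serialized bytes in MIME Base64 text is our assumption about how binary data might be carried in a mail or news body. Only the use of Java serialization comes from the description above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Base64;

// Hypothetical stand-in for a Cougaar inter-agent message.
class AgentMessage implements Serializable {
    private static final long serialVersionUID = 1L;
    final String fromAgent;
    final String toAgent;
    final String payload;

    AgentMessage(String fromAgent, String toAgent, String payload) {
        this.fromAgent = fromAgent;
        this.toAgent = toAgent;
        this.payload = payload;
    }
}

class MailCodec {
    // Serializes the message and encodes it as text suitable for a message body.
    static String encode(AgentMessage message) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(message);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Base64.getMimeEncoder().encodeToString(bytes.toByteArray());
    }

    // Reverses encode(): decodes the body text and deserializes the message.
    static AgentMessage decode(String body) {
        byte[] raw = Base64.getMimeDecoder().decode(body);
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
            return (AgentMessage) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("receiver lacks the message class", e);
        }
    }
}
```

The string returned by `encode` would become the body of the E-mail message or news posting, and the receiving node would call `decode` before handing the message to its local delivery mechanism.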

Msg*Log's implementation uses existing E-mail and news servers, and existing Java SMTP, POP3, and NNTP clients. This enables applications to control routing at a level to which they have access, via established Internet services that route public messages between servers out of line with the direct path between two Cougaar nodes. With respect to caching, it should be noted that Internet routers do not cache packets for significant periods of time waiting for a connection to the next router, so an additional advantage of our design over dynamic network routing is that ours operates in the presence of great latency and full disconnectedness.

Cougaar's NameServer interface (NameServer is an interface, not a class) is implemented to map Cougaar nodes to one or more E-mail addresses. While there are several possibilities for mapping Cougaar agent addressing to E-mail addressing, for compatibility with the Cougaar architecture, our current approach requires an E-mail address for every Cougaar node (effectively a Java VM) in which multiple Cougaar agents can run as threads. Each Cougaar node has at least one E-mail address (an E-mail account on at least one E-mail server used for sending/receiving messages); multiple accounts at multiple servers are possible in the interest of sending and receiving messages via multiple routes. Both the SMTP and NNTP-based transport mechanisms include the ability to pick up E-mail or news-encoded inter-agent messages on an appropriate schedule and will have the ability to try alternate sites.
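The node-to-address mapping can be pictured with a small directory like the one below. This is a hypothetical sketch: `NodeMailDirectory` is not the actual Cougaar NameServer interface, whose methods are not described here, and the node names and example.org addresses used in any example are placeholders.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical directory interface; not the real Cougaar NameServer API.
interface NodeMailDirectory {
    void register(String nodeName, String emailAddress);
    List<String> addressesFor(String nodeName);
}

// Keeps one or more E-mail addresses per node, in registration order.
class InMemoryNodeMailDirectory implements NodeMailDirectory {
    private final Map<String, List<String>> addresses = new HashMap<>();

    @Override
    public void register(String nodeName, String emailAddress) {
        addresses.computeIfAbsent(nodeName, key -> new ArrayList<>()).add(emailAddress);
    }

    @Override
    public List<String> addressesFor(String nodeName) {
        List<String> found = addresses.get(nodeName);
        if (found == null) {
            return Collections.<String>emptyList();
        }
        return Collections.unmodifiableList(found);
    }
}
```

A sender wanting redundancy could iterate over every address returned for a node, while a sender wanting minimal traffic could use only the first.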

The SMTP and NNTP-based transport mechanisms integrate easily into the Cougaar architecture, which provides explicit support for swapping transport mechanisms by subclassing the Cougaar MessageTransport class. This packaging makes the existence and use of transport alternatives transparent to Cougaar agents and in fact, to the rest of the Cougaar infrastructure. Cougaar agents need not know which transport mechanism is being used. If additional transport mechanisms were needed (e.g., based on broadcast or groupware), they would be added in the same manner.

The ability to route messages via alternate message transports improves the flexibility of internode communications. However, the bigger benefit comes by providing an adaptive mechanism to dynamically choose between those message transports and parameterize their use on a message-by-message basis using network QoS information and higher level application-dependent intelligence about the relative importance of the message and the likelihood of failure of its delivery for reasons not detectable via QoS metrics. For instance, if direct connectivity is not available between two nodes when a message is sent, then RMI will not work; if the need to send the message immediately is critical, then alternate routes to the recipient could be considered. Knowledge of the network topology combined with the partial information about the failure might suffice to identify a reliable route to the remote node via a chain of SMTP servers, or a set of routes of probable reliability. Alternatively, a security agent might have advised that a route over which direct communication would travel is not to be trusted, and alternate routes should be considered. Finally, general policies could state that under certain operational conditions (an InfoCon), public networks are not to be used for any data unless encrypted, regardless of the information's inherent sensitivity.

This is the idea behind Msg*Log's Adaptive Message Transport (AMT), a higher level transport mechanism with the ability to select among, parameterize, and monitor the behavior of the various lower-level transport mechanisms known to it. This mechanism merges disparate sources of relevant information, plans, and executes an appropriate communications strategy using the available transport mechanisms. AMT is designed to manage not only messaging via RMI, SMTP, and NNTP, but also to be extensible with future MessageTransport implementations.
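A hand-coded selection policy of the kind AMT might start with can be sketched as follows. Everything here is hypothetical: `TransportKind`, `SendContext`, `AdaptiveSelector`, and the particular rule order are illustrative assumptions, not the Msg*Log design. The inputs are loosely modeled on the considerations discussed above (direct connectivity, route trust, and fan-out).

```java
import java.util.Set;

enum TransportKind { RMI, SMTP, NNTP }

// Hypothetical snapshot of what the selector knows when a message is sent.
class SendContext {
    final boolean directConnectivity; // can the sender reach the receiver end to end now?
    final boolean directRouteTrusted; // false if a security agent has flagged the direct route
    final boolean highFanOut;         // true if the message goes to many recipients

    SendContext(boolean directConnectivity, boolean directRouteTrusted, boolean highFanOut) {
        this.directConnectivity = directConnectivity;
        this.directRouteTrusted = directRouteTrusted;
        this.highFanOut = highFanOut;
    }
}

class AdaptiveSelector {
    // Applies a fixed rule order; a real policy would also weigh QoS data and urgency.
    static TransportKind choose(SendContext ctx, Set<TransportKind> available) {
        if (ctx.directConnectivity && ctx.directRouteTrusted
                && available.contains(TransportKind.RMI)) {
            return TransportKind.RMI; // fastest option when its requirements hold
        }
        if (ctx.highFanOut && available.contains(TransportKind.NNTP)) {
            return TransportKind.NNTP; // news distribution suits many recipients
        }
        if (available.contains(TransportKind.SMTP)) {
            return TransportKind.SMTP; // store-and-forward fallback
        }
        if (available.contains(TransportKind.NNTP)) {
            return TransportKind.NNTP;
        }
        throw new IllegalStateException("no usable transport for this message");
    }
}
```

Because the selector only returns a `TransportKind`, a new transport could be accommodated by adding an enum value and a rule, without changing callers.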

5. Metrics

The effectiveness of Msg*Log in improving Cougaar robustness is evaluated by how well it increases the survivability of ALP logistics planning capability in the face of stresses to the computing and communications environment. To judge this, we want answers to the following questions:

When Msg*Log alone is added to Cougaar, how much do logistics plans degrade in quality and timeliness of generation when subjected to various kinds and intensities of communications stress, and how does that compare to identical planning conducted under similar conditions without Msg*Log?
When all Ultra*Log enhancements are added to Cougaar, how much do logistics plans degrade in quality and timeliness of generation when subjected to various kinds and intensities of communications stress, and how does that compare to identical planning conducted under similar conditions without Ultra*Log enhancements?

Robustness in the face of environmental stresses can only be measured by having a baseline of functionality of the unstressed system. There is no formal notion of a "best possible logistics plan", since it is never possible to state that a given plan could not be improved with more time or better planning tools or smarter logisticians. Thus, an appropriate baseline for planner behavior is:

the quality of the logistics plan,
generated in some nominal time,
by a given ALP society,
when requested to satisfy a given set of (possibly changing) logistics requests,
using a given set of (possibly changing) logistics resources (trucks, ports, etc.),
when the computing resources (computers, networks, data sets) used in the planning process are not stressed in any way (no failure, degradation, or corruption).
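Given such a baseline, degradation under stress can be expressed as a fraction of the baseline value and compared against the Ultra*Log thresholds quoted in Section 1 (20% capability, 30% performance). The sketch below is illustrative: how plan quality and performance are actually scored is not defined here, so the numeric scores (assumed to be "higher is better") and the class and method names are hypothetical.

```java
// Compares stressed results against an unstressed baseline.
class DegradationCheck {
    // Fractional loss relative to the baseline; 0.0 means no loss.
    // Scores are assumed to be positive and "higher is better" (an assumption).
    static double degradation(double baselineScore, double stressedScore) {
        if (baselineScore <= 0.0) {
            throw new IllegalArgumentException("baseline score must be positive");
        }
        return Math.max(0.0, (baselineScore - stressedScore) / baselineScore);
    }

    // Thresholds taken from the stated Ultra*Log goal: at most 20% capability
    // degradation and at most 30% performance degradation.
    static boolean meetsGoal(double capabilityDegradation, double performanceDegradation) {
        return capabilityDegradation <= 0.20 && performanceDegradation <= 0.30;
    }
}
```

For example, a stressed capability score of 85 against a baseline of 100 is a 15% degradation, which falls within the capability threshold.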

A test measures the effectiveness and operational behavior of Msg*Log across a range of stresses to the computing environment. A test uses a single test case (an ALP society, a computing environment, a set of (time varying) logistics requests, and a set of (time varying) logistics resources) to exercise the system. Testing will be done twice this year, once with a moderate-size test case and once with a 1000-node ALP society. In future years, tests of increasing complexity will be defined by the Ultra*Log program's assessment team. Within a test, several individual experiments inject varying types and intensities of stress. A model for injecting synthetic stresses can be found in [14]. As the computing environment is stressed, it is anticipated that logistics plan quality will degrade. The extent to which logistics plan quality does not degrade is a measure of Cougaar robustness. Stresses to be applied in the tests include:

Kinetic attacks

Permanent loss of geographically adjacent communications links
Permanent loss of target agents (what will Msg*Log do if the recipient dies?)
Permanent loss of SMTP and/or NNTP servers

Information warfare attacks

Denial of service attacks place excess load on communications links
Network partitioning
Rolling network partitions (partition moves across network)
IW attacks crash SMTP and/or NNTP servers
Spamming of SMTP and/or NNTP servers (causes delays or overflows)
Protocol attacks on RMI, SMTP, and/or NNTP protocols (renders a transport unusable)
Loss of stored messages at SMTP and/or NNTP servers

In addition to measuring the effectiveness of Msg*Log in increasing Cougaar robustness, we are interested in collecting operational or infrastructure level statistics about the concrete behavior of Msg*Log under different conditions. These operational statistics are to be used to set Msg*Log policy [3] and to direct future development. For these purposes, we want answers to the following questions about the behavior of the Msg*Log transport mechanisms:

What is the mechanism's performance under various kinds and intensities of communications stress, and how does that compare with the performance of the original RMI-based transport under similar conditions?
What is the mechanism's resource utilization under various kinds and intensities of communications stress, and how does that compare with the resource utilization of the original RMI-based transport?
Under what kinds and intensities of communications stress do each of the Msg*Log transport mechanisms become unable to perform reliable and timely message delivery, and how does this compare to the behavior of the original RMI-based transport?
Under what kinds and intensities of communications stress does Msg*Log (including the ability to select the "best" transport mechanism) become unable to perform reliable and timely message delivery, and how does this compare to the behavior of the original RMI-based transport?
How does Msg*Log affect the performance of the original RMI-based transport?

Three kinds of information must be collected to answer the above questions:

Message delivery information (when the message was sent, the transport mechanism used, message size, delivery time, distance between sender and receiver and the distance the message actually traveled (in terms of LAN and WAN hops), whether the recipient was fixed or mobile (not necessarily known until the message is received), and the environmental stresses present)
Disk and memory utilization at the SMTP & NNTP servers at various time points (for correlation with message send times to create a map of disk usage as a function of number and size of messages)
Bandwidth utilization
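The per-message delivery information listed above could be captured in a record along the following lines. The class and field names, units, and the use of -1 to mark an undelivered message are all assumptions made for illustration.

```java
// Hypothetical per-message measurement record.
class DeliveryRecord {
    final long sentAtMillis;      // when the message was sent
    final String transport;       // transport mechanism used, e.g. "RMI", "SMTP", "NNTP"
    final int sizeBytes;          // message size
    final long deliveryMillis;    // delivery time; -1 if the message was never delivered
    final int senderReceiverHops; // distance between sender and receiver
    final int traveledHops;       // distance the message actually traveled
    final boolean recipientMobile;
    final String stressLabel;     // environmental stresses present during the experiment

    DeliveryRecord(long sentAtMillis, String transport, int sizeBytes, long deliveryMillis,
                   int senderReceiverHops, int traveledHops, boolean recipientMobile,
                   String stressLabel) {
        this.sentAtMillis = sentAtMillis;
        this.transport = transport;
        this.sizeBytes = sizeBytes;
        this.deliveryMillis = deliveryMillis;
        this.senderReceiverHops = senderReceiverHops;
        this.traveledHops = traveledHops;
        this.recipientMobile = recipientMobile;
        this.stressLabel = stressLabel;
    }

    boolean delivered() {
        return deliveryMillis >= 0;
    }

    // Extra hops taken beyond the sender-receiver distance, e.g. via mail servers.
    int detourHops() {
        return traveledHops - senderReceiverHops;
    }
}
```

Keeping both distances in the record allows delivery time to be plotted against the sender-receiver distance while still showing how much indirection a transport introduced.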

From the raw statistics captured during testing, the following statistics should be computed:

For each transport mechanism, a scatter plot of message delivery times as a function of distance between sender and receiver (actual distance, not how far the message went)
For Msg*Log, a scatter plot of message delivery times as a function of distance between sender and receiver (actual distance, not how far the message went)
For each of the above, the number of undeliverable messages at each distance (factor out messages that were not delivered due to agent failure, since Msg*Log can't do anything about those)

The different experiments in a test will be conducted at different "levels of stress". The scatter plots produced for the experiments can be sliced to produce scatter plots of the effect of stresses as follows. For each inter-agent distance, a scatter plot is made that maps delivery times (at that inter-agent distance) as a function of the stress level. Figure 2 shows the expected shape of the graphs (read "stress level" for "chaos level"; the terms have been used interchangeably in the past, although "stress" is preferable because "chaos" has a specific, and different, meaning). Only the mean delivery time is shown, but the actual graphs will be scatter plots.

Figure 2: Mean Delivery Time as a Function of Environmental Stress (Sample)

For all of these, standard statistics such as mean delivery time (excluding undeliverable messages) and fraction of messages delivered in less than a given time T can be computed. If T is chosen meaningfully with regard to the Ultra*Log concept of operations, it becomes a useful predictor of whether a given transport mechanism, or Msg*Log itself, is likely to be able to provide timely message delivery under given stress and distance conditions. The presumed graph of this for a given distance and time T is shown in Figure 3.
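The two summary statistics named above can be computed as in the following sketch. The names are hypothetical, a negative value is used to mark an undeliverable message, and counting undeliverable messages in the denominator of the timeliness fraction is our assumption about the intended definition.

```java
// Summary statistics over one experiment's delivery times.
class DeliveryStats {
    // Mean delivery time over delivered messages only.
    // A negative entry marks an undeliverable message and is excluded.
    static double meanDeliveryTime(double[] deliveryTimes) {
        double sum = 0.0;
        int delivered = 0;
        for (double t : deliveryTimes) {
            if (t >= 0.0) {
                sum += t;
                delivered++;
            }
        }
        return delivered == 0 ? Double.NaN : sum / delivered;
    }

    // Fraction of all messages delivered in less than the limit T.
    // Undeliverable messages count against the fraction (an assumption).
    static double fractionDeliveredWithin(double[] deliveryTimes, double limit) {
        if (deliveryTimes.length == 0) {
            return Double.NaN;
        }
        int onTime = 0;
        for (double t : deliveryTimes) {
            if (t >= 0.0 && t < limit) {
                onTime++;
            }
        }
        return (double) onTime / deliveryTimes.length;
    }
}
```

For delivery times of 1, 3, and 8 seconds plus one undeliverable message, the mean delivery time is 4 seconds and the fraction delivered in less than T = 5 seconds is 0.5.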

Figure 3: Probability of Message Delivery as a Function of Environmental Stress (Sample)

6. Status

A preliminary Msg*Log design is complete, and implementation of the E-mail and NNTP Message Transports is underway. A proof-of-concept implementation of the E-mail Message Transport has been demonstrated using a sample Cougaar society. The first software delivery will include a small number of hand-coded Message Transport policies to select and parameterize the three Message Transports. We expect to integrate with BBN's QoS Monitoring Service [12][13] and evolve the Msg*Log design and implementation in concert with the evolving Cougaar architecture over the life of the program.

7. Plans

In the near term, we will be completing the integration of Msg*Log with the remainder of the Cougaar architecture, participating in survivability testing, doing general performance improvements, and writing some simple policies for Adaptive Message Transport.

Beyond that we will be developing more sophisticated adaptation mechanisms and developing communications policies compatible with evolving higher level survivability and security policies. We will also be investigating other techniques such as stealthy message delivery, traffic analysis masking, and service bidding that could well be supported by message transport schemes that allow delivery to multiple (perhaps intentionally spurious) addresses and delayed message pickup at convenient (and safe) times.

There is a strong possibility that we will add additional message transport mechanisms, in particular one based on extensions to RMI as discussed in Section 4. As noted there, RMI could be extended with intermediate servers, name mapping, and redirection to achieve functionality similar to the SMTP and NNTP transport mechanisms discussed in this article. However, the operational behavior might be quite different. In particular, SMTP and NNTP servers are optimized to minimize resources required per message (CPU/message and storage/message), but not necessarily for throughput speed (time/message), although that is generally good as a side effect. An RMI-based transport mechanism using store-and-forward servers built specifically to relay a message as fast as possible might be a useful alternative in proprietary networks where it is possible to install the servers at intermediate nodes.

8. Potential Future Work

There are a number of more speculative avenues of research which we wish to pursue.

We plan to examine the possibility of exercising direct control over message routing at the network level for traffic that can be limited to private networks. It is certainly feasible that some logistics communities might have dedicated intranets supporting at least part of their computing assets. We have not chosen this approach initially because over the open Internet today, there is insufficient access to routing information and insufficient control of routing at the network level available to the application to make it work. In this regard, recent work on intelligent swarms [9], in which insect simulations based on simple rules and weighted pheromone trails are applied to networks to find low cost routes that reconfigure when the network is disturbed, appears interesting.

Another area of interest is using Msg*Log to support groupware such as ISIS, Horus, and Ensemble [10]. Groupware systems allow the definition of groups of distributed processes and guarantee certain properties of message delivery to the group members, including message ordering, quorum calls, and handling of network partitions and reconnections. This might well be a useful capability for both logistics applications and agent systems in general, since it would allow synchronized communication with groups of agents working on a single task, say logistics "what-iffing". Groupware systems all rely on message delivery by a lower level transport mechanism; as such Msg*Log is complementary. In the case of Horus, this possibility is enhanced by a replaceable protocol stack that could be adapted relatively easily to use Msg*Log transport. Interesting technical questions arise when different members of a group must be reached via different transport mechanisms with different delivery properties.

It would also be interesting to explore using the ALP planning capabilities to choose message transport protocols and routing strategies; in effect to treat message delivery itself as a logistics problem occurring at the infrastructure level. A concern is that routing plans could not be made fast enough, with the result that performance would degrade. The severity of this problem would depend on how frequently routing plans would have to change in response to network loads and disruption. It is possible that "normal" selection and routing could be done by Msg*Log policies as described above, with only "hard" problems delegated to the logistics tools.

9. Conclusion

This article has described a system to improve the survivability of an agent infrastructure (Cougaar) that is being used to support a complex logistics planning system. Techniques to measure the improved survivability were also presented.

There has long been speculation that so-called "ilities" (survivability, scalability, quality of service, etc.) can best be added to software architectures by exercising control over the communication paths in a complex system. This conclusion was partially borne out by our own Object-Oriented Database System (Open OODB) [16] work on sentries, later similar work at Object Management Group (OMG) on interceptors for adding security to systems, complex ACME connectors replacing simpler ones, and our Object Services and Consulting (OBJS) work with MCC on the Object Infrastructure Project. If Msg*Log can significantly improve the survivability of a complex agent infrastructure by exercising such control, it would serve as an important data point in this larger architectural quest.

www.objs.com/

Author Contact Information

Tom Bannon, Object Services and Consulting Inc., 5111 Purdue Ave., Dallas, TX 75209. Phone: 214-902-8368. E-mail: [email protected]
Steve Ford, Object Services and Consulting Inc., 225 State Rd., West Grove, PA 19390. Phone: 610-345-0290. E-mail: [email protected]
David Wells, Object Services and Consulting Inc., 6111 Baywood Ave., Baltimore, MD 21209. Phone: 410-318-8938. E-mail: [email protected]
