Architecture and Performance of a Parallel Data-Fusion Multi-Target Tracking Code for Real-Time Applications

Charles Pedersen, AFRL Rome Research Site

Abstract

An existing serial code for fusing asynchronous data from multiple radar and optical sensors, and then simultaneously initiating, maintaining and dropping tracks on multiple missile targets, has been successfully parallelized at the Air Force Research Laboratory (AFRL) under the aegis of the DoD High Performance Computing Modernization Office (HPCMO).

This paper describes the functionality of the original serial code and the implementation of a parallel software architecture that distributes these functions across the multi-processor parallel architecture of the HPCMO Intel Paragon Computer at AFRL Rome Research Site (RRS). The objective of the work is to demonstrate a Fusion Tracker code that is both parallel and portable, potentially for implementation on real-time embeddable High Performance Computing (HPC) architectures.

The improved performance levels achieved by the Parallel Fusion Tracker are presented for the main metrics of interest in real-time applications, namely latency, total computation load, and total sustainable throughput. Results are presented for combinations of 1 to 126 targets being tracked on 1 to 126 parallel nodes, up to a total of 126 targets per node. It is shown that the single key parameter determining both latency and overall throughput is the number of targets per node, irrespective of other parameter variations. It is further shown that the overheads introduced by use of the Message Passing Interface (MPI) become negligible when there are more than 4 to 8 targets per node. Code and algorithm optimization of the baseline serial code remain as future tasks for achieving parallel performance levels above the 115 to 257 MFLOPS reported here.

Introduction

When a digital computer is contemplated for application in a real-time signal or information processing environment, an issue of major concern is "latency," the time delay between data being received at the computer input and the effects of that data appearing at its output. Latency, or information timeliness, in any given application is determined both by the amount of computation to be done and by the computational speed of the computer doing it, i.e., its effective "throughput," measured in millions of floating-point operations per second (MFLOPS). As the computational requirements for future real-time defense applications grow, seemingly without bound, the need to maintain information timeliness can be addressed by: 1) optimizing the processing algorithms; 2) increasing the computational speed of the processor; or 3) applying many processors in parallel High Performance Computing (HPC) architectures. In fact, all three of these technology avenues are continually being pursued, and when they are used in combination they lead to the record-breaking computational landmarks of the day. This paper specifically addresses the third alternative for performance improvement, namely parallel processing.
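As a first-order illustration of this relationship (the numerical values below are hypothetical and are not measurements from this work), the computational contribution to latency can be written as

    % First-order latency model implied by the discussion above (illustrative values only).
    \[
      t_{\mathrm{latency}} \;\approx\; \frac{W}{R},
      \qquad \text{e.g.}\quad
      \frac{10^{7}\ \mathrm{FLOP}}{100 \times 10^{6}\ \mathrm{FLOPS}} = 0.1\ \mathrm{s},
    \]

where W is the computational load per input data frame and R is the effective sustained throughput of the processor or processors applied to it.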

This paper is a study of the application of parallel high performance computing to a candidate serial algorithm for jointly accomplishing data fusion from many sensors and tracking multiple targets simultaneously in real time. The emphasis is on comparing the architectures of the serial and parallel algorithms, and on characterizing the performance benefits achieved by the parallel algorithm. In addition to technical results, this paper includes a discussion of the parallelization effort and particular lessons learned from it.

System Context

The future battlespace scenario that motivates the Fusion Tracker code is illustrated in Figure 1.


The notional scenario includes multiple satellite, airborne and ground radar, optical and infrared surveillance sensors viewing a large geographic footprint that contains large numbers of friendly, hostile and other targets of various types, whether aircraft, missiles or ground vehicles. In such a situation, the amount of information that can be collected, especially when imaging is included, is simply enormous and will swamp human interpretation and reaction times.

A simplified block diagram that describes the context for the data fusion and tracking program to be discussed is shown in Figure 2. It is assumed that, prior to arriving at the fusion tracking function, the data collected by each individual sensor has already been reduced to individual target detections. Each detection arrives at the fusion tracker input with associated target coordinates and data quality flags. In developing the baseline serial code it was further assumed that the results of data fusion and tracking are collected into a data server that contains the state vectors for each individual target in the battlespace. In this design, the downstream functions of target identification, prioritization, scheduling, interdiction resource allocation and so forth can operate asynchronously from the fusion and tracking process, drawing necessary target information from the state vector server as needed.
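As a purely illustrative sketch of the kind of records handled at this interface, a detection and a track state vector might be declared roughly as follows. The field names and dimensions are assumptions made for illustration; they are not the data structures of the actual Fusion Tracker code.

    /* Illustrative sketch only: hypothetical versions of the detection and
     * state-vector records described above.  Field names and dimensions are
     * assumptions, not the structures used in the actual Fusion Tracker code. */
    typedef struct {
        int    sensor_id;      /* which radar, optical or IR sensor produced it */
        double time_tag;       /* measurement time (asynchronous per sensor)    */
        double position[3];    /* target coordinates reported by the sensor     */
        int    quality_flags;  /* sensor-supplied data-quality indicators       */
    } Detection;

    typedef struct {
        int    track_id;       /* unique track label                            */
        double time_tag;       /* time of last update                           */
        double state[6];       /* position and velocity estimate                */
        double covariance[36]; /* 6x6 error covariance, stored row-major        */
    } TrackState;              /* the record held in the state-vector server    */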

The functions that then fall to the multi-sensor data fusion and multi-target tracking code include Input Measurement Processing, Data Association and Track Maintenance, and State Vector Management, as shown in Figure 2.


There were two driving requirements for the Parallel Fusion Tracker code: first, that detections and metric coordinates from 100 targets and other objects simultaneously present in the multiple fields of view would have to be processed in real time; and second, that embeddable HPC technology would be used in order to support HPC deployments in mobile and airborne applications.

Objective

The objective of this code parallelization effort was, therefore, to demonstrate an implementation of multi-sensor data fusion and multi-target tracking functions within an integrated multi-node portable HPC architecture. The key metrics to be determined in support of ongoing system analyses included: required computational throughput in MFLOPS; latency between receipt of input data and resulting outputs; and scalability, processor utilization and memory requirements. Furthermore, the standard Message Passing Interface (MPI) functions were to be used for inter-node communications in order to promote code portability across multiple HPC computer platforms.

Conclusions

A major factor in determining the measured performance levels is the fact that the fusion tracker code was parallelized "as-is," without optimization of either the C code or the algorithms employed. Both the serial and parallel fusion tracking codes are written in straight C code (matrix routines included), plus MPI calls for the parallel version. Neither version uses library functions optimized for the Paragon, nor does either engage in cache management to speed the calculations.

In fact, each of these considerations applies equally to both the serial and parallel codes, and the standard advice remains true: truly high performance begins with selection of a good serial algorithm, followed by optimization of that algorithm, and only then by parallelization. In the case of the Parallel Fusion Tracker, code improvement and optimization remain as tasks for the future.
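For reference, the kind of "straight C" matrix arithmetic referred to above typically takes the form of the textbook triple loop sketched below. This is an illustrative sketch, not an excerpt from the Fusion Tracker code; it simply shows why such routines, lacking blocking, cache management and optimized library calls, tend to sustain only a few MFLOPS per node.

    /* Illustrative only: a textbook triple-loop matrix multiply of the kind a
     * "straight C" tracker might use.  No blocking, no cache management, and
     * no vendor-optimized library calls. */
    void mat_mult(int n, const double *a, const double *b, double *c)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += a[i * n + k] * b[k * n + j];
                c[i * n + j] = sum;
            }
    }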

A serial data fusion and tracking code has been successfully parallelized through the use of standard MPI communication functions. Apart from debugging, the two major components of effort were 1) reverse engineering of the original code in the absence of detailed code documentation, and 2) creation, declaration and definition of portable MPI derived data types to correspond to each of the complicated data structures present in the original code. The result is a Parallel Fusion Tracker code written in standard C and MPI that will be portable to other High Performance Computing architectures, such as the 384-node Power-PC-based Sky computer recently acquired by AFRL Rome Research Site.
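To illustrate the second item, the sketch below shows the general pattern for declaring a portable MPI derived data type from a C structure, using the hypothetical Detection record sketched earlier; the actual Fusion Tracker structures are more complicated, and this is not code from the delivered program.

    /* Illustrative sketch of building a portable MPI derived data type for a
     * C struct, using the hypothetical Detection record sketched earlier.
     * Older MPI-1 codes would use the equivalent MPI_Type_struct call. */
    #include <mpi.h>
    #include <stddef.h>

    typedef struct {
        int    sensor_id;
        double time_tag;
        double position[3];
        int    quality_flags;
    } Detection;

    MPI_Datatype make_detection_type(void)
    {
        MPI_Datatype dtype;
        int          blocklens[4] = { 1, 1, 3, 1 };
        MPI_Datatype types[4]     = { MPI_INT, MPI_DOUBLE, MPI_DOUBLE, MPI_INT };
        MPI_Aint     displs[4]    = {
            offsetof(Detection, sensor_id),
            offsetof(Detection, time_tag),
            offsetof(Detection, position),
            offsetof(Detection, quality_flags)
        };

        /* Describe the memory layout of the struct, then commit the type so it
         * can be used directly in MPI_Send/MPI_Recv without manual packing. */
        MPI_Type_create_struct(4, blocklens, displs, types, &dtype);
        MPI_Type_commit(&dtype);
        return dtype;
    }

Once committed, such a type lets whole records be sent and received between nodes directly, without manual packing and without concern for padding or alignment differences across platforms, which is what makes this approach portable.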

The Parallel Fusion Tracker was instrumented and detailed data were collected on the latency and computational effort contributed by each of its 15 major functional modules, for numbers of targets and nodes between 1 and 126, in combinations ranging between 1 and 126 targets per node. The relationship between total computational load, represented by the number of targets being tracked, and the resulting latency and throughput (MFLOPS) levels was determined as a function of the number of nodes. Above 4 to 8 targets per node the code was shown to be highly scalable, meaning that latency and number of nodes can be traded against each other at any total computational level. Latency, for example, might therefore be held to a particular value by choosing the number of nodes appropriate to the number of targets to be tracked. The governing functions for both latency and computation per node are shown to be very nearly deterministic functions of a single variable, namely targets per node, over very wide ranges of numbers of targets or of nodes.
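If latency is indeed governed by targets per node alone, then holding latency to a required value reduces to simple node sizing, as in the minimal sketch below. The targets-per-node budget used here is hypothetical; in practice it would be read from the measured latency curves.

    /* Minimal sketch of the node-sizing trade described above.  The
     * targets-per-node budget corresponding to a given latency requirement
     * would come from measured data; the value used here is hypothetical. */
    #include <stdio.h>

    int nodes_required(int num_targets, int targets_per_node_budget)
    {
        /* ceiling division: enough nodes so no node exceeds its budget */
        return (num_targets + targets_per_node_budget - 1) / targets_per_node_budget;
    }

    int main(void)
    {
        int budget = 8;   /* hypothetical targets/node meeting the latency goal */
        printf("126 targets -> %d nodes\n", nodes_required(126, budget));
        return 0;
    }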

Throughputs for individual functions as high as 6 MFLOPS were recorded for Track Propagation, and as high as 257 MFLOPS for the integrated Association and Tracking function of the overall Fusion Tracker when tracking 126 targets. When Input Processing and State Vector Management are included, they contribute more to latency than they do to computation, and overall performance levels are diluted to 5 MFLOPS and 115 MFLOPS, respectively, when tracking 126 targets on 126 nodes. These levels, though low for parallel computation, are nevertheless typical for codes that rely on straightforward C coding without the use of optimized library functions or careful memory management during computation. Code optimization of the baseline serial code and its parallel version remains a task for the future.

Author Contact Information

Charles Pedersen, Ph.D.
AFRL Rome Research Site
26 Electronic Parkway
Rome, NY 13441
[email protected]


Acknowledgement: The author gratefully acknowledges the help of Mr. David Welchons (Integrated Sensors Inc), the creator of the original serial code, who kindly and patiently answered questions about it during this modification to a parallel architecture. The author is also indebted to Mr. Bill Fontana (AFRL), who made the serial code available early in its development cycle in order to permit this parallelization to proceed.

