Volume 7, Number 1 - Grid Computing
By Patricia Kovatch, San Diego Supercomputer Center
"Build and deploy the world's fastest distributed infrastructure for open scientific research:" this is the mission of the TeraGrid project. It is a ~$100M multi-year project, sponsored by the National Science Foundation (NSF). It's the first joint multi-site supercomputing effort featuring over 24 TeraFlops (TF) of initial combined compute power (13 TF of Itanium2), one PetaByte of online storage and a 40 Gigabit per second backbone. Additional sites are in the process of adding their resources to the TeraGrid. A grid infrastructure was developed and deployed to help scientists and researchers make use of these resources. Cosmology, weather and geophysics applications make up the initial allocations on the TeraGrid. Coordinating the efforts to manage the sheer amount of software, operations and user requests has been an interesting challenge.
The NSF vision, as given in the Request For Proposal (RFP) for the TeraGrid project, reads, "NSF seeks to open a pathway to future computing, communications and information environments by creating a very large-scale system that is part of the rapidly expanding computational grid." The goals of the TeraGrid are to "enable new science by offering new capabilities, build an extensible grid that can be grown and copied, and provide an evolutionary pathway for current users." The solicitation specified that the proposals must include distributed, multi-site facilities with single site and "grid enabled" high end compute capabilities connected via ultra high-speed networks. A distributed storage system with both online and archival storage capabilities was required. Remote visualization and a production-level quality of service were necessary components.
Many Major Research Equipment (MRE) projects have common needs: geographically distributed instruments, Terabytes to PetaBytes of data, data unification and sharing between multiple formats and remote sites, high end computation (simulations, analysis and data mining), presentation of results (visualization, virtual reality, etc.). Examples of these projects include:
Considering the needs of these projects, we sought to create a seamless environment for the geographically distributed scientists of the TeraGrid. We decided to minimize differences in hardware and software. We researched and tested ways to make resources available equally from each site. We defined a unified user support operation and we developed an automatic reporting and testing infrastructure to guarantee that a certain service level of resources would be available to our scientists and researchers.
January 1, 2004 marked the first day of production for the
four founding sites of the TeraGrid. To make this happen, these four
diverse sites had to define and implement common goals. Thirteen
working groups were established to flesh out the vision and make detailed
implementation plans. The working groups consisted of Clusters, Grid,
Networks, Performance Evaluation, User Services, Data, Operations,
Applications, Visualization, External Relations, Interoperability, Account
Management, and Security.
TeraGrid Founding Sites: San Diego Supercomputer Center (SDSC), National Computational Science Alliance (NCSA), Argonne National Laboratory (ANL), California Institute of Technology (Caltech)
New Sites: Pittsburgh Supercomputing Center (PSC), Indiana University, Purdue University, Texas Advanced Computing Center (TACC), Oak Ridge National Laboratory (ORNL)
Corporate Partners: IBM, Intel, Qwest, Sun, Myricom, Oracle
The basic TeraGrid software stack consists of:
Many other applications, such as the Storage Resource Broker (SRB), help the scientists manage the data sets for their research. In total, there are over 100 separate pieces of software. Keeping them all up-to-date and interoperable across many different sites can be unwieldy, but the TeraGrid employs several techniques to manage them.
Use of a Global Software Repository Each site has representatives responsible for parts of the software stack. They are tasked with watching for updates to the software (including security updates), testing those updates, and checking the software, accompanying configuration files and build instructions into the global software repository. Other sites then check the programs and files out of the global software repository and install them locally. TeraGrid currently uses an open source repository tool called Concurrent Versions System (CVS) to manage checking all of our software, from the kernel to applications, in and out. Version changes are typically applied during the regularly coordinated, scheduled (weekly) Preventive Maintenance (PM) periods, which are staggered between the sites so that not all resources are unavailable at the same time. This scheduling strategy allows software (and hardware) upgrades to happen in a timely manner. Updates to the software stack due to security concerns, such as root exploits, happen immediately. Policies and procedures are in place to prevent, protect against, detect and handle security incidents.
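As a rough illustration of the check-out step, the sketch below pulls a software-stack module from a shared CVS repository during a maintenance window. The CVSROOT, module name and install path are placeholders, not the project's actual repository coordinates.

```python
#!/usr/bin/env python
"""Sketch: pull a software-stack module from a shared CVS repository.

The CVSROOT, module name and working directory below are placeholders,
not the TeraGrid's actual repository coordinates.
"""
import os
import subprocess

CVSROOT = ":pserver:[email protected]:/cvsroot"  # hypothetical
MODULE = "tg-stack"                                      # hypothetical module name
WORKDIR = "/usr/local/teragrid"                          # hypothetical install path

def update_stack():
    """Check the module out on first use; run 'cvs update' thereafter."""
    if not os.path.isdir(WORKDIR):
        os.makedirs(WORKDIR)
    checkout_dir = os.path.join(WORKDIR, MODULE)
    if not os.path.isdir(checkout_dir):
        subprocess.check_call(["cvs", "-d", CVSROOT, "checkout", MODULE], cwd=WORKDIR)
    else:
        subprocess.check_call(["cvs", "-q", "update", "-d", "-P"], cwd=checkout_dir)

if __name__ == "__main__":
    update_stack()
```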
Maintenance of Nodes The nodes are maintained using a standard suite of tools for automatic, repeatable installation and updating of software, automatic parsing of the logs, and so on. Being able to update and check the software on all the nodes (as well as install the nodes themselves) is critical to maintaining systems with many nodes.
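The log parsing mentioned above can be as simple as sweeping copies of each node's system log for suspicious keywords. The sketch below is a minimal, hypothetical illustration; the directory layout and keyword list are assumptions, not the tools the TeraGrid sites actually run.

```python
#!/usr/bin/env python
"""Sketch: scan per-node log copies for error keywords.

The directory layout (/var/log/nodes/<node>/messages) and the keyword
list are illustrative assumptions, not the sites' actual tooling.
"""
import glob
import re

PATTERN = re.compile(r"(error|fail|panic|ecc)", re.IGNORECASE)

def scan_logs(logdir="/var/log/nodes"):
    """Return a map of node name -> list of suspicious log lines."""
    problems = {}
    for path in glob.glob("%s/*/messages" % logdir):
        node = path.split("/")[-2]
        with open(path) as f:
            hits = [line.rstrip() for line in f if PATTERN.search(line)]
        if hits:
            problems[node] = hits
    return problems

if __name__ == "__main__":
    for node, lines in sorted(scan_logs().items()):
        print("%s: %d suspicious lines" % (node, len(lines)))
```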
Consistent User Environment To abstract away the local dependencies at each site, agreed upon environmental variables are used to make users' scripts portable. This enables users to experience a consistent environment regardless of which site is accessed, an attribute called TeraGrid "roaming". There are more than 20 variables defined for scratch and parallel filesystems, library paths, MPICH, etc. The software that manages setting the user environment variables is called Softenv. More information about Softenv is available at http://www-unix.mcs.anl.gov/systems/software/msys/ .
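As a concrete illustration of roaming, a user script can reach the local scratch filesystem through an agreed-upon variable instead of a site-specific path. The sketch below assumes a variable named TG_CLUSTER_SCRATCH; treat the exact variable names as examples and consult the Softenv-provided environment for the real ones.

```python
#!/usr/bin/env python
"""Sketch: a portable staging step that relies on an agreed-upon
environment variable rather than a site-specific path.

TG_CLUSTER_SCRATCH is used here as an example variable name; the exact
names are assumptions and come from the Softenv-managed environment.
"""
import os
import shutil

def stage_input(filename):
    """Copy an input file into the local scratch area, wherever that is."""
    scratch = os.environ.get("TG_CLUSTER_SCRATCH")
    if scratch is None:
        raise RuntimeError("TG_CLUSTER_SCRATCH not set; load the TeraGrid Softenv keys")
    dest = os.path.join(scratch, os.path.basename(filename))
    shutil.copy(filename, dest)  # the same call works unchanged at every site
    return dest

if __name__ == "__main__":
    print(stage_input("input.dat"))
```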
Sophisticated Monitoring Tools A single "at-a-glance" view of each site's software versions and basic software functionality lets everyone see the status of the system immediately. This "Inca Test Harness and Reporting Framework" is a flexible software infrastructure for automated testing, verification and monitoring of the TeraGrid software stack and environment. Inca is composed of a set of "reporters" that interact with the system and report status information, a "harness" that provides basic control of the reporters and collection of their information, and "clients" that provide a web interface to the information collected by the reporters. Currently, over 900 pieces of information about the software and environment are checked.
A reporter is a self-contained, pluggable component implemented as a script or executable that performs a test, benchmark or query and outputs a result in XML. For example, a reporter can output package version information or test whether a grid service is up and available. Reporters can be written in any language, and Perl and Python API helper libraries are available to ease the process of writing them. Currently, there are more than 100 reporters written with these APIs. The execution details of the reporters (frequency, inputs, etc.) are handled by the harness. The harness contains a set of daemons that manage the distributed execution of the reporters, collect the data to a central location and publish the data into an information service such as MDS2. Clients compare the resource information collected by Inca against a representation of the software stack and environment and display it on a TeraGrid web status page. A version of the Inca Test Harness and Reporting Framework is available for download at http://tech.teragrid.org/inca/.
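For a feel of what a reporter does, the following minimal sketch queries a tool's version and prints the result as XML. It deliberately avoids the real Inca Perl/Python reporter APIs, so the element names and output layout are illustrative only.

```python
#!/usr/bin/env python
"""Sketch: a reporter-style check that queries a package version and
emits the result as XML.

This does not use the real Inca reporter API; the element names below
are illustrative, not the framework's actual schema.
"""
import subprocess
import sys
from xml.sax.saxutils import escape

def report_version(command, version_args):
    """Run a version query and print a small XML report of the outcome."""
    try:
        out = subprocess.check_output([command] + version_args,
                                      stderr=subprocess.STDOUT)
        body = "<version>%s</version>" % escape(out.decode().strip())
        status = "success"
    except Exception as exc:
        body = "<errorMessage>%s</errorMessage>" % escape(str(exc))
        status = "failure"
    print('<report name="%s" status="%s">%s</report>' % (command, status, body))

if __name__ == "__main__":
    # Query whichever tool is named on the command line (hypothetical default shown).
    tool = sys.argv[1] if len(sys.argv) > 1 else "globus-version"
    report_version(tool, [])
```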
Software Change Request Process Anyone can submit a request to add or update software to the TeraGrid software stack by submitting a software change request. An organized plan for testing, deployment and integration of the software is developed and implemented. Test nodes at each site are connected and help with the interoperability testing.
Use of the Globus Toolkit The toolkit helps create a geographically distributed infrastructure between the sites. Gx-map and CACL make it easier to manage this infrastructure. CACL is an OpenSSL-based Certificate Authority (CA) CLient system that issues digital certificates. It automatically authenticates the identity of the user with a username and password. The encrypted request is submitted to the CA daemon, which decrypts it and authenticates the user in the same way that ordinary login authentication is done. It then issues either a certificate or a rejection notice to the CA CLient program. If the certificate is issued, it is automatically placed in the user's home directory. There is a program the CA administrator can use to revoke certificates and create server or service certificates. These commands can only be run by a system administrator on the machine that provides the CA services. More information is available at http://www.sdsc.edu/CA.
Gx-map allows users to add their certificate automatically to the
grid-mapfile. This file maps a specified Distinguished Name (DN) to a
Unix account name. Gx-map propagates these changes automatically
between systems. An auxiliary tool called
gx-check-cacl-index automatically checks for new user certificates issued
or revoked by CACL and invokes the
gx-map command to request the appropriate updates. With
these systems, a user can obtain a digital certificate and update the
grid-mapfile entries on a number of systems
without manual intervention from systems administrators. This software can
be found at http://users.sdsc.edu/~kst/gx-map/.
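To make the mapping concrete, each grid-mapfile line pairs a quoted certificate DN with a local account name. The sketch below appends such an entry by hand purely for illustration; the DN and account shown are invented, and on the TeraGrid gx-map manages these entries automatically.

```python
#!/usr/bin/env python
"""Sketch: what a grid-mapfile entry looks like and how one could be appended.

The DN and account name below are invented examples; on the TeraGrid,
gx-map maintains these entries automatically and the file is not
normally edited by hand.
"""

GRID_MAPFILE = "/etc/grid-security/grid-mapfile"  # standard Globus location

def add_mapping(dn, account, mapfile=GRID_MAPFILE):
    """Append one DN-to-account mapping line to the grid-mapfile."""
    entry = '"%s" %s\n' % (dn, account)
    with open(mapfile, "a") as f:
        f.write(entry)
    return entry

if __name__ == "__main__":
    # Hypothetical user: map a certificate DN to a Unix login name.
    print(add_mapping("/C=US/O=NPACI/OU=SDSC/CN=Jane Doe", "jdoe").strip())
```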
To make it easy for scientists and researchers to get started on
the TeraGrid, there is one location that gives information about
getting accounts and allocations: http://teragrid.org/docs/guide_access.html.
Account requests funnel to the national,
peer-reviewed allocation committees. The National Resource Allocation
Committee (NRAC), Partnership Resource Allocation Committee (PRAC)
and
Alliance Allocations Board (AAB) meet several times a year and
allocate over ten million Service Units per year. A Service Unit (SU) is
approximately one Central Processing Unit (CPU) hour. Most allocations last for one
year, though multi-year allocations are now available to scientists with
multi-year projects. Start-up allocations can be granted by the
Development Allocations Committee (DAC) for up to 30,000 SUs. Once allocations
are approved, the project and account notifications are transported
between the remote sites and the TeraGrid Central Database (PostgreSQL)
located at SDSC via the Account Management Information Exchange
(AMIE) system. AMIE uses an XML format to
facilitate import and export between each site's local database and the central
database. Soon Perl scripts will automatically transport and accept account
addition and deletion requests and allocation usage statistics from SDSC to the
other sites.
When a user submits a job to a TeraGrid machine, a wrapper script around the job submission commands (PBS or Globus) checks to see whether the account has an allocation. The job runs if the account has enough allocation to fulfill the job request. After the job completes, the job information is logged. Then the TGAccounting package, developed at SDSC, converts the resource manager (PBS) and Globus logs into an XML format compatible with the Global Grid Forum (GGF) Usage Record Format and National Middleware Initiative accounting protocols. These scripts run nightly. The XML is then parsed and imported into records in the local site's database and finally transported via AMIE to the TeraGrid Central Database where the allocation is deducted. Soon the allocations will be fungible between sites.
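As a rough illustration of this bookkeeping, the sketch below computes a job's SU charge (treating one SU as one CPU-hour) and performs the wrapper-style check against the remaining allocation. The function and field names are hypothetical and are not part of the TGAccounting package or AMIE.

```python
#!/usr/bin/env python
"""Sketch: simplified allocation bookkeeping for a TeraGrid-style job.

One SU is treated here as one CPU-hour. The function names are
hypothetical illustrations, not the TGAccounting or AMIE interfaces.
"""

def su_charge(cpus, wall_hours):
    """Service Units consumed by a job: CPUs times wall-clock hours."""
    return cpus * wall_hours

def can_submit(remaining_sus, cpus, requested_hours):
    """Mimic the wrapper-script check: run only if the allocation covers the request."""
    return su_charge(cpus, requested_hours) <= remaining_sus

if __name__ == "__main__":
    # A 128-CPU job requesting 10 hours needs 1,280 SUs.
    print(su_charge(128, 10))          # -> 1280
    # A 30,000 SU start-up allocation easily covers that request.
    print(can_submit(30000, 128, 10))  # -> True
```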
Though the TeraGrid resources span several sites, the TeraGrid Help Desk appears as one entity to the scientists and researchers who use it. There is one phone number, one e-mail address ([email protected]) and one website (http://www.teragrid.org) presented to the users. To support this, there is one ticket system that accepts the trouble tickets via e-mail or web input. This ticket system has been specially tailored by NCSA to quantify and track tickets for the TeraGrid. It automatically notifies the representatives from each site when new tickets arrive. The tickets can be assigned to a specific site or all sites depending on the request in the ticket. Site representatives update tickets via a webpage interface to the ticket system database.
Both NCSA and SDSC have round-the-clock operations centers and trade off the monitoring to provide coverage on a 24x7 basis. These centers have procedures for contacting people during off-hours for support as needed.
Each TeraGrid site has an operations center with monitors dedicated to displaying the status of the machines. Clumon, software developed at NCSA, gathers performance metrics from the nodes and correlates them with queue and job information from the resource manager and scheduler. It provides a near-real-time, graphical display of the health of the system and the processes running on it. Visit http://clumon.ncsa.uiuc.edu for more information. Another monitoring tool, Ganglia, is also used.
Each site brings different expertise to the TeraGrid, and each has provided critical insight, ideas and implementation effort, along with essential software development, testing and troubleshooting. The sites have worked closely together on the design and implementation of the TeraGrid, facilitated by weekly conference calls and periodic face-to-face meetings.
Research and demonstration of new technologies that improve upon the current implementation are continually being conducted. Specific areas include automatic scheduling of resources at multiple sites (metascheduling), sharing home directories over the wide area (parallel filesystems mounted over WAN) and remote manipulation of storage facilities (Fibre Channel over IP protocol). In all of these cases, prototype programs have been demonstrated at the Supercomputing conferences. For more information on these demonstrations, please visit http://www.sdsc.edu/~pkovatch/new-tech .
As new sites join, new resources become available and new interoperability and scalability issues are identified. Growth will continue to provide new challenges and rewards to the scientists and researchers making use of the ever-expanding TeraGrid.
Patricia Kovatch is the Manager of the High Performance Computing Production Archival Storage Systems at the San Diego Supercomputer Center (SDSC). She is SDSC's lead for the Cluster and Grid Working Group. She also serves on the Data and Performance Evaluation Working Groups. She has worked on 512-processor commodity clusters with Myrinet and currently works on Itanium2 clusters for the TeraGrid project.
Patricia Kovatch
Manager, HPC Systems and Archival Storage
San Diego Supercomputer Center
PH: (858)822-5441
E-mail: [email protected]