Centralized Network Management Aids NASA
By Jon William Toigo
The value of network management systems in the government market can be hard to quantify, but not when it comes to two NASA centers, where such systems are vitally important to today's high-profile space missions.
Without effective network management at NASA's Marshall Space Flight Center in Huntsville, Ala., and Kennedy Space Center in Cape Canaveral, Fla., the nation's fleet of space shuttles might as well stay on the launch pad.
"Supporting payload operations means that we plan the use of power resources, crew time, video downlinks, command uplinks and other scheduled events," said Keith Cornett, chief of the Mission Systems Division at Marshall.
His NASA facility hosts the Payload Operations Integration Center, which will coordinate the scientific and commercial payload operations for the space station now being assembled in low Earth orbit by the United States and its international partners.
The payload operations center will be in testing mode until the space station begins early operations in April 2000. Plans call for the space station to be constructed through 44 space launches between November 1998 and 2004.
The requirements analysis process for Marshall's payload operations center support network began in 1990. Receiving data from experiments conducted aboard the space shuttle and space station and directing it elsewhere requires a reliable, fault-tolerant network, Cornett said.
Marshall's Payload Operations Integration Center (POIC) is slated to come online this March, he said. Data from the space shuttles and space station will be transmitted beginning in June to this center via the NASA Tracking and Data Relay Satellite System.
Front-end processors at the center will take the initial feed from the satellites, roughly equivalent to 40 to 50 T1 facilities, and multicast the data onto the dual-ring Fiber Distributed Data Interface (FDDI) network backbone at Marshall.
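To put the "40 to 50 T1 facilities" figure in perspective, the short calculation below converts it to an aggregate line rate and compares it with one FDDI ring. It is a back-of-the-envelope sketch only: it assumes the standard T1 line rate of 1.544 Mbps and the nominal 100 Mbps FDDI rate, and it ignores framing and protocol overhead. The function and constant names are illustrative.

```python
# Rough sizing of the satellite feed described above.
T1_MBPS = 1.544          # standard T1 line rate, in megabits per second
FDDI_RING_MBPS = 100.0   # nominal FDDI ring rate, in megabits per second


def aggregate_mbps(t1_count: int) -> float:
    """Return the combined line rate of t1_count T1 circuits in Mbps."""
    return t1_count * T1_MBPS


low = aggregate_mbps(40)
high = aggregate_mbps(50)

print(f"40 T1 circuits: {low:.2f} Mbps")
print(f"50 T1 circuits: {high:.2f} Mbps")
print(
    "Share of one FDDI ring: "
    f"{low / FDDI_RING_MBPS:.0%} to {high / FDDI_RING_MBPS:.0%}"
)
```

Under these assumptions the feed works out to roughly 62 to 77 Mbps, a large share of a single ring's nominal capacity.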
The network initially will serve 170 seats at POIC, but eventually will be expanded to support 400 to 500 users. The network will connect to other NASA centers, universities, commercial sites and the control centers of global partners on the space station project, including the European and Japanese space agencies.
The POIC offers the latest in network and processing capabilities and provides a comfortable work space for agency and civilian personnel who are directly involved with the missions, Cornett said. It has about 170 workstations and approximately 25 servers, all running Unix, for use in processing the data that is received on experiments and other payload missions.
An FDDI network, running TCP/IP, handles the delivery of data traffic through this subnetwork, Cornett said. The facility uses Silicon Graphics Indy workstations and Challenge L, Challenge XL and Origin 2000 servers.
Cisco Systems 7500 and 7000 routers and Cabletron Systems intelligent hubs have been used to establish three subnetworks.
Planning for network availability began during the early design phase, when every effort was made to guard against single points of failure, Cornett said.
"We have redundant FDDI rings with automatic failover. We implemented backup routers and use dual-homed servers so that if a transceiver goes down, the servers will continue to be a part of the ring. Every network component needed to have redundancy because we believed that excellent network management begins with design," he said.
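The payoff from duplicating components can be illustrated with a simple availability calculation. This is a textbook sketch, not a figure from Marshall: it assumes component failures are independent and failover is instantaneous, and the 99 percent single-component availability is a made-up example value.

```python
def parallel_availability(single: float, copies: int = 2) -> float:
    """Availability of `copies` redundant components.

    Assumes independent failures and instantaneous failover, so the
    group is down only when every copy is down at the same time.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - single) ** copies


def serial_availability(*components: float) -> float:
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


# Hypothetical example: one router at 99% versus a redundant pair.
single_router = 0.99
router_pair = parallel_availability(single_router, copies=2)
print(f"single router: {single_router:.4f}")
print(f"redundant pair: {router_pair:.4f}")
```

The serial function shows the other side of the argument: any non-redundant device in the path multiplies its own downtime into the whole chain, which is why single points of failure were a design concern.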
Cornett also insisted on the selection of network products, such as Cabletron hubs, that offered hot swappable components and cards, so that failed components could be changed out while systems and networks remained online and redundant components carried the load. Despite these safeguards, he conceded that some components of the network lack the built-in redundancy he would have preferred.
"We have some digital switches, specially designed to handle serial data flows, that are somewhat antiquated. In a few other cases, vendors haven't moved their equipment ahead on the manageability curve to deliver all of the resilience we wanted."
Concerned about potential problem areas in the carefully designed network, Cornett's team determined early on that another criterion in network device selection was native support for centralized management through Simple Network Management Protocol agents. They also sought products that featured diagnostic and troubleshooting capabilities to enable the quick resolution of outages.
"Where this type of product was unavailable," Cornett said, "we built fault isolation capabilities ourselves, sometimes by strapping a PC to the device for error logging and reporting."
While provisions for fault tolerance in network design are part of the solution for resilient network operations, the Marshall team also needed to monitor and manage the network in real time. "That is the job for our personnel in the data operations control room," Cornett said.
Pattie Sanderson, the POIC systems manager, said effective management means deploying tools to monitor system and network events that can be used effectively by a small staff.
"We have three network computer engineers responsible for overall network design, three network systems engineers who are systems managers, and two operator positions, basically 16 persons over five shifts, responsible for network and systems management," Sanderson said.
Marshall has deployed Cabletron Systems' Spectrum network management system to manage Cabletron hubs, Cisco Systems routers, and Silicon Graphics workstations and servers. Harris Corp. of Melbourne, Fla., provided its Harris Network Management product to manage components used for data, voice and video communications.
"We have also deployed various RMON [remote monitoring standard] probes to monitor network segment performance," Sanderson added.
Network Associates' Sniffer software products are used to monitor the FDDI network's subnet performance and to aid in fault isolation. Empire Technologies' SystemEdge Management Agents are used to enhance Spectrum's management of Cisco equipment.
The multiplicity of network management products reflects a fact of managing complex network environments, Cornett said.
"To say that we use a single network management system would make a nice story, but a single product containing all of the functionality we require for monitoring and troubleshooting our networks and end stations just doesn't exist," he said.
Fortunately, the network management systems in use predate by several years the rollout of the POIC, Cornett said. Operators have become skilled in their use, which relieves pressure to have more staff.
"Between the design elements in our network and the effective management capabilities we have in place, we could sustain a failure and resolve it within a normal work week without losing our operational capability," he said.
The bottom line is that centralized network management enables a small staff to manage a large number of devices effectively, he said.
While NASA's Marshall center prefers to manage its own networks, the Kennedy Space Center contracts out network management to a third-party contractor, United Space Alliance.
Headquartered in Houston, United Space Alliance is a joint venture between the Boeing Co. and Lockheed Martin Corp. NASA awarded the company a contract valued at $19 billion in 1996 to manage space shuttle flight operations.
United Space Alliance's network support group is responsible for providing network infrastructure support for NASA space flight operations at Kennedy. The purview of the network support team's staff of 35 includes assuring LAN services to 5,000 NASA and contractor desktop users engaged in shuttle launch processing.
In addition, the team manages the Kennedy network backbone, three 100 Mbps FDDI rings bridged with Nortel/Bay routers. In this role, the network support group serves some 9,000 users across all three functional areas: launch, payload, and base operations.
A wide array of routing and switching devices is used to support users located across Kennedy Space Center's 90 facilities, including permanent buildings and trailers. Fifteen Cisco routers manage traffic routing and prioritization at the top level. One layer below, the network incorporates 100 Mbps Fast Ethernet and 10 Mbps Ethernet switching hubs, as well as wiring concentrators from 3Com, Cabletron, Chipcom and Fibermux.
"We are continually uploading and downloading information to and from the backbone on this complex, multi-tier network," said Matthew Guessetto, a network engineer in the network support group.
"The networks and connections in our charge are mission critical in every sense of the word," he said. Availability and performance have a major impact on the productivity of NASA and supporting contractor personnel.
And while the network is growing in scope and importance, Guessetto said, support budgets are not scaling in parallel with the growth of the network. That means his team must manage more with less.
In this environment, he said, the ability to extend the knowledge of a limited number of expert personnel across the network is central to success. "Leading-edge diagnostic and management/repair technologies play a critical role in ensuring the success of our team," Guessetto said.
"Currently, we are running three types of Ethernet to support desktop users: shared 10 Mbps Ethernet, switched 10 Mbps Ethernet, and switched 100 Mbps Fast Ethernet. The system also incorporates an array of Token Ring 4 and 16 Mbps hubs."
The network support group operates 20 distributed and portable Network Associates Sniffer Total Network Visibility packet analysis tools to gain visibility into Ethernet, Fast Ethernet, Token Ring, and FDDI subnets. The Sniffers operate in both desktop-to-backbone and backbone configurations.
In addition, the group uses OpenView, Hewlett-Packard Co.'s system management framework. Sniffer and OpenView provide complementary functionality, he said.
"In normal circumstances, only small segments of the subnetwork are affected by a problem," Guessetto said. In such cases, the team can analyze the issue using the Distributed Sniffer. Frequently, it will fix the problem from the HP OpenView console.
On the flip side, HP OpenView "alerts us of major connectivity [breaks] in real time. Where the network is not available, we are denied access to the distributed Sniffer to diagnose the problem. In these circumstances, we dispatch a technician with a portable Sniffer running on a laptop to start testing the network. The technician works his way back from the furthest area without service to the closest point of service to isolate the problem."
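The walk-back procedure Guessetto describes amounts to a linear search along the path from the far end toward the service point. The sketch below is illustrative only: the device names, the `is_reachable` probe and the function name are hypothetical stand-ins for whatever test the technician runs at each stop, and the path is assumed to be a simple chain.

```python
from typing import Callable, Optional, Sequence, Tuple


def isolate_break(
    path: Sequence[str],
    is_reachable: Callable[[str], bool],
) -> Optional[Tuple[Optional[str], str]]:
    """Locate a connectivity break on a simple chain of devices.

    `path` is ordered from the closest point of service to the furthest
    area. The search starts at the far end and works back until it finds
    a device that still has service.

    Returns (last_good, first_bad): the break lies between them.
    Returns None if the furthest device already has service.
    last_good is None if no device on the path has service.
    """
    first_bad: Optional[str] = None
    for node in reversed(path):
        if is_reachable(node):
            if first_bad is None:
                return None  # far end is fine; nothing to isolate
            return (node, first_bad)
        first_bad = node
    if first_bad is None:
        return None  # empty path
    return (None, first_bad)


# Hypothetical topology and outage, for illustration only.
path = ["backbone-router", "building-switch", "floor-hub", "trailer-hub"]
without_service = {"floor-hub", "trailer-hub"}

result = isolate_break(path, lambda node: node not in without_service)
print(result)
```

In this made-up case the break is narrowed to the link or device between `building-switch` and `floor-hub`. On a long path a binary search would need fewer stops, but a linear walk matches the procedure as described.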
Problems with the network can develop from numerous sources, including user errors, desktop configurations, applications, and hardware issues, Guessetto said. Any of these can lead to connectivity failure or poor performance. Subnetwork problems also may be a sign of problems with the backbone infrastructure, compounding the complexity of trouble-shooting, he noted.
"In many cases, when users experience challenges, they immediately presume that there is a problem on the network," he said. "Challenges range from slow performance, to application time outs, to failure to see certain devices, to complete failure."
"In many circumstances, the problems may result from desktop configuration problems or from groups loading 'chatty' applications that are not suited for organizational use."
Guessetto said his group leverages Network Associates Sniffer diagnostics in conjunction with HP OpenView's big-picture monitoring and trouble-shooting capabilities for proactive network management.
"If OpenView detects slow or no response from a networked device, it automatically generates an alarm, and the console operator generates a trouble ticket. Alternatively, we may receive notification from a user, in which case we log the issue and generate a manual trouble ticket," Guessetto said.
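The two ticket paths in that description, one triggered by monitoring and one by a user report, can be sketched as follows. This is not HP OpenView's interface; the `Ticket` record, the function names and the 500-millisecond threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed threshold; a real site would tune this per device class.
SLOW_THRESHOLD_MS = 500.0


@dataclass
class Ticket:
    device: str
    severity: str
    source: str   # "monitor" or "user"
    detail: str


def ticket_from_poll(device: str, response_ms: Optional[float]) -> Optional[Ticket]:
    """Turn one poll result into a ticket, or None if the device is healthy.

    response_ms is None when the device did not answer at all.
    """
    if response_ms is None:
        return Ticket(device, "critical", "monitor", "no response")
    if response_ms > SLOW_THRESHOLD_MS:
        detail = f"slow response: {response_ms:.0f} ms"
        return Ticket(device, "warning", "monitor", detail)
    return None


def ticket_from_user(device: str, complaint: str) -> Ticket:
    """Log a user-reported issue as a manually generated ticket."""
    return Ticket(device, "unclassified", "user", complaint)


# Hypothetical poll results and one user call.
for name, ms in [("hub-a", 12.0), ("hub-b", 900.0), ("hub-c", None)]:
    print(name, ticket_from_poll(name, ms))
print(ticket_from_user("pc-17", "application times out"))
```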
Trouble-shooters then turn to the Distributed Sniffers that run full time on the network to diagnose potential problem causes. In most cases, he said, problems do not bring down the whole network. The Distributed Sniffers can be used to isolate faults quickly and resolve user trouble tickets.
"The Sniffers allow us to employ more than 385 protocol decodes to analyze the conversations between devices at the packet level. The Sniffer isolates problems by analyzing network traffic to objectify whether data is leaving a machine correctly," Guessetto said. "We can diagnose problems remotely from the console, which allows us to make best value decisions on whether to send a technician to the location."
He added that the Sniffers can be managed through a centralized console, providing an end-to-end view of what is happening between machines experiencing a communication problem.
For example, the group recently investigated a slow-response situation and ran the Sniffer between server and client to analyze the conversation and determine the problem. The issue was caused by a team's decision to run a local area network application across the wide area network. The software's small packet communication approach led to significant network latency.
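A rough calculation shows why a small-packet, request-and-wait application suffers on a wide area link. The numbers below are invented for illustration and are not from the Kennedy case: 2,000 sequential exchanges, a 1 ms round trip on the LAN and a 50 ms round trip on the WAN. The model counts only round-trip waiting and ignores bandwidth and server time.

```python
def round_trip_wait_s(exchanges: int, rtt_ms: float) -> float:
    """Seconds spent waiting on the network when each small request
    must be answered before the next one is sent."""
    return exchanges * rtt_ms / 1000.0


# Illustrative assumptions, not measurements.
EXCHANGES = 2000
LAN_RTT_MS = 1.0
WAN_RTT_MS = 50.0

lan_wait = round_trip_wait_s(EXCHANGES, LAN_RTT_MS)
wan_wait = round_trip_wait_s(EXCHANGES, WAN_RTT_MS)

print(f"LAN: {lan_wait:.1f} s waiting on round trips")
print(f"WAN: {wan_wait:.1f} s waiting on round trips")
```

With these assumed values the same operation goes from about 2 seconds of waiting to about 100 seconds, even though no more data is sent. Adding bandwidth would not help much; the delay comes from the number of round trips.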
Trouble-shooters diagnosed the problem remotely from the Sniffer console and explained the challenge to the user.
In other cases, the Sniffer may show that a problem is caused by insufficient server processing capacity relative to application load in a specific network sector, Guessetto said. The Sniffer also enables the group to identify duplicate IP addresses, bad network interface cards and cabling issues.
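Duplicate IP addresses are one of the simpler faults to spot in captured traffic: the same IP address shows up with more than one hardware (MAC) address. The sketch below illustrates the idea on a list of observed (IP, MAC) pairs. It is not how the Sniffer product works internally, the addresses are made-up examples, and some legitimate setups, such as failover pairs, can also produce this pattern, so a hit is a lead rather than a verdict.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def find_duplicate_ips(
    observations: Iterable[Tuple[str, str]],
) -> Dict[str, Set[str]]:
    """Return IP addresses that were seen with more than one MAC address.

    `observations` is an iterable of (ip, mac) pairs taken from captured
    traffic. MAC addresses are compared case-insensitively.
    """
    macs_by_ip: Dict[str, Set[str]] = defaultdict(set)
    for ip, mac in observations:
        macs_by_ip[ip].add(mac.lower())
    return {ip: macs for ip, macs in macs_by_ip.items() if len(macs) > 1}


# Made-up observations using documentation-range addresses.
seen = [
    ("192.0.2.10", "00:00:5E:00:53:01"),
    ("192.0.2.11", "00:00:5e:00:53:02"),
    ("192.0.2.10", "00:00:5e:00:53:01"),  # same host, different letter case
    ("192.0.2.11", "00:00:5e:00:53:03"),  # second device claiming .11
]

duplicates = find_duplicate_ips(seen)
for ip, macs in sorted(duplicates.items()):
    print(ip, sorted(macs))
```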
He credited the product with aiding in the resolution of "flapping" and "looping" issues, routing problems sometimes brought about by the expansion and reconfiguration of networks.
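Flapping, in general terms, means a route or link that keeps switching between available and unavailable. One simple way to flag it is to count state changes inside a sliding time window, as in the sketch below. The window length, the change limit, the event format and the function name are assumptions chosen for illustration and do not describe any particular product.

```python
from typing import Sequence, Tuple


def is_flapping(
    events: Sequence[Tuple[float, str]],
    window_s: float = 60.0,
    max_changes: int = 3,
) -> bool:
    """Return True if a route changes state too often.

    `events` is a time-ordered sequence of (timestamp_seconds, state)
    pairs, where state is "up" or "down". The route is considered to be
    flapping when more than `max_changes` state changes fall within any
    span of `window_s` seconds.
    """
    # Timestamps at which the state differs from the previous event.
    change_times = [
        t for (t, state), (_, prev) in zip(events[1:], events) if state != prev
    ]
    start = 0
    for end in range(len(change_times)):
        # Shrink the window from the left until it spans at most window_s.
        while change_times[end] - change_times[start] > window_s:
            start += 1
        if end - start + 1 > max_changes:
            return True
    return False


# Hypothetical event logs.
unstable = [(0, "up"), (5, "down"), (10, "up"), (15, "down"), (20, "up")]
stable = [(0, "up"), (100, "down"), (400, "up")]

print(is_flapping(unstable))
print(is_flapping(stable))
```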