Convenient fault diagnosis with LLDP
30 September 2010
This rule-based ‘expert system’ helps to locate network misconfiguration errors.
In the field of industrial automation technology, network infrastructure devices based on Ethernet provide a wide range of functionality and support many different network protocols, for instance to implement fail-safe (redundant) connections and precise time synchronisation or ensure short response times.
This wide functional range and the resulting configuration settings always entail a risk of configuration mismatches between devices within a given network topology, which may have adverse effects on the availability and processing speed of the network.
Why configuration error detection?
Largely out of uncertainty, network providers sometimes pay for expensive service calls by a device manufacturer merely to have their network and device configuration checked for proper settings. These service calls also become necessary more often than expected, simply because quite trivial configuration errors were overlooked during initial system setup, although the required information was in principle available. It would therefore be of great benefit to customers to have a tool that conveniently detects and visualises such errors.
To address this usability issue, the software delivered for Hirschmann network infrastructure devices now includes an extended LLDP (Link Layer Discovery Protocol) implementation as well as network management software (as of Version 04.2.00), which gives users a convenient means to quickly identify and correct misconfigurations of various higher-layer protocols in networks that include Hirschmann devices. The following sections illustrate the basic principles and mechanisms used to this end, as well as the benefits and implications for network management.
Network management: state of the art
The ISO has developed the FCAPS model, which describes the functional tasks of network management:
* (F) Fault Management: recognise, log, signal and correct error conditions
* (C) Configuration Management: record all components (configuration items) to be monitored
* (A) Accounting Management: record the usage of the network for billing purposes
* (P) Performance Management: collect and analyse traffic/performance data, set thresholds
* (S) Security Management: authenticate users, authorise access and use
Configuration Management is fully supported by most of the network management tools on the market.
In these tools, however, Fault Management is usually limited to recognising and visualising dynamic deteriorations in the monitored network. To this end, most network management systems use, for instance, the Simple Network Management Protocol (SNMP): the monitored devices send “traps” to report condition changes. However, these traps are event-driven and are triggered only when a specific preconfigured condition or event occurs (for instance, when a configured threshold for the enclosure temperature of a device is exceeded).
On the other hand, if a misconfiguration is static by nature and does not generate a change of condition or recordable event, these faults can hardly be detected with the tools currently used: they simply do not cover the recognition of static configuration errors affecting the device functionality. The main goal of the Hirschmann solution is to eliminate this lack of static configuration analysis.
The computer as expert?
As a general rule, a static configuration analysis can be performed by a human being with the help of a “traditional” network management system such as Hirschmann Industrial HiVision or Nagios. To do so, this person needs to check the configuration of individual devices against patterns that are known to be valid and error-free.
However, in many cases the sheer size of the network makes it virtually impossible to keep track of all details. Essentially, this has to do with the psychology of perception: human beings are able to process only a limited amount of information. When the quantity of data becomes too large, we begin to select and, in so doing, run the risk of filtering out important information as well. Computer-implemented configuration error detection is not subject to such limits of perception, but always works with the same consistent reliability. To benefit from this advantage, we need to find a way to reproduce human expertise in software. What we are looking for is a solution that provides the knowledge of a human expert – e.g. a service technician who is able to verify the configuration – in software, with consistent reliability and at the push of a button. This is why this kind of software is often referred to as an “expert system”.
To this end, an analysis was performed in the context of a thesis to identify the functionalities typically implemented in network components that provide the potential for error detection. The findings of this analysis were then used to derive the sets of rules needed by a misconfiguration detection expert system.
Challenges of configuration error detection
Before an expert system can be implemented, a number of basic requirements have to be fulfilled. Two key issues were identified as fundamental challenges in the development of an application for the detection of configuration errors (see Figure 1):
1. It is necessary to define the data base required for the detection of configuration errors as well as a taxonomy for the analysis of the data. Also, one has to take into account that the context, i.e. the perspective from which the network is examined and the data is taken, depends on the area of operations and field of application and has to be changed accordingly. In addition, the protocols and procedures used to collect the data need to be adapted to both physical and logical requirements:
This graph shows three contexts. First, the strictly local case where only one local device is examined in isolation; second, the context of a direct (physical) connection between two neighboring infrastructure devices; and finally, these two devices as part of a higher-level context, i.e. the logical overall topology or simply the logical context.
For each of these typical contexts, the protocols used for data acquisition and the basic procedures may be quite different, as, for instance, a transport protocol for information exchange at the logical level may not be (very well) suited for exchanging information at the physical level. More specifically, a transport protocol for the physical context only takes care of transporting the data between two neighboring devices, while a protocol for the logical context has to transport configuration data from and to a much larger number of devices. In the local context, error detection does not require any transport protocol, as data from only one device is to be analysed.
To detect error conditions, one needs to monitor a system in the appropriate context. For instance, if you change from the local to the physical level, you will have to deal with other data and other analysis parameters.
2. The findings gathered from error detection have to be processed in a way that allows a network operator to take appropriate corrective measures within a short time. To this end, the application must provide a descriptive error model that allows the detected misconfigurations and error conditions to be categorised.
For a modern Ethernet LAN, appropriate network management solutions are the obvious choice to overcome these challenges. This applies to both the detection of errors and the localisation of network problems as they occur.
Structuring of identifiable error conditions in contexts
To get more clarity in this topic, it is advisable to start by classifying the possible types of configuration errors according to the well-known ISO-OSI layer model.
* Layer 1: cabling, speed, duplex
* Layer 2: redundancy protocols (e.g. Spanning Tree)
* Layer 3: IP protocols, multicast
* Higher layers: application protocols
Figure 2 ‘VLAN Misconfiguration’: In a physical context, different VLAN IDs are configured on two switches. An analysis of the Port and Protocol VLAN information provided in the LLDP-EXT-DOT1-MIB allows differing VLAN settings between infrastructure devices to be detected.
Next, we examine how the different sets of data, protocols and functionalities from the different layers fit into the contextual representation. For instance, the configurations of directly interconnected network ports of infrastructure devices can be clearly assigned to the physical context. Conversely, routing protocols such as RIP can be clearly assigned to the logical context.
However, there are also protocols and functionalities that cannot be clearly assigned to a context, for instance protocols for handling IP multicast traffic. Furthermore, the OSI layers cannot always be mapped 1:1 to the defined contexts. For instance, the Layer 1 parameter “Duplex” listed above is irrelevant for the local context. For the physical context, however, i.e. the connection between two network devices, this parameter is very important, as a misconfiguration of this parameter at a network interface can easily result in reduced performance and/or the loss of the connection.
In theory, the concept of contextual classification is freely scalable; in practice, it is limited only by the boundaries of today’s data network technologies. The basic requirement for working error detection in each context is a transport protocol that matches the level of complexity of that context. For instance, LLDP is the right choice for error detection at the physical level, as discussed later in this article.
Contexts = complexity levels
A thorough analysis of the Hirschmann devices or, more generally, of network infrastructure devices such as Industrial Ethernet switches has shown that the three contexts referred to above (local, physical, logical) represent different levels of complexity in acquiring the information that is relevant for implementing an error detection system. If the system is applied to more complex devices and/or network structures, it may become necessary to define additional contexts, but for an initial approach in the field of Industrial Ethernet networks this classification is quite appropriate.
The cases where data acquisition for the detection of any misconfigurations is performed in a local device only, are called the “local context”, which requires no configuration information from other devices in the network and therefore no transport protocol either. Obviously these strictly local configuration errors – such as a default gateway or a time server address outside of the configured IP subnet of the management agent – could be intercepted when the configuration settings are made initially, but due to task sequence issues, this could rapidly become very difficult to program. Therefore it might be very useful to let a downstream function systematically check such configuration parameters for errors, be it only to simplify the software architecture of the devices checked.
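Such a downstream check for the local context could look roughly like the following Python sketch. It is only an illustration under stated assumptions, not the Hirschmann implementation: the function check_local_config and its parameters are hypothetical, and the configuration values are passed in directly instead of being read from a device.

```python
import ipaddress

def check_local_config(mgmt_ip, netmask, gateway, time_server=None):
    """Hypothetical local-context check: return a list of findings for one device."""
    # Derive the management subnet from the agent's own address and netmask.
    subnet = ipaddress.ip_network(f"{mgmt_ip}/{netmask}", strict=False)
    findings = []
    if ipaddress.ip_address(gateway) not in subnet:
        findings.append(f"default gateway {gateway} is outside {subnet}")
    if time_server is not None and ipaddress.ip_address(time_server) not in subnet:
        findings.append(f"time server {time_server} is outside {subnet}")
    return findings

print(check_local_config("192.168.1.10", "255.255.255.0", "192.168.2.1"))
```

Because only one device is examined, the check needs no transport protocol; it works purely on data that is already available locally.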
In the physical context, a misconfiguration affects the direct relation between two neighboring devices (point-to-point) and can usually be described through a simple set of rules, e.g. “If Device A is set to 10 Mbit/s half-duplex, the corresponding port on Device B must have the same setting; otherwise there is a misconfiguration.”
While easy to describe and handle on the programming side with rules, this kind of configuration mismatch is usually limited to local configuration settings, Layer 1, Layer 2 and a few Layer 3 protocols.
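The speed/duplex rule quoted above can be expressed as a small function over the settings of the two link ends. The following Python sketch is a minimal illustration; the names PortSettings and speed_duplex_rule are hypothetical and do not come from the Hirschmann software.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PortSettings:
    speed_mbit: int     # configured speed in Mbit/s
    full_duplex: bool   # True = full duplex, False = half duplex

def speed_duplex_rule(a, b):
    """Both ends of a point-to-point link must use the same speed and duplex mode."""
    errors = []
    if a.speed_mbit != b.speed_mbit:
        errors.append("speed mismatch")
    if a.full_duplex != b.full_duplex:
        errors.append("duplex mismatch")
    return errors

device_a = PortSettings(speed_mbit=10, full_duplex=False)
device_b = PortSettings(speed_mbit=10, full_duplex=True)
print(speed_duplex_rule(device_a, device_b))
```

Each further rule for the physical context would follow the same pattern: it takes the local and the neighboring port data and returns the mismatches it finds.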
On higher protocol levels, i.e. Layer 3 and higher, things become much more complicated, as you need to leave the physical context and enter the logical context. Configuration errors that affect certain protocols in both the physical and the logical context are handled according to where they occur: through the analysis and the transport protocol assigned to the context.
Beyond the strictly separated contexts, the interactions between individual contexts can be used to determine compound error conditions or to draw inferences regarding cause and effect. This can be illustrated by a simple example: A device is temporarily inaccessible through its Layer 3 interface (IP address), meaning that higher-level services may also be subject to interruptions. At the same time, the analysis component in charge of the physical context signals an error in the duplex configuration of the network interfaces. This results in a simple yet compound error condition: the error detected in the logical context arises from an error in the physical context (duplex mismatch).
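The example above can be sketched as a simple correlation step over the findings of the individual contexts. The article does not describe how the inference is implemented, so the heuristic used here (pairing logical-context findings with physical-context findings on the same device) and all names in the Python sketch are assumptions made purely for illustration.

```python
def correlate(findings):
    """Pair each logical-context finding with physical-context findings on the same device."""
    physical = [f for f in findings if f["context"] == "physical"]
    pairs = []
    for f in findings:
        if f["context"] != "logical":
            continue
        for cause in physical:
            if cause["device"] == f["device"]:
                pairs.append((f["problem"], cause["problem"]))
    return pairs

# Hypothetical findings reported by the per-context analysis components.
findings = [
    {"device": "switch-1", "context": "logical", "problem": "IP interface intermittently unreachable"},
    {"device": "switch-1", "context": "physical", "problem": "duplex mismatch on port 3"},
    {"device": "switch-2", "context": "physical", "problem": "VLAN mismatch on port 1"},
]
for effect, cause in correlate(findings):
    print(f"{effect} <- possibly caused by {cause}")
```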
Using LLDP for the Hirschmann solution
The tool that provides the misconfiguration detection functionality for the Hirschmann infrastructure devices essentially consists of programmed sets of rules and is therefore referred to as a “rule-based expert system”. The information to which the rules are applied is read directly from the devices involved using SNMP. The transport protocol intended for each context is used to transmit configuration data between individual devices; LLDP is used for the physical context. General information on how to use LLDP for configuration error detection is provided in Annex F of the LLDP standard; the references at the end of this article cover the basic functionality of LLDP and its use for error detection in more detail.
As transport protocol for the physical context, LLDP presents a number of benefits:
* LLDP Protocol Data Units (PDUs), the data frames sent by an LLDP-enabled device, can still pass over a direct connection between devices even when that connection is blocked and no longer available for regular communication (e.g. because of a VLAN misconfiguration). This makes it possible to recognise misconfigurations even though no network traffic is possible at higher protocol layers.
* LLDP PDUs are transmitted to directly neighboring devices only and therefore accurately represent a physical context, i.e. the direct connection between exactly two devices. As a result, each device holds both its own local data, as defined in the LLDP MIB, and the configuration data of its neighboring devices, so that the two can be compared.
Representing configuration errors
When applying the programmed configuration rules to the data base, the identified problems have to be visualised in a precise and graphically clear – i.e. ergonomic – way.
To this end, the global view of all misconfigurations detected is a decisive factor. The approach adopted for the error detection application does not use the misconfigurations as starting point for the visualisation, but first analyses how the condition of a network presents itself from the perspective of a user. The reason for this is that the loss of productivity possibly resulting from a configuration mismatch between network devices is due to the fact that the data network is no longer available to the user for his work. Therefore, any priorities set for the correction of multiple misconfigurations must be based on the goal of restoring the user’s work environment as quickly as possible and in the best possible condition.
Hence, a condition model with four possible conditions was developed for clear visualisation within the configuration error detection application. For quick and intuitive recognition, these conditions are color-coded (also refer to Figure 3):
1. No error detected = green symbol
2. An error was detected that could affect the network performance, but has no immediate impact on the availability of the network = yellow warning symbol
3. An error was detected that could affect the availability = red warning symbol
4. Insufficient data base, no error detection possible = grey symbol
For error detection in the context of a direct physical connection, this “traffic light logic” ensures that the configuration status regarding the connected neighboring device is clearly evident for each device port involved; and when a connection to a neighboring device is selected with the selection bar, the system displays a detailed description of all configuration errors detected.
If a port shows multiple errors, they are listed and described in the order of priority, with the highest priority at the top.
This ensures that, in the context of the scanned device, the error condition of a connection is clearly shown for each directly connected device. The highest-priority error condition can be read directly, while the detailed view displays all error conditions sorted by priority.
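The four-condition model and the priority ordering can be sketched as follows in Python. The names Severity, COLOURS and port_status are hypothetical, and the sketch assumes that availability errors rank above performance errors, which follows the order in which the article suggests correcting them.

```python
from enum import IntEnum

class Severity(IntEnum):
    PERFORMANCE = 1    # yellow: may affect network performance
    AVAILABILITY = 2   # red: may affect network availability

COLOURS = {Severity.PERFORMANCE: "yellow", Severity.AVAILABILITY: "red"}

def port_status(findings):
    """findings: list of (Severity, text), or None if no neighbor data is available."""
    if findings is None:
        return "grey", []      # insufficient data, no error detection possible
    if not findings:
        return "green", []     # no error detected
    # Highest priority first; the first entry determines the port colour.
    ordered = sorted(findings, key=lambda f: f[0], reverse=True)
    return COLOURS[ordered[0][0]], [text for _, text in ordered]

print(port_status([(Severity.PERFORMANCE, "example performance finding"),
                   (Severity.AVAILABILITY, "example availability finding")]))
```

The first element of the result corresponds to the colour shown at the port, the second to the detailed list displayed when the connection is selected.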
Figure 3 ‘Error Detection’: The error detection GUI component. The upper part displays the neighbouring devices, which can be selected; the text field underneath contains a detailed description of the detected misconfigurations.
Hence, the administrator is in a position to perform the necessary configuration changes on the devices, beginning with the highest-priority error, in order to first improve the availability of the network and then its performance.
As described above, static misconfiguration detection and visualisation by way of a rule-based expert system provides network integrators, network operators and service people with a very useful and convenient tool for verifying and, where required, improving the quality of the device configuration. The underlying taxonomy is scalable: the error detection functionality can be extended with new rule sets for individual contexts or with entire new contexts, as required by the network management software and the devices to be monitored.
The following describes a selection of typical configuration errors...
In practice, a frequent cause of problems is a duplex mismatch on Ethernet. This occurs when the automatic negotiation of speed and duplex mode fails. Auto-negotiation only works if both stations are configured for it. If one station is set to a fixed mode of operation, the other detects the speed of operation but falls back to half-duplex mode. This misconfiguration can easily be identified with LLDP, as the necessary data is provided in the lldpXdot3LocPortTable.
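The auto-negotiation case can be checked with a rule along the following lines. This Python sketch assumes that the auto-negotiation state and duplex mode of both link ends have already been retrieved (for example from the local port table named above and its remote counterpart); the SNMP retrieval is omitted, and the function name check_autoneg and the dictionary keys are hypothetical.

```python
def check_autoneg(local, remote):
    """local/remote: dicts with 'autoneg' (bool) and 'full_duplex' (bool, configured mode).

    Returns a description of the problem, or None if the rule finds nothing.
    """
    if local["autoneg"] == remote["autoneg"]:
        # Both negotiate, or both are fixed: only differing fixed modes are an error.
        if not local["autoneg"] and local["full_duplex"] != remote["full_duplex"]:
            return "duplex mismatch: both ends fixed, but to different duplex modes"
        return None
    # Exactly one end is fixed; the negotiating end falls back to half duplex.
    fixed = remote if local["autoneg"] else local
    if fixed["full_duplex"]:
        return "duplex mismatch: negotiating end falls back to half duplex, fixed end is full duplex"
    return "auto-negotiation disabled on one end only"

print(check_autoneg({"autoneg": True, "full_duplex": True},
                    {"autoneg": False, "full_duplex": True}))
```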
Virtual Local Area Network (VLAN) misconfigurations
An analysis of the Port and Protocol VLAN information provided in the LLDP-EXT-DOT1-MIB makes it possible to detect differing VLAN settings between infrastructure devices (also refer to Figure 2: VLAN Misconfiguration).
In this example, no direct connection is available because the devices are configured for entirely different VLANs. In most cases, however, it will still be possible to access one of the devices via SNMP. The misconfiguration can then be identified and resolved by (temporarily) changing the VLAN configuration of the device that is still accessible.
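Once the VLAN IDs of the local port and of the neighboring port are known, the comparison itself is a set difference. The Python sketch below assumes the two VLAN lists have already been read from the still accessible device; the function name vlan_mismatch and the sample VLAN IDs are hypothetical.

```python
def vlan_mismatch(local_vlans, remote_vlans):
    """Return the VLAN IDs that are configured on only one end of a link."""
    local_vlans, remote_vlans = set(local_vlans), set(remote_vlans)
    return {
        "only_local": sorted(local_vlans - remote_vlans),
        "only_remote": sorted(remote_vlans - local_vlans),
    }

result = vlan_mismatch([1, 10, 20], [1, 30])
print(result)
```

If both lists in the result are empty, the VLAN settings of the two ports match; otherwise the result shows directly which VLAN IDs have to be added or removed on which side.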
Profinet IO misconfigurations
The entries in the LLDP-EXT-PNO-MIB make it possible to detect misconfigurations within a Profinet IO network. For instance, you can compare the line propagation delays of local and neighboring devices, or their RT2/RT3 support and configurations.
Based on the transmitted Precision Time Protocol (PTP) Version 1 information, it is possible to detect misconfigurations that prevent correct time synchronisation within a PTP network. For instance, a synchronisation interval configured differently on two neighboring devices in the same PTP subdomain will result in neighboring devices being unable to synchronise. This can be detected by comparing the configured values.
Misconfigurations with redundancy protocols (e.g. Rapid Spanning Tree Protocol (RSTP), ring redundancy)
A comparison of the transmitted protocol identifiers (in the “lldpXdot1RemProtocolTable” and “lldpXdot1LocProtocolTable”) makes it possible to determine whether a (redundancy) protocol is configured at two directly neighboring device interfaces. Differing protocol settings, or a ring redundancy protocol configured at a port connected to an incompatible device or end device, can thus be detected and signaled as an error.
(4) Thesis by Oliver Kleineberg: “Analyse höherer Netzwerkprotokolle zur Konfigurationsfehlererkennung auf Basis des LLDP”, University Esslingen, Faculty of Computer Science, summer term 2008
(5) Markus Rentschler: “Die Entdeckung der Netzwerk-Topologie”, Elektronik 24/2005 (http://www.elektroniknet.de/index.php?id=lexikon_startshow.php?k=b&id=890)
(8) Bachelor Thesis by Marc Rufener: “PROFINET Konfigurationsfehlererkennung mittels LLDP”, FH Bern, summer term 2008 (http://book.bfh.ch/pdf/078.pdf)
About the authors...
MSc, Dipl.-Ing. (FH) Markus Rentschler studied Communications Engineering at the Konstanz University of Applied Sciences until 1993 and Digital Systems Engineering at the Heriot-Watt University in Edinburgh. He joined Hirschmann Automation & Control as developer in the field of embedded software in 1999 and has been leading the System test group since 2007. E-Mail: Markus.Rentschler@belden.com.
Dipl.-Ing. (FH) Oliver Kleineberg studied Technical Computer Science at the University of Esslingen until 2008. In his thesis he provided an analysis of existing protocols and functionalities on network infrastructure devices as regards their usability for configuration error detection purposes, as well as a fundamental description of implementation approaches and methods. He joined Hirschmann Automation & Control in September 2008 and works in the field of research and technical business development. E-Mail: Oliver.Kleineberg@belden.com