How can I troubleshoot SNMP communications issues?
Troubleshooting SNMP communications issues with monitored devices.
StruxureWare Data Center Expert
There are a number of reasons StruxureWare Data Center Expert (DCE) may not be able to communicate with a device. This could occur during discovery or communications could be lost some time later. The following document outlines configuration options as well as what you may want to check to verify proper SNMP communications.
StruxureWare DCE can communicate with APC and non-APC SNMP devices alike. Since different devices may have different configuration options, or may have similar options in different places, this document focuses on device configuration for APC devices.
Initial Configuration for discovery:
The device must first have SNMP enabled. On version 6 firmware on an APC UPS, this would be found under the menu:
Configuration -> Network -> SNMP V1 -> Access
For SNMP version 3, you would look under the menu:
Configuration -> Network -> SNMP V3 -> Access
Once enabled we need the read community name to access information on the device using SNMP version one. For “Priority Scanning” of APC devices we also need the write community name. On version 6 firmware on an APC UPS, this would be found under:
Configuration -> Network -> SNMP V1 -> Access Control
There are listings for the community names, the type of access (read/write), and a place to specify an NMS IP. If an IP or range is specified, only those systems will be allowed to use these community names. For troubleshooting purposes, this is best left at all zeros, which allows any system to read the device. After communications are established, these options can be tightened.
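The effect of the NMS IP setting can be sketched in a few lines of Python. This is only an illustration of the filtering idea, not the card's actual code. The function name nms_permits is hypothetical, and treating an octet of 255 as a wildcard for a whole segment is an assumption based on how APC network management cards are commonly documented; check the user guide for your card before relying on it.

```python
import ipaddress


def nms_permits(nms_setting: str, source_ip: str) -> bool:
    """Return True if a poller at source_ip would pass the NMS IP filter."""
    allowed = ipaddress.IPv4Address(nms_setting)
    source = ipaddress.IPv4Address(source_ip)

    # All zeros means "any system may use this community name".
    if allowed == ipaddress.IPv4Address("0.0.0.0"):
        return True

    # Assumption: an octet of 255 in the setting matches any value in
    # that position, so 10.1.2.255 would admit the whole 10.1.2 segment.
    for wanted, actual in zip(allowed.packed, source.packed):
        if wanted != 255 and wanted != actual:
            return False
    return True


if __name__ == "__main__":
    # The address to test is the one the device actually sees, which may
    # differ from the server's own address if NAT is in the path.
    print(nms_permits("0.0.0.0", "10.1.2.3"))
    print(nms_permits("10.1.2.50", "10.1.2.3"))
```

When checking a real setting, pass the source address as the device sees it, since an address rewritten by Network Address Translation is the one that gets compared.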
For SNMP version 3 configuration, you would go to the following menu:
Configuration -> Network -> SNMP V3 -> User Profiles
Security for SNMP version 3 includes User Name, Authentication Passphrase, Privacy Passphrase, Authentication Protocol, and Privacy Protocol. These values must be noted and will be required to communicate with StruxureWare if you decide to use this version of SNMP. Only one version should be enabled on an APC device at a time. Once configured, you can enable each profile individually and associate it with a specific NMS IP in the same way you did with SNMP version one. This is configured under:
Configuration -> Network -> SNMP V3 -> Access Control
StruxureWare also needs a Device Definition File, or DDF file. Without a DDF file, the device may not be reported as having lost communication, but no sensors will be available. StruxureWare cannot load a MIB file directly. A DDF file is an XML file that contains the SNMP OID information for specific sensors as well as the proper formatting. StruxureWare should already contain the DDF files for APC devices, for some other well-known manufacturers, and a generic UPS DDF. If StruxureWare does not have the proper DDF, you can e-mail the team at Multivendorsupport@apcc.com to have one created.
If a DDF is sent to you and needs to be loaded, you can go to the following menu:
Device -> SNMP Device Communication Settings -> Device Definition Files. You can see which files are already loaded, or click the “Add/Update Definitions…” button, select the “Local File” option, and click the “Browse” button to find the file on your local system.
Once the device configuration is noted, StruxureWare DCE needs to know these parameters in order to communicate with the device. These settings are entered during discovery of the device or range of devices. Click “Device” and then “Add Devices”, or right-click and choose “Add Devices” from the monitoring perspective. Here you can choose SNMPv1 or SNMPv3 to add an SNMP device. Please note that SNMPv2c is not supported. Select the proper version and click “Next”.
Here you can add an IP or range of IPs. If adding a range, only devices using the same community names or SNMP version 3 authentication and encryption settings will be discovered. Incorrect community names will result in an error on the (APC) device noting unauthorized access. Please note there is an option to change the port. Port 161 is the default port and in the case of an APC device, this cannot be changed. Click “Next” if you want to schedule the discovery or “Finish”.
If you hit “Next”, you can now choose to put the device in a specific group or folder. Hit “Next” if you want to schedule the discovery or “Finish”.
If you hit “Next”, you can now enable discovery scheduling. You can run this discovery any day of the week at a specific time. This may be useful if multiple devices are added to the range. When adding a single device, scheduling will not be helpful and will only cause additional discoveries to be run on devices that you have already discovered. You can also choose to run the discovery now and hit “Finish”.
If you have chosen to run the discovery now, the discovery should be saved and should run. If you had hit finish on previous screens, the discovery will not run but should still be saved. In either case the device discovery can be run manually. If the saved devices window is not showing, be sure you are on the monitoring perspective and go to the following menus:
The discovery should show when it was last run, or “Never” if it has not been run. You can right-click the discovery to edit it or run it from this window.
Troubleshooting Failed Discoveries:
The most common reason for a failed discovery is incorrect security parameters. Verify that the SNMP read community name matches the device you are trying to discover. In SNMP version 3, all parameters must match the settings on the device. Be sure to verify there is no specific NMS IP associated with the security profiles or community names. An incorrect IP or any type of Network Address Translation could cause this feature to block SNMP traffic. On an APC Network Management Card, the logs should indicate if a system tried to access the device with incorrect security parameters. The IP of the system that attempted this action should also be in the event that was logged.
You may also want to note the port used during the discovery. If this does not match the port configured on the device, discovery will probably fail. As mentioned before, APC devices use the default port 161. Third-party device configurations, however, may vary.
If all your security and port settings match and you have ruled out ports, profiles, and community names as the issue, you may want to simplify the settings. If you are using a long, complex community name with special characters, try using something simple for testing purposes. There are no known special characters that will cause discovery to fail, but we cannot rule out issues with the device. For SNMP version 3, try using a different set of security options or potentially no encryption. If SNMP version 3 continues to fail, testing with the simpler SNMP version 1 can help rule out version 3 configuration issues.
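Because every SNMP version 3 parameter has to match, it can help to write down both sides and compare them field by field. The following is a minimal sketch of that comparison, assuming you have copied the values by hand from the device and from StruxureWare. The class and function names (SnmpV3Profile, mismatched_fields) and the sample values are hypothetical and are not part of either product.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class SnmpV3Profile:
    """The five SNMPv3 security values listed for a user profile."""
    user_name: str
    auth_protocol: str
    auth_passphrase: str
    priv_protocol: str
    priv_passphrase: str


def mismatched_fields(device: SnmpV3Profile, server: SnmpV3Profile) -> List[str]:
    """Return the names of the fields that differ, without printing secrets."""
    return [
        f.name
        for f in fields(SnmpV3Profile)
        if getattr(device, f.name) != getattr(server, f.name)
    ]


if __name__ == "__main__":
    # Sample values only; substitute what is configured on each side.
    on_device = SnmpV3Profile("dce_user", "SHA", "auth-pass-1", "AES", "priv-pass-1")
    on_server = SnmpV3Profile("dce_user", "MD5", "auth-pass-1", "AES", "priv-pass-1")
    print(mismatched_fields(on_device, on_server))
```

The comparison is exact, so differences in case or a trailing space in a passphrase are reported as mismatches, which is usually what you want when chasing a failed discovery.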
If all of this fails, you may want to look at the network. A packet capture can be helpful in this case. Tools such as “Wireshark” can be used to capture network traffic and can help determine if the SNMP packets are actually getting to the device and, if so, whether the device is responding properly. A system running a packet capture near the StruxureWare server will show if StruxureWare is sending and receiving, while a packet capture closer to the device will show if the network is allowing the SNMP packets from StruxureWare all the way to the device. Keep in mind that although utilities such as ping will show if basic network traffic can get from one system to another, this does not rule out the possibility of networks blocking specific ports or protocols.
Troubleshooting lost communication after discovery:
If an SNMP device has had its IP changed, it will go into a lost communications state. A device cannot be rediscovered using a different IP without the loss of data from the old IP. In order to maintain communications and keep historical data, do not change device IPs.
Assuming the IP has not changed, another potential change that could cause issues is security. If the community names have been changed on the device but not in StruxureWare, the systems will fail to communicate. To change this information in StruxureWare, go to the following menu item:
Device -> SNMP Device Communication Settings -> Device Scan Settings
Select the specific device IP having a lost communication issue and click “Edit Device Scan Settings”. At this point, you can change the port, timeouts and retries, as well as the security information. Make sure this matches what is on the card. As discussed for device discovery above, if you are having issues with the new security settings, try something simple for testing purposes.
Lost communications issues can also be caused by a change in the network configuration. Firewalls are sometimes reconfigured without notice to block ports or traffic from specific IPs. A packet capture can again come in handy here. We can usually tell whether the network is blocking traffic by looking at what is being sent and, if it is sent, what is being received.
Intermittent lost communications issues:
Intermittent communications issues can usually be attributed to traffic issues. This could be network related or it could be based on the device itself. If there is too much network traffic or too many systems are trying to poll a single device at the same time, the UDP packets used by SNMP may not be able to get through to the device or have a chance to be answered by the device.
You can increase timeouts and retries, and this may allow the systems to recover from communication losses caused by such traffic. You can increase these settings in StruxureWare here:
Device -> SNMP Device Communication Settings -> Device Scan Settings
Here you can select specific devices and change timeouts and retries. You can also increase the scan interval for that specific device if you want to poll that device less often. You can also change the global scan settings so that all SNMP devices are polled less often to decrease the total amount of traffic created by StruxureWare itself.
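To see why raising timeouts and retries helps on a busy network, and what it costs, here is a generic sketch of how a poller typically applies the two settings. It is not StruxureWare's implementation. It assumes that "retries" means additional attempts after the first one and that each attempt waits for the full timeout; the function names and the send_once callback are hypothetical.

```python
from typing import Callable, Optional


def poll_with_retries(send_once: Callable[[float], Optional[bytes]],
                      timeout: float, retries: int) -> Optional[bytes]:
    """Try one request, then up to `retries` more, stopping at the first reply."""
    for _attempt in range(retries + 1):
        reply = send_once(timeout)  # returns None when the attempt timed out
        if reply is not None:
            return reply
    return None


def worst_case_wait(timeout: float, retries: int) -> float:
    """Seconds spent on one device before giving up, under the assumptions above."""
    return timeout * (retries + 1)


if __name__ == "__main__":
    # With a 2 second timeout and 3 retries, an unresponsive device
    # ties up the poller for this long before being reported:
    print(worst_case_wait(2.0, 3))
```

Larger values give a slow or overloaded device more chances to answer, but they also lengthen the time spent waiting on devices that are truly offline, which is one reason to raise them per device rather than globally where possible.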
If available on the device, you can also specify an NMS IP. APC devices have this feature when configuring the community name or profile. Setting a range or specific IP will stop other systems from being able to poll the device in question. This may or may not be a permanent fix, but it can also be useful for testing.