EV Observe - Fix Box Unavailability Issues

Last modified on 2022/08/11 19:13

Each Box deployed on the EV Observe platform must be monitored, both by itself (self-monitoring) and by another Box (cross-monitoring), to ensure that it works correctly and that the platform runs smoothly. See Monitor a Box.

At times, however, the Box may become unavailable. When this happens, perform corrective actions to reestablish communication between the Box and the EV Observe platform.

Procedure for fixing Box unavailability issues

1. Identify the issue encountered on the Box.

     See Common issues encountered

2. Perform the corrective actions to fix the problem.

     See How to fix Box unavailability issues

3. If the problem persists, declare an incident on the EasyVista Support site and provide the required information.

4. Once the information you provided has been analyzed, the EasyVista Support team will send you a solution to the unavailability issue.

Common issues encountered

Issue encountered | Corrective actions
Unable to connect to the VPN tunnel | See the procedure called Check network access
High latency of the Box across the VPN tunnel | See the procedure called Check network access
Disruptive loss of connection | See the procedure called Check network access
Undefined status for all control points |
Outdated timestamping of controls run by the Box | See the procedure called Restart the remoteOperationBox and nagios processes
Unable to reload the configuration for the Box | See the procedure called Restart the remoteOperationBox and nagios processes
Acknowledgments not taken into account | See the procedure called Restart the remoteOperationBox and nagios processes
Immediate controls run in the Web app not taken into account | See the procedure called Restart the remoteOperationBox and nagios processes

How to fix Box unavailability issues

Check network access

Step 1: Check the performance of the Box

1. Check the CPU load, RAM usage, and disk space of the unstable Box.

2. If required, allocate additional CPU, RAM, or disk space to the Box.
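
The checks in Step 1 can be run with standard Linux tools. The sketch below is one possible approach and is not part of the EV Observe tooling: the function names disk_used_pct and disk_is_low and the 90 % threshold are illustrative assumptions, and the commented commands assume that uptime and free are installed on the Box.

```shell
#!/bin/sh
# Quick look at the three resources (run these on the unstable Box):
#   uptime     # load averages over 1, 5 and 15 minutes
#   free -m    # RAM and swap usage in MiB
#   df -h      # disk usage per file system

# Print the used-space percentage (digits only) of the file system
# that holds the given path. Defaults to the root file system.
disk_used_pct() {
    df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed when usage is at or above the threshold.
# The 90 % default is a hypothetical value, not an EasyVista recommendation.
disk_is_low() {
    [ "$(disk_used_pct "${1:-/}")" -ge "${2:-90}" ]
}
```

For example, disk_is_low / 90 && echo "Disk space is low" prints a warning when the root file system is at least 90 % full.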
 

Step 2: Check the timestamping function of the Box

1. Run the command below to check that the time of the unstable Box is accurate.

date

2. If required, correct the date and time.
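
To quantify how far the clock has drifted, the time on the Box can be compared with a reference time read on a trusted machine. A minimal sketch, assuming the reference is supplied as a Unix epoch value (for example the output of date -u +%s on the other machine); the function name clock_skew_seconds is hypothetical.

```shell
#!/bin/sh
# Print the absolute difference, in seconds, between a reference epoch
# and the local clock. An optional second argument overrides the local
# time, which is convenient for dry runs.
clock_skew_seconds() {
    ref=$1
    now=${2:-$(date -u +%s)}
    diff=$((now - ref))
    if [ "$diff" -lt 0 ]; then
        diff=$((0 - diff))
    fi
    echo "$diff"
}
```

Running clock_skew_seconds with the reference epoch on the Box prints the difference in seconds. This page does not state what difference is acceptable.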
 

Step 3: Check the firewall rules

1. Check that no firewall rule was recently modified or deleted.

2. If required, correct the rules.
 

Step 4: Check the VPN port connection of the Box

1. Check that the unstable Box is able to access the EV Observe VPN port for traffic to the central platform. Depending on your platform, run one of the commands below.

  • Platform: https://servicenav.io

    telnet vpn.servicenav.io $(awk -F '[ ]' 'NR==42 {print int($3)}' /etc/openvpn/client.conf)

  • Platform: https://azure.servicenav.io

    telnet vpn-azure.servicenav.io $(awk -F '[ ]' 'NR==42 {print int($3)}' /etc/openvpn/client.conf)

  • Platform: On-premises

    telnet <platform-public-ip> <port>

The result below will appear if the Box is able to access the EV Observe VPN port.
         [Screenshot: Check network access - VPN tunnel OK]

2. If required, correct the configuration of the firewall.
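
In the commands above, the awk expression prints the integer value of the third space-separated field on line 42 of /etc/openvpn/client.conf, and that value is passed to telnet as the port. The sketch below shows an alternative that does not depend on the line number. It assumes that the file declares the server with the standard OpenVPN "remote <host> <port>" directive, which this page does not confirm, and that bash and the coreutils timeout command are available on the Box. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Print the port of the first "remote <host> <port>" line in an
# OpenVPN client configuration (assumed layout, see above).
vpn_port_from_conf() {
    awk '$1 == "remote" && $3 ~ /^[0-9]+$/ { print $3; exit }' \
        "${1:-/etc/openvpn/client.conf}"
}

# Succeed if a TCP connection to host:port opens within the timeout.
# Relies on bash's /dev/tcp redirection; the default timeout is 5 seconds.
tcp_port_open() {
    timeout "${3:-5}" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Example (not executed here):
#   tcp_port_open vpn.servicenav.io "$(vpn_port_from_conf)" && echo reachable
```

Unlike telnet, tcp_port_open returns immediately with an exit status, which makes it easier to use in a script.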
 

Step 5: Check the LAN IP address of the Box

1. Check that the LAN IP address of the unstable Box was not assigned to another machine within the same network.

Restart the remoteOperationBox and nagios processes

  • The remoteOperationBox process is in charge of sending and receiving messages between the Box and the central platform. If it is not working correctly:
    • Monitored data collected by the Box will not be sent to the central platform.
    • Actions performed in the Web app will not be pushed to the Box.
  • The nagios process is in charge of scheduling control points. It communicates with the remoteOperationBox process to apply the immediate controls and acknowledgments requested in the Web app.
     

Step 1: Log in to the Box

1. Log in to the unstable Box using an SSH client.
 

Step 2: Stop the remoteOperationBox process

1. Run the command below to stop the remoteOperationBox process.

service remoteOperationBox stop

2. Run the command below to check that no remoteOperationBox process is still running.

ps aux | grep remoteOperationBox

3. If any process instances are still running, run the commands below to kill them manually.

Replace <id> with the ID of the process instance.

kill <id>

or, if the instance is still running:

kill -9 <id>
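
The two kill commands above can be combined into one helper that first sends the default TERM signal and only falls back to kill -9 if the instance is still present after a grace period. This is a sketch under assumptions: the function name terminate_pid and the 10-second default grace period are illustrative and are not taken from the EasyVista documentation.

```shell
#!/bin/sh
# Ask a process to terminate, then force-kill it if it is still present
# after the grace period (in seconds, default 10).
terminate_pid() {
    pid=$1
    grace=${2:-10}
    # Send TERM; if this fails, the process is already gone.
    kill "$pid" 2>/dev/null || return 0
    waited=0
    # kill -0 sends no signal; it only tests whether the PID still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$grace" ]; then
            kill -9 "$pid" 2>/dev/null
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

For each ID reported by the ps command, run terminate_pid with that ID, for example terminate_pid 1234 10, where 1234 is a placeholder.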

Step 3: Stop the nagios process

1. Run the command below to stop the nagios process.

service nagios stop

2. Run the command below to check that no nagios process is still running.

ps aux | grep nagios

Note: The nagios process may take some time to stop. If this is the case, run the ps command several times.

3. If any process instances are still running, run the commands below to kill them manually.

Replace <id> with the ID of the process instance.

kill <id>

or, if the instance is still running:

kill -9 <id>

The remoteOperationBox and nagios processes are now stopped: the ps command no longer lists any running instance of either process.
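
Because the nagios process may take some time to stop, the repeated ps check can be replaced by a small polling loop. A minimal sketch, assuming that pgrep (from procps) is installed on the Box and that the process name is exactly nagios; the function name and the 30-second default timeout are hypothetical. Note that pgrep -x compares against the process name, which Linux truncates to 15 characters, so a name as long as remoteOperationBox would need a different match (for example pgrep -f).

```shell
#!/bin/sh
# Poll once per second until no process with the given name remains.
# Returns 0 when the process is gone, 1 if the timeout (seconds) expires.
wait_until_stopped() {
    name=$1
    timeout_s=${2:-30}
    elapsed=0
    while pgrep -x "$name" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$timeout_s" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Example (run on the Box):
#   service nagios stop
#   wait_until_stopped nagios 60 || echo "nagios is still running"
```

If the function returns 1, the remaining instances can be killed manually as described in step 3.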
 

Step 4: Restart the processes

1. Run the command below to restart the nagios process.

service nagios start

2. Run the command below to restart the remoteOperationBox process.

service remoteOperationBox start

  • Check that the six process instances are running.
             [Screenshot: Check remoteOperationBox process - Instances OK]

3. Check that the Web app is now working correctly.
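
The instance check in step 2 can be scripted by counting the matching lines of ps aux. In the sketch below, the pattern [r]emoteOperationBox is a common grep idiom that prevents the grep command from matching its own entry in the process list. The expected count of six comes from this page; the function name count_instances is hypothetical.

```shell
#!/bin/sh
# Count the lines on standard input that match the given pattern.
# Always exits 0, even when the count is 0.
count_instances() {
    grep -c -- "$1" || true
}

# Example (run on the Box):
#   n=$(ps aux | count_instances '[r]emoteOperationBox')
#   [ "$n" -eq 6 ] || echo "Expected 6 instances, found $n"
```

A count other than six suggests that the remoteOperationBox process did not start completely, in which case the stop and restart steps can be repeated.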
