What if your critical services were monitored and automatically remediated whenever they went down or became unavailable? Flint can make this happen for you.

Let's look in more detail at how Flint, integrated with Nagios Core, can automate the downtime-to-uptime scenario within an IT infrastructure.

In this demo use case, we will see how to automatically fix an Apache server-down incident with the Flint IT Automation Platform.

Prerequisites

  • Nagios Core is installed on localhost, and the monitored Apache2 server is running on IP 192.168.2.11.
  • The Nagios web console is accessible at http://localhost/nagios3/ so that you can view the HTTP service being monitored.
  • Flint is installed and running on the Nagios server (Flint can also be installed on a separate host).

Things to do:

Let's first configure Nagios to monitor the HTTP server:

1. Create a new file with a .cfg extension in /etc/nagios3/conf.d/, so that the path of the new file is: /etc/nagios3/conf.d/http_server.cfg

2. In the http_server.cfg file, define the host that we want to monitor, as below:

define host{
  use                                generic-host ; Name of host template to use
  host_name                          webhost
  alias                              webhost
  address                            192.168.2.11
}

3. In the same http_server.cfg file, define the necessary parameters of the web service that we are monitoring:

# check web service
define service {
  use                                 generic-service
  host_name                           webhost
  service_description                 HTTPWebhost
  check_command                       check_http!-H $HOSTADDRESS$
  notification_interval               2 ; set > 0 if you want to be renotified
  event_handler                       notify-flint
}

In the service definition above, service_description names the HTTP service, check_command checks the Apache2 server at the specified IP address, and event_handler notifies Flint whenever the service changes state.

4. We need to define a command so that Flint knows which service it needs to fix. The following command is defined in the http_server.cfg file:

define command {
  command_name             notify-flint
  command_line             curl -X POST -H "Content-Type: application/json" -H "x-flint-username: admin" -H "x-flint-password: admin123" -d '{"servicename":"$SERVICEDISPLAYNAME$","servicestate": "$SERVICESTATE$", "hostname":"$HOSTNAME$","hoststatetype":"$HOSTSTATETYPE$","hostattempt":"$HOSTATTEMPT$", "servicedesc": "$SERVICEDESC$", "servicestateid":"$SERVICESTATEID$","serviceeventid":"$SERVICEEVENTID$","serviceproblemid":"$SERVICEPROBLEMID$","servicelatency":"$SERVICELATENCY$","serviceexecutiontime":"$SERVICEEXECUTIONTIME$","serviceduration":"$SERVICEDURATION$","hostaddress":"$HOSTADDRESS$"}' 'http://localhost:3501/v1/bit/run/example:restart.rb'
}

The input parameters defined in the JSON of the cURL command will tell the Flint platform in detail which service needs to be restarted in order to avoid further disruptions.
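To make the payload concrete, here is a minimal plain-Ruby sketch of what the request body might look like once Nagios has expanded its macros. The values are hypothetical and only illustrate the host and service defined above; the real values depend on your environment, and only a subset of the fields is shown.

```ruby
require 'json'

# Hypothetical request body after macro expansion (illustrative values only).
payload = <<~JSON
  {"servicename":"HTTPWebhost","servicestate":"CRITICAL","hostname":"webhost",
   "hoststatetype":"HARD","hostattempt":"1","servicedesc":"HTTPWebhost",
   "servicestateid":"2","hostaddress":"192.168.2.11"}
JSON

input = JSON.parse(payload)

# The curl command quotes every macro, so numeric fields such as
# servicestateid arrive as strings rather than numbers.
puts "#{input['servicedesc']} on #{input['hostaddress']} is #{input['servicestate']}"
# prints: HTTPWebhost on 192.168.2.11 is CRITICAL
```

Because every field is a string, a flintbit that compares state IDs would need to compare against "2" rather than 2, or convert the value first.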

5. After saving the changes to the http_server.cfg file, restart the Nagios Core service by running the following command in an Ubuntu terminal:

sudo service nagios3 restart

6. Now, to test the whole scenario, bring the Apache2 server down by running the command below in an Ubuntu terminal:

sudo service apache2 stop

The above command stops the HTTP service running on that server. The Nagios Core web console will show the same, with the current servicestate as “CRITICAL”.

[Screenshot: Service state information]

7. Log in to Flint to enable the ssh connector: go to Connectors in the left-side navigation, click on the action tab, and enable the ssh connector.

[Screenshot: Enable ssh connector]

8. Go to Logs and check the logs console. As soon as the service goes “CRITICAL”, the event handler notifies Flint, and the notification is displayed in the logs.

[Screenshot: Logs status]

As we can see in the logs, servicestate:"CRITICAL" means that the Apache2 server has gone down.

9. At this moment, the cURL command given in the command_line of the http_server.cfg file triggers the ‘restart.rb’ flintbit, which restarts the server right away:

curl -X POST -H "Content-Type: application/json" -H "x-flint-username: admin" -H "x-flint-password: admin123" -d '{"servicename":"$SERVICEDISPLAYNAME$","servicestate":"$SERVICESTATE$","hostname":"$HOSTNAME$","hoststatetype":"$HOSTSTATETYPE$","hostattempt":"$HOSTATTEMPT$","servicedesc":"$SERVICEDESC$","servicestateid":"$SERVICESTATEID$","serviceeventid":"$SERVICEEVENTID$","serviceproblemid":"$SERVICEPROBLEMID$","servicelatency":"$SERVICELATENCY$","serviceexecutiontime":"$SERVICEEXECUTIONTIME$","serviceduration":"$SERVICEDURATION$","hostaddress":"$HOSTADDRESS$"}' \
  'http://localhost:3501/v1/bit/run/example:restart.rb'

The cURL command POSTs the input parameters defined in the JSON body to the Flint platform and triggers the ‘restart.rb’ flintbit, which restarts the Apache2 server.
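If you want to exercise the flintbit by hand without waiting for Nagios, the same request can be assembled in plain Ruby. The following is a rough sketch using Ruby's standard Net::HTTP library. The helper name build_flint_request is hypothetical, the URL, headers, and credentials are copied from the cURL command above, and only two of the input parameters are included for brevity.

```ruby
require 'net/http'
require 'json'
require 'uri'

# Hypothetical helper: builds (but does not send) the POST request
# that the notify-flint command issues via curl.
def build_flint_request(url, username, password, params)
  uri = URI.parse(url)
  request = Net::HTTP::Post.new(uri)
  request['Content-Type'] = 'application/json'
  request['x-flint-username'] = username
  request['x-flint-password'] = password
  request.body = JSON.generate(params)
  [uri, request]
end

uri, request = build_flint_request(
  'http://localhost:3501/v1/bit/run/example:restart.rb',
  'admin', 'admin123',
  { 'servicestate' => 'CRITICAL', 'hostaddress' => '192.168.2.11' }
)

# To actually send it (requires a running Flint instance on localhost:3501):
# response = Net::HTTP.start(uri.hostname, uri.port) { |http| http.request(request) }
# puts response.code
```

Keeping the request construction separate from the sending step makes it easy to inspect the headers and body before anything touches the Flint server.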

[Screenshot: Logs status]

By calling the ssh connector in the ‘restart.rb’ flintbit, we brought the server back up:

@log.info("input is: " + @input.to_s)

# Read the values sent by Nagios in the JSON request body
servicename          = @input.get("servicename")
hostname             = @input.get("hostname")
servicestate         = @input.get("servicestate")
hoststatetype        = @input.get("hoststatetype")
hostattempt          = @input.get("hostattempt")
servicedesc          = @input.get("servicedesc")
servicestateid       = @input.get("servicestateid")
serviceeventid       = @input.get("serviceeventid")
serviceproblemid     = @input.get("serviceproblemid")
servicelatency       = @input.get("servicelatency")
serviceexecutiontime = @input.get("serviceexecutiontime")
serviceduration      = @input.get("serviceduration")
hostaddress          = @input.get("hostaddress")

if servicestate == "CRITICAL"                          # service has gone down
  # Call the ssh connector to start apache2 on the affected host
  response = @call.connector("ssh")
                  .set("target", hostaddress)
                  .set("type", "exec")
                  .set("username", "webhost")
                  .set("password", "webhost1")
                  .set("command", "sudo service apache2 start")
                  .set("timeout", 60000)
                  .sync

  # SSH connector response parameter
  result = response.get("result")
  @log.info(result.to_s)
end

The above flintbit calls the ssh connector to connect remotely to the server whose service has gone into the “CRITICAL” servicestate and starts Apache2 again.
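Nagios also runs event handlers when a service recovers, so the flintbit is triggered again once the state returns to “OK”; the servicestate check is what keeps that second call from restarting anything. As a small sketch, the decision can be pulled into a plain-Ruby helper that is testable outside Flint. The name remediation_command is hypothetical and not part of the Flint API.

```ruby
# Hypothetical helper: returns the command to run for a given Nagios
# service state, or nil when no remediation is needed.
def remediation_command(servicestate)
  case servicestate
  when 'CRITICAL' then 'sudo service apache2 start'
  else nil
  end
end

puts remediation_command('CRITICAL').inspect  # prints: "sudo service apache2 start"
puts remediation_command('OK').inspect        # prints: nil
```

Further states, such as “WARNING”, could get their own branch in the case expression without touching the ssh connector call.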

10. The recovery is then reflected on the Nagios Core console, with the current servicestate as “OK”.

[Screenshot: Nagios servicestate]

Bang on! Your service is up and running again. This is how the Flint-Nagios integration can make things simple for an IT infrastructure by avoiding manual intervention: Nagios Core monitors the infrastructure, and Flint remediates problems in real time.

Conclusion:

With Flint and Nagios, it's very easy to set up alert- and event-driven automation with just a few lines of code. Flint’s flexible workflow engine lets you implement custom handling of events, and it can also be integrated with service desks and other notification systems.

Out-of-the-box connectors help to integrate with, or invoke actions on, various systems, applications, network devices, and cloud infrastructure.

To know more about Flint capabilities, visit http://www.getflint.io or share your automation requirements with us at info@infiverve.com

Flint is free to use. Download it today!
