Building a high-available failover cluster with Pacemaker, Corosync & PCS

When running mission-critical services, you don’t want to depend on a single (virtual) machine to provide those services. Even if your systems never crashed or hung, from time to time you would still need to do some maintenance and restart some services or even the whole machine. Fortunately, clusters were designed to overcome these problems and make it possible to reach close to 100% uptime for your services.

Introduction

There are a lot of different scenarios and types of clusters, but here I will focus on a simple two-node high-availability cluster that serves a website. The focus is on availability, not on balancing the load over multiple nodes or improving performance. Of course, this example can be expanded or customized to fit your own requirements.

To reach the service(s) offered by our simple cluster, we will create a virtual IP which represents the cluster nodes, regardless of how many there are. The client only needs to know our virtual IP and doesn’t have to care about the “real” IP addresses of the nodes or which node is the active one.

In a stable situation, our cluster should look something like this:

[Figure cluster_normal: node 01 owns the virtual IP and serves the website, node 02 is passive]

There is one owner of the virtual IP, in this case node 01. The owner of the virtual IP also provides the service for the cluster at that moment. A client that is trying to reach our website via 192.168.202.100 will be served the webpages from the webserver running on node 01. In the above situation, the second node is not doing anything besides waiting to take over in case node 01 fails. This scenario is called active-passive.

In case something happens to node 01 (the system crashes, the node is no longer reachable or the webserver isn’t responding anymore), node 02 will become the owner of the virtual IP and start its webserver to provide the same services as were running on node 01:

[Figure cluster_failure: node 02 owns the virtual IP and serves the website]

For the client, nothing changes since the virtual IP remains the same. The client doesn’t know that the first node is no longer reachable and sees the same website as before (assuming that the webservers on node 01 and node 02 serve the same webpages).

When we need to do some maintenance on one of the nodes, we can manually switch the virtual IP and service owner to the other node, do our maintenance on the first node, switch back and then do our maintenance on the second node. All of this without downtime.

Building the cluster

To build this simple cluster, we need a few basic components:

  • Service which you want to be always available (webserver, mailserver, file-server,…)
  • Resource manager that can start and stop resources (like Pacemaker)
  • Messaging component which is responsible for communication and membership (like Corosync or Heartbeat)
  • Optionally: file synchronization which will keep filesystems equal on all cluster nodes (with DRBD or GlusterFS)
  • Optionally: Cluster manager to easily manage the cluster settings on all nodes (like PCS)

The example is based on CentOS 7 but should work without modifications on basically all el6 and el7 platforms and with some minor modifications on other Linux distributions as well.

The components we will use are Apache (webserver) as our service, Pacemaker as resource manager, Corosync for messaging (Heartbeat is considered deprecated since CentOS 7) and PCS to manage our cluster easily.

In the examples given, pay attention to the host where the command is executed since that can be critical in getting things to work.

Preparation

Start by configuring both cluster nodes with a static IP address and a sensible hostname, and make sure that they are in the same subnet and can reach each other by node name. This seems very obvious, but it is easily forgotten and can cause problems later down the road.

Firewall

Before we can take any actions for our cluster, we need to allow cluster traffic through the firewall (if it’s active on any of the nodes). The details of these firewall rules can be found elsewhere. Just assume that this is what you have to open:

Open UDP-ports 5404 and 5405 for Corosync:

Open TCP-port 2224 for PCS

Allow IGMP-traffic

Allow multicast-traffic

Save the changes you made to iptables:
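The rules listed above could be expressed with iptables roughly as follows. This is a sketch, assuming the iptables service (not firewalld) manages the firewall on both nodes. The run helper is a hypothetical dry-run wrapper: it only prints each command unless APPLY=1 is set.

```shell
# Dry run by default: commands are printed, not executed. Set APPLY=1 to run them.
run() { if [ "${APPLY:-0}" = 1 ]; then "$@"; else printf '%s\n' "$*"; fi; }

# Run on both nodes.
open_cluster_ports() {
  # Corosync cluster communication (UDP 5404 and 5405)
  run sudo iptables -I INPUT -m state --state NEW -p udp -m multiport --dports 5404,5405 -j ACCEPT
  # PCS daemon (TCP 2224)
  run sudo iptables -I INPUT -p tcp -m state --state NEW -m tcp --dport 2224 -j ACCEPT
  # IGMP and multicast traffic
  run sudo iptables -I INPUT -p igmp -j ACCEPT
  run sudo iptables -I INPUT -m addrtype --dst-type MULTICAST -j ACCEPT
  # Persist the rules (assumes the iptables service scripts are installed)
  run sudo service iptables save
}
open_cluster_ports
```

Review the printed commands first, then rerun with APPLY=1 on each node.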

When testing the cluster, you could temporarily disable the firewall to be sure that blocked ports aren’t causing unexpected problems.

Installation

After setting up the basics, we need to install the packages for the components that we planned to use:
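A minimal sketch of the installation step, assuming the three packages are named after the components and are available from the standard repositories. The run helper is a hypothetical dry-run wrapper that prints the command unless APPLY=1 is set.

```shell
# Dry run by default: the command is printed, not executed. Set APPLY=1 to run it.
run() { if [ "${APPLY:-0}" = 1 ]; then "$@"; else printf '%s\n' "$*"; fi; }

# Run on both nodes.
install_cluster_packages() {
  run sudo yum install corosync pcs pacemaker
}
install_cluster_packages
```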

To manage the cluster nodes, we will use PCS. This allows us to have a single interface to manage all cluster nodes. By installing the necessary packages, Yum also created a user, hacluster, which can be used together with PCS to do the configuration of the cluster nodes. Before we can use PCS, we need to configure public key authentication or give the user a password on both nodes:
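Giving the hacluster user a password could look like this. It is a sketch of the password variant only; passwd prompts interactively, and the run helper is a hypothetical dry-run wrapper that prints the command unless APPLY=1 is set.

```shell
# Dry run by default: the command is printed, not executed. Set APPLY=1 to run it.
run() { if [ "${APPLY:-0}" = 1 ]; then "$@"; else printf '%s\n' "$*"; fi; }

# Run on both nodes; use the same password on each for convenience.
set_hacluster_password() {
  run sudo passwd hacluster
}
set_hacluster_password
```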

Next, start the pcsd service on both nodes:
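On a systemd-based system such as CentOS 7, starting pcsd could be sketched as below. The run helper is a hypothetical dry-run wrapper that prints the command unless APPLY=1 is set.

```shell
# Dry run by default: the command is printed, not executed. Set APPLY=1 to run it.
run() { if [ "${APPLY:-0}" = 1 ]; then "$@"; else printf '%s\n' "$*"; fi; }

# Run on both nodes.
start_pcsd() {
  run sudo systemctl start pcsd
}
start_pcsd
```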

Since we will configure all nodes from one point, we need to authenticate on all nodes before we are allowed to change the configuration. Use the previously configured hacluster user and password to do this.
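A sketch of the authentication step, assuming the pcs 0.9 series that ships with CentOS 7 and the hostnames node01 and node02. The command prompts for the hacluster password; the run helper is a hypothetical dry-run wrapper that prints the command unless APPLY=1 is set.

```shell
# Dry run by default: the command is printed, not executed. Set APPLY=1 to run it.
run() { if [ "${APPLY:-0}" = 1 ]; then "$@"; else printf '%s\n' "$*"; fi; }

# Run on node01 only.
authenticate_nodes() {
  run sudo pcs cluster auth node01 node02 -u hacluster
}
authenticate_nodes
```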

From here, we can control the cluster by using PCS from node01. It’s no longer required to repeat all commands on both nodes (imagine you need to configure a 100-node cluster without automation).

Create the cluster and add nodes

We’ll start by adding both nodes to a cluster named cluster_web:
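Creating the cluster could be sketched as follows, assuming the pcs 0.9 syntax of CentOS 7 and the hostnames node01 and node02. The run helper is a hypothetical dry-run wrapper that prints the command unless APPLY=1 is set.

```shell
# Dry run by default: the command is printed, not executed. Set APPLY=1 to run it.
run() { if [ "${APPLY:-0}" = 1 ]; then "$@"; else printf '%s\n' "$*"; fi; }

# Run on node01 only; pcs distributes the configuration to the other node.
create_cluster() {
  run sudo pcs cluster setup --name cluster_web node01 node02
}
create_cluster
```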

The above command creates the cluster node configuration in /etc/corosync/corosync.conf. The syntax in that file is quite readable in case you would like to automate/script this.

After creating the cluster and adding nodes to it, we can start it. The cluster won’t do a lot yet since we didn’t configure any resources.
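Starting the cluster on all nodes at once could look like this. The run helper is a hypothetical dry-run wrapper that prints the command unless APPLY=1 is set.

```shell
# Dry run by default: the command is printed, not executed. Set APPLY=1 to run it.
run() { if [ "${APPLY:-0}" = 1 ]; then "$@"; else printf '%s\n' "$*"; fi; }

# Run on node01 only; --all starts the cluster on every configured node.
start_cluster() {
  run sudo pcs cluster start --all
}
start_cluster
```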

You could also start the pacemaker and corosync services on both nodes (as will happen at boot time) to accomplish this.

To check the status of the cluster after starting it:

To check the status of the nodes in the cluster:
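Both status checks can be sketched with pcs as follows. The run helper is a hypothetical dry-run wrapper that prints the commands unless APPLY=1 is set.

```shell
# Dry run by default: commands are printed, not executed. Set APPLY=1 to run them.
run() { if [ "${APPLY:-0}" = 1 ]; then "$@"; else printf '%s\n' "$*"; fi; }

check_cluster_status() {
  # Overall cluster state
  run sudo pcs status cluster
  # Online, standby and offline nodes
  run sudo pcs status nodes
}
check_cluster_status
```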

Cluster configuration

To check the configuration for errors, and there still are some:
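One way to run this check is Pacemaker’s crm_verify tool, where -L checks the live cluster and -V makes the output verbose. The run helper is a hypothetical dry-run wrapper that prints the command unless APPLY=1 is set.

```shell
# Dry run by default: the command is printed, not executed. Set APPLY=1 to run it.
run() { if [ "${APPLY:-0}" = 1 ]; then "$@"; else printf '%s\n' "$*"; fi; }

verify_configuration() {
  run sudo crm_verify -L -V
}
verify_configuration
```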

The check reports an error regarding STONITH (Shoot The Other Node In The Head), which is a mechanism to ensure that you don’t end up with two nodes that both think they are active and claim to be the service and virtual IP owner, also called a split-brain situation. Since we have a simple cluster, we’ll just disable the stonith option:
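Disabling STONITH is done through a cluster property. A minimal sketch; the run helper is a hypothetical dry-run wrapper that prints the command unless APPLY=1 is set.

```shell
# Dry run by default: the command is printed, not executed. Set APPLY=1 to run it.
run() { if [ "${APPLY:-0}" = 1 ]; then "$@"; else printf '%s\n' "$*"; fi; }

disable_stonith() {
  run sudo pcs property set stonith-enabled=false
}
disable_stonith
```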

While configuring the behavior of the cluster, we can also configure the quorum settings. The quorum describes the minimum number of nodes in the cluster that need to be active in order for the cluster to be available. This can be handy in a situation where a lot of nodes provide simultaneous computing power. When the number of available nodes is too low, it’s better to stop the cluster rather than deliver a non-working service. By default, the cluster has quorum only when more than half of the nodes are active, in other words when the total number of nodes is smaller than twice the number of active nodes. For a two-node cluster that means that both nodes need to be available in order for the cluster to be available. In our case this would completely defeat the purpose of the cluster.
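The quorum rule can be illustrated with a little arithmetic. The has_quorum function below is a hypothetical helper, not part of Pacemaker: it succeeds when more than half of the nodes are active, i.e. when the total is smaller than twice the number of active nodes.

```shell
# has_quorum TOTAL ACTIVE: succeeds when more than half of the nodes are active.
has_quorum() {
  [ "$1" -lt $((2 * $2)) ]
}

# A two-node cluster with both nodes up, then with one node down.
for active in 2 1; do
  if has_quorum 2 "$active"; then
    echo "2 nodes, $active active: quorum"
  else
    echo "2 nodes, $active active: no quorum"
  fi
done
```

With one of two nodes down, the check fails, which is exactly the failover situation this cluster is supposed to survive.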

To ignore a low quorum:
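A sketch using the no-quorum-policy cluster property. The run helper is a hypothetical dry-run wrapper that prints the command unless APPLY=1 is set.

```shell
# Dry run by default: the command is printed, not executed. Set APPLY=1 to run it.
run() { if [ "${APPLY:-0}" = 1 ]; then "$@"; else printf '%s\n' "$*"; fi; }

ignore_low_quorum() {
  run sudo pcs property set no-quorum-policy=ignore
}
ignore_low_quorum
```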

Virtual IP address

The next step is to actually let our cluster do something. We will add a virtual IP to our cluster. This virtual IP is the IP address that will be contacted to reach the services (the webserver in our case). A virtual IP is a resource. To add the resource:

Once created, the resource should be reported as started, so the new virtual IP address should be reachable.
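Adding the virtual IP could be sketched with the IPaddr2 resource agent. The resource name virtual_ip, the /32 netmask and the 30-second monitor interval are assumptions for illustration; the run helper is a hypothetical dry-run wrapper that prints the commands unless APPLY=1 is set.

```shell
# Dry run by default: commands are printed, not executed. Set APPLY=1 to run them.
run() { if [ "${APPLY:-0}" = 1 ]; then "$@"; else printf '%s\n' "$*"; fi; }

add_virtual_ip() {
  # Create the resource (name, netmask and interval are illustrative choices)
  run sudo pcs resource create virtual_ip ocf:heartbeat:IPaddr2 ip=192.168.202.100 cidr_netmask=32 op monitor interval=30s
  # List the resources and their state
  run sudo pcs status resources
}
add_virtual_ip
```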

To see who is the current owner of the resource/virtual IP:
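The full cluster status lists each resource together with the node it is started on, so one possible way to find the owner is the sketch below. The run helper is a hypothetical dry-run wrapper that prints the command unless APPLY=1 is set.

```shell
# Dry run by default: the command is printed, not executed. Set APPLY=1 to run it.
run() { if [ "${APPLY:-0}" = 1 ]; then "$@"; else printf '%s\n' "$*"; fi; }

show_resource_owner() {
  run sudo pcs status
}
show_resource_owner
```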

Apache webserver configuration

Once our virtual IP is up and running, we will install and configure the service that we want to make highly available on both nodes: Apache. To start, install Apache and configure a simple static webpage on each node, with different content per node. This is just temporary, to check that our cluster works. Later, the webpages on node 01 and node 02 should be synchronized in order to serve the same website regardless of which node is active.

Install Apache on both nodes:

Make sure that the firewall allows traffic through TCP-port 80:
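The two steps above (installing Apache and opening port 80) could be sketched as follows, assuming the httpd package and the iptables service. The run helper is a hypothetical dry-run wrapper that prints the commands unless APPLY=1 is set.

```shell
# Dry run by default: commands are printed, not executed. Set APPLY=1 to run them.
run() { if [ "${APPLY:-0}" = 1 ]; then "$@"; else printf '%s\n' "$*"; fi; }

# Run on both nodes.
install_apache() {
  run sudo yum install httpd
  # Allow HTTP traffic and persist the rule
  run sudo iptables -I INPUT -p tcp -m state --state NEW -m tcp --dport 80 -j ACCEPT
  run sudo service iptables save
}
install_apache
```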

In order for the cluster to check if Apache is still active and responding on the active node, we need to create a small test mechanism. For that, we will add a status-page that will be regularly queried. The page won’t be available to the outside in order to avoid getting the status of the wrong node.

Create a file /etc/httpd/conf.d/serverstatus.conf with the following contents on both nodes:
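A possible version of that file, written here through a shell here-document. It is a sketch: the directives assume that mod_status is loaded and use the older Order/Deny/Allow access syntax (on Apache 2.4, "Require local" is the native equivalent). By default the file is written to a scratch directory so it can be reviewed first; CONF_DIR is a hypothetical variable for this example.

```shell
# Writes serverstatus.conf into CONF_DIR (a scratch directory unless overridden).
# On the nodes: CONF_DIR=/etc/httpd/conf.d, run as root.
CONF_DIR=${CONF_DIR:-$(mktemp -d)}

cat > "$CONF_DIR/serverstatus.conf" <<'EOF'
# Status page, only reachable from the node itself
Listen 127.0.0.1:80
<Location /server-status>
    SetHandler server-status
    Order deny,allow
    Deny from all
    Allow from 127.0.0.1
</Location>
EOF

echo "wrote $CONF_DIR/serverstatus.conf"
```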

Disable the current Listen-statement in the Apache configuration in order to avoid trying to listen multiple times on the same port.
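Commenting out the Listen statement can be done with sed. The sketch below demonstrates it on a scratch file; disable_listen is a hypothetical helper, and on the nodes the file to pass would be the main Apache configuration (/etc/httpd/conf/httpd.conf on CentOS), edited as root.

```shell
# disable_listen FILE: comment out every line that starts with "Listen".
disable_listen() {
  tmp=$(mktemp)
  # Write through a temporary file, then copy back to keep the file's permissions
  sed 's/^Listen/#Listen/' "$1" > "$tmp" && cat "$tmp" > "$1"
  rm -f "$tmp"
}

# Demonstration on a scratch file instead of the real configuration
demo=$(mktemp)
printf 'ServerRoot "/etc/httpd"\nListen 80\n' > "$demo"
disable_listen "$demo"
cat "$demo"
```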

Start Apache on both nodes and verify if the status page is working:
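Starting Apache and querying the status page locally could look like this. It is a sketch that assumes curl is installed; the run helper is a hypothetical dry-run wrapper that prints the commands unless APPLY=1 is set.

```shell
# Dry run by default: commands are printed, not executed. Set APPLY=1 to run them.
run() { if [ "${APPLY:-0}" = 1 ]; then "$@"; else printf '%s\n' "$*"; fi; }

# Run on both nodes.
start_and_check_apache() {
  run sudo systemctl start httpd
  # -I requests only the response headers; a 200 status means the page works
  run curl -I http://127.0.0.1/server-status
}
start_and_check_apache
```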

Put a simple webpage in the document-root of the Apache server that contains the node name in order to know which one of the nodes we reach. This is just temporary.
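Such a page can be generated from the node name. In this sketch, write_test_page is a hypothetical helper and the page is written to a scratch directory; on the nodes, the directory would be Apache’s document root (/var/www/html by default on CentOS).

```shell
# write_test_page DIR: create an index.html that shows this node's name.
write_test_page() {
  printf '<html><h1>%s</h1></html>\n' "$(uname -n)" > "$1/index.html"
}

# Demonstration in a scratch directory instead of the real document root
docroot=$(mktemp -d)
write_test_page "$docroot"
cat "$docroot/index.html"
```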

Let the cluster control Apache

Now we will stop the webserver on both nodes. From now on, the cluster is responsible for starting and stopping it. First we need to enable Apache to listen to the outside world again (remember, we disabled the Listen-statement in the default configuration). Since we want our website to be served on the virtual IP, we will configure Apache to listen on that IP address.

First stop Apache:

Then configure where to listen:
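Both steps, sketched under the same assumptions as before: the run helper is a hypothetical dry-run wrapper (prints unless APPLY=1), and listen_on_virtual_ip is a hypothetical helper that is demonstrated on a scratch file. On both nodes, the file to pass would be /etc/httpd/conf/httpd.conf, edited as root.

```shell
# Dry run by default: the command is printed, not executed. Set APPLY=1 to run it.
run() { if [ "${APPLY:-0}" = 1 ]; then "$@"; else printf '%s\n' "$*"; fi; }

# listen_on_virtual_ip FILE: append a Listen directive for the virtual IP.
listen_on_virtual_ip() {
  echo "Listen 192.168.202.100:80" >> "$1"
}

# Stop Apache; from now on the cluster starts and stops it
run sudo systemctl stop httpd

# Demonstration on a scratch file instead of the real configuration
conf=$(mktemp)
listen_on_virtual_ip "$conf"
cat "$conf"
```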

Now that Apache is ready to be controlled by our cluster, we’ll add a resource for the webserver. Remember that we only need to do this from one node since all nodes are configured by PCS:
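A sketch of the resource definition, using the apache resource agent with the status page as health check. The resource name webserver and the one-minute monitor interval are illustrative; the run helper is a hypothetical dry-run wrapper that prints the command unless APPLY=1 is set.

```shell
# Dry run by default: the command is printed, not executed. Set APPLY=1 to run it.
run() { if [ "${APPLY:-0}" = 1 ]; then "$@"; else printf '%s\n' "$*"; fi; }

# Run on one node only.
add_webserver_resource() {
  run sudo pcs resource create webserver ocf:heartbeat:apache configfile=/etc/httpd/conf/httpd.conf statusurl=http://localhost/server-status op monitor interval=1min
}
add_webserver_resource
```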

By default, the cluster will try to balance the resources over the cluster. That means that the virtual IP, which is a resource, will be started on a different node than the webserver-resource. Starting the webserver on a node that isn’t the owner of the virtual IP will cause it to fail since we configured Apache to listen on the virtual IP. In order to make sure that the virtual IP and webserver always stay together, we can add a constraint:
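A colocation constraint with a score of INFINITY keeps two resources on the same node. A sketch, assuming the resources are named webserver and virtual_ip; the run helper is a hypothetical dry-run wrapper that prints the command unless APPLY=1 is set.

```shell
# Dry run by default: the command is printed, not executed. Set APPLY=1 to run it.
run() { if [ "${APPLY:-0}" = 1 ]; then "$@"; else printf '%s\n' "$*"; fi; }

keep_resources_together() {
  run sudo pcs constraint colocation add webserver with virtual_ip INFINITY
}
keep_resources_together
```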

To avoid the situation where the webserver would start before the virtual IP is started or owned by a certain node, we need to add another constraint which determines the order of availability of both resources:
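An ordering constraint expresses this directly. A sketch, assuming the resources are named virtual_ip and webserver; the run helper is a hypothetical dry-run wrapper that prints the command unless APPLY=1 is set.

```shell
# Dry run by default: the command is printed, not executed. Set APPLY=1 to run it.
run() { if [ "${APPLY:-0}" = 1 ]; then "$@"; else printf '%s\n' "$*"; fi; }

start_ip_before_webserver() {
  run sudo pcs constraint order virtual_ip then webserver
}
start_ip_before_webserver
```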

When the cluster nodes are not equally powerful machines and you would like the resources to be available on the most powerful machine, you can add another constraint for location:
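A location preference could be sketched like this. The score of 50 is an arbitrary illustrative value, and node01 is assumed to be the more powerful machine; the run helper is a hypothetical dry-run wrapper that prints the command unless APPLY=1 is set.

```shell
# Dry run by default: the command is printed, not executed. Set APPLY=1 to run it.
run() { if [ "${APPLY:-0}" = 1 ]; then "$@"; else printf '%s\n' "$*"; fi; }

prefer_node01() {
  run sudo pcs constraint location webserver prefers node01=50
}
prefer_node01
```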

To look at the configured constraints:
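Without further arguments, pcs constraint lists all location, ordering and colocation constraints. The run helper is a hypothetical dry-run wrapper that prints the command unless APPLY=1 is set.

```shell
# Dry run by default: the command is printed, not executed. Set APPLY=1 to run it.
run() { if [ "${APPLY:-0}" = 1 ]; then "$@"; else printf '%s\n' "$*"; fi; }

list_constraints() {
  run sudo pcs constraint
}
list_constraints
```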

After configuring the cluster with the correct constraints, restart it and check the status:
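A restart of the whole cluster followed by a status check could be sketched as follows. The run helper is a hypothetical dry-run wrapper that prints the commands unless APPLY=1 is set.

```shell
# Dry run by default: commands are printed, not executed. Set APPLY=1 to run them.
run() { if [ "${APPLY:-0}" = 1 ]; then "$@"; else printf '%s\n' "$*"; fi; }

restart_cluster() {
  run sudo pcs cluster stop --all
  run sudo pcs cluster start --all
  run sudo pcs status
}
restart_cluster
```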

As you can see, the virtual IP and the webserver are both running on node01. If all goes well, you should be able to reach the website on the virtual IP address (192.168.202.100):

[Figure cluster_node01: the website on the virtual IP, served by node01]

If you want to test the failover, you can stop the cluster for node01 and see if the website is still available on the virtual IP:
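A failover test along those lines, as a sketch: stop the cluster services on node01 only and look at the status afterwards. The run helper is a hypothetical dry-run wrapper that prints the commands unless APPLY=1 is set.

```shell
# Dry run by default: commands are printed, not executed. Set APPLY=1 to run them.
run() { if [ "${APPLY:-0}" = 1 ]; then "$@"; else printf '%s\n' "$*"; fi; }

test_failover() {
  # Stop the cluster on node01; the resources should move to node02
  run sudo pcs cluster stop node01
  run sudo pcs status
}
test_failover
```

Running pcs cluster start node01 afterwards brings the node back into the cluster.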

A refresh of the same URL gives us the webpage served by node02. Since we created small but different webpages on both nodes, we can see where we end up:

[Figure cluster_node02: the website on the virtual IP, served by node02]

Enable the cluster-components to start up at boot

To start the cluster setup and the components that are related to it, you simply need to enable the services to run when the machine boots:
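Enabling the three services with systemd could look like this. The run helper is a hypothetical dry-run wrapper that prints the commands unless APPLY=1 is set.

```shell
# Dry run by default: commands are printed, not executed. Set APPLY=1 to run them.
run() { if [ "${APPLY:-0}" = 1 ]; then "$@"; else printf '%s\n' "$*"; fi; }

# Run on both nodes.
enable_cluster_services() {
  run sudo systemctl enable pcsd
  run sudo systemctl enable corosync
  run sudo systemctl enable pacemaker
}
enable_cluster_services
```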

Unfortunately, after rebooting the system, the cluster is not starting and the following messages appear in /var/log/messages:

Apparently, this is a known bug, which is described in Red Hat Bugzilla bug #1030583.

It seems that the interfaces are reporting that they are available to systemd, and the target network-online is reached, while they still need some time in order to be used.

A possible workaround (not so clean), is to delay the Corosync start for 10 seconds in order to be sure that the network interfaces are available. To do so, edit the systemd-service file for corosync: /usr/lib/systemd/system/corosync.service

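An ExecStartPre line that sleeps for 10 seconds (ExecStartPre=/usr/bin/sleep 10) was added to the [Service] section, before the ExecStart line, to get the desired delay when starting Corosync.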

After changing the service files (customized files should actually reside in /etc/systemd/system), reload the systemd daemon:
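The reload itself is a single systemctl call. The sketch below also shows a variant of the workaround that keeps the packaged unit file untouched: a drop-in under /etc/systemd/system that adds the same 10-second delay. The drop-in name delay.conf and the UNIT_DIR variable are assumptions for this example; the file is written to a scratch directory by default, and the run helper is a hypothetical dry-run wrapper that prints the command unless APPLY=1 is set.

```shell
# Dry run by default: the command is printed, not executed. Set APPLY=1 to run it.
run() { if [ "${APPLY:-0}" = 1 ]; then "$@"; else printf '%s\n' "$*"; fi; }

# On the nodes: UNIT_DIR=/etc/systemd/system, run as root.
UNIT_DIR=${UNIT_DIR:-$(mktemp -d)}

# Drop-in that delays the Corosync start by 10 seconds
mkdir -p "$UNIT_DIR/corosync.service.d"
cat > "$UNIT_DIR/corosync.service.d/delay.conf" <<'EOF'
[Service]
ExecStartPre=/usr/bin/sleep 10
EOF

# Make systemd pick up the changed unit configuration
run sudo systemctl daemon-reload
```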

After rebooting the system, you should see that the cluster started as it should and that the resources are started automatically.

Now you have a two-node web cluster that enables you to reach a much higher uptime. The next and last thing to do is to ensure that both webservers serve the same webpages to the client. In order to do so, you can configure DRBD. More about that in this post: Use DRBD in a cluster with Corosync and Pacemaker on CentOS 7

75 thoughts on “Building a high-available failover cluster with Pacemaker, Corosync & PCS”

  1. Hi,
    Thank you for publishing a nice and working how tos that helps a (newbies) like me to start learning new technology using linux.

    I tried and followed all your configuration setup it works perfectly except when we reboot one of the nodes.

    [root@node1 ~]# pcs cluster status
    Error: cluster is not currently running on this node

    [root@node1 ~]# pcs status nodes
    Error: error running crm_mon, is pacemaker running?

    [root@node1 ~]# service pacemaker start
    Redirecting to /bin/systemctl start pacemaker.service

    [root@node1 ~]# pcs status cluster
    Cluster Status:
    Last updated: Thu Sep 4 16:15:16 2014
    Last change: Thu Sep 4 05:44:06 2014 via cibadmin on node1
    Stack: corosync
    Current DC: NONE
    2 Nodes configured
    2 Resources configured

    [root@node1 ~]# pcs status nodes
    Pacemaker Nodes:
    Online: node1
    Standby:
    Offline: node2

    on node2:
    [root@node2 ~]# pcs status cluster
    Error: cluster is not currently running on this node

    [root@node2 ~]# pcs status nodes
    Error: error running crm_mon, is pacemaker running?

    [root@node2 ~]# service pacemaker start
    Redirecting to /bin/systemctl start pacemaker.service

    [root@node2 ~]# pcs status cluster
    Cluster Status:
    Last updated: Thu Sep 4 16:17:08 2014
    Last change: Thu Sep 4 05:43:10 2014 via cibadmin on node1
    Stack: corosync
    Current DC: NONE
    2 Nodes configured
    2 Resources configured

    [root@node2 ~]# pcs status nodes
    Pacemaker Nodes:
    Online: node2
    Standby:
    Offline: node1

    It seems that the configuration and settings do not survive reboot on one of the cluster nodes. I tried this a couple of times but I still got this same problem during reboot.
    Did I missed something from your configuration? Can you help figure out the problem?

    Thanks!
    Edgar

  2. A few things you could check:

    Make sure that Corosync and Pacemaker start at boot (or at least start them both manually) on both nodes:
    $ sudo systemctl enable corosync
    $ sudo systemctl enable pacemaker

    There is a known bug which appears at boot on RHEL 7 or CentOS 7. I reported a workaround in Red Hat Bugzilla bug #1030583 but it’s no longer public.

    The workaround is to let Corosync wait for 10s at boot, so it doesn’t start when the interfaces aren’t completely available (ugly workaround, I know :))

    Change /usr/lib/systemd/system/corosync.service to include the ExecStartPre:

    [Service]
    ExecStartPre=/usr/bin/sleep 10
    ExecStart=/usr/share/corosync/corosync start

    Then, reload systemd:
    $ sudo systemctl daemon-reload

    You can also look in /var/log/pacemaker.log or look for something related in /var/log/messages.

    In case these steps won’t help, I will check to redo the tutorial myself and see if I missed or forgot to write something.

    Keep me posted :)

  3. Hi there,
    Excellent material for cluster in general and how to build it using pcs. This is the first time I have created cluster in Linux [Centos EL v7] and without this step by step blog it would have meant hours of frustration and reading.

    I have got the 2 node cluster up & running, and I can see the webpage [virtual IP] works smoothly with node failover. Only small issue I am facing is, its showing ‘webserver’ as stopped with couple of errors. I tried to dig-in but didn’t get any clue. Any advise would be greatly appreciated.

    [root@centos7 html]# pcs status
    Cluster name: my_cluster
    Last updated: Wed Oct 1 11:22:37 2014
    Last change: Wed Oct 1 11:01:12 2014 via cibadmin on centos7p
    Stack: corosync
    Current DC: centos7p (2) – partition with quorum
    Version: 1.1.10-32.el7_0-368c726
    2 Nodes configured
    2 Resources configured

    Online: [ centos7 centos7p ]

    Full list of resources:

    virtual_ip (ocf::heartbeat:IPaddr2): Started centos7
    webserver (ocf::heartbeat:apache): Stopped

    Failed actions:
    webserver_start_0 on centos7p ‘unknown error’ (1): call=12, status=Timed Out, last-rc-change=’Wed Oct 1 11:04:59 2014′, queued=40004ms, exec=0ms
    webserver_start_0 on centos7 ‘unknown error’ (1): call=12, status=Timed Out, last-rc-change=’Wed Oct 1 11:06:50 2014′, queued=40015ms, exec=0ms

    PCSD Status:
    centos7: Online
    centos7p: Online

    Daemon Status:
    corosync: active/enabled
    pacemaker: active/disabled
    pcsd: active/enabled

    Many thanks,
    -Ashwin

  4. Hi Ashwin,

    Do you see the same output on the second node? (centos7p)
    If your website is reachable via the virtual IP, I expect it to be started over there.

    You could have a look in the log of Apache (/var/log/httpd/error_log) and look there for errors. My guess is that, while Apache is started on the first node, it is still trying to start it on the second node, which causes problems since Apache can’t listen on the virtual IP and port.

    Let me know if you have some more info.

  5. Hi Jensd,

    It is fixed. I had actually bounced both the nodes, when It came back, this time webserver started fine. Looks like reboot fixed it.

    I would like to thank you for this wonderful post and for keeping it simple. I read other posts as well, and it seems there is plenty of useful information on Linux. I enjoyed it, Great job!

    I have book marked your blog :)

    Many thanks
    -Ashwin

  6. Hi,

    I managed to get 2 nodes working with your guide, with some minor modifications. However, I was having trouble getting DRBD to work, since it doesn’t seem to have a version for CentOS 7. I then tried to use a CentOS 6.5 system, but apparently PCS doesn’t work on CentOS 6.5, so now I’m kinda stuck. How can I get a Linux HA setup with DRBD working on CentOS?

    Thanks!

  7. DRBD should work on CentOS 7 in a very similar way as it does on 6.5. A while ago, I started to work on a blog post about DRBD but haven’t found the time to finish it yet. I’ll do my best to get it online this week :)

  8. Hi,

    Thanks for the info shared to the world. It helped lot …
    I too managed to get 2 nodes working with your guide. I am also trying to bringup DRBD on the same CentOS 7.
    If you have completed similar guide for the DRBD . please share the link.

    Thanks in adv,
    Patil

  9. Hi,

    I would like to thank you for this post. It saved my life :) so easy and comprehensive tutorial. Its working for me with slight modification.

    I would like to follow your tutorial for glusterFS too. Let me know if you have published or working on glusterFS too.

    Once again thanks alot :)
    Niki

  10. After running “pcs cluster start --all”, pacemaker runs fine on both nodes. But on node1 it says the other is offline and on node2 it says node1 is offline. Any ideas?

    • Are you sure that Corosync and PCS are started too?
      Could you post me the exact output of sudo pcs status?

    • Sure. Thank you. I was able to get this setup to work just fine locally but I’m having a hard time with my VPS provider. I have a local IP and a public IP. In etc/hosts “nginx1” and “nginx2” map to their respective public IPs.

      ——
      [root@nginx1 network-scripts]# pcs status
      Cluster name: rocketcluster
      Last updated: Thu Dec 4 23:10:45 2014
      Last change: Thu Dec 4 05:44:15 2014
      Stack: cman
      Current DC: nginx1 – partition with quorum
      Version: 1.1.11-97629de
      2 Nodes configured
      0 Resources configured

      Online: [ nginx1 ]
      OFFLINE: [ nginx2 ]

      Full list of resources:
      —–
      [root@nginx2 cluster]# pcs status
      Cluster name: rocketcluster
      WARNING: no stonith devices and stonith-enabled is not false
      Last updated: Thu Dec 4 23:10:56 2014
      Last change: Thu Dec 4 05:45:13 2014
      Stack: cman
      Current DC: nginx2 – partition with quorum
      Version: 1.1.11-97629de
      2 Nodes configured
      0 Resources configured

      Node nginx1: UNCLEAN (offline)
      Online: [ nginx2 ]

      Full list of resources:
      —–

      • This looks like a firewall issue (or something like SELinux or Apparmor). Can you telnet to the ports that I opened in the firewall between the hosts? Maybe test it by temporarily disabling security measures.

  11. Hi Jensd,

    Excellent publishing, thanks for sharing you knowledge.
    I’ve followed the steps and it runs perfect but I’m getting some weird messages in /var/log/messages

    tail -f /var/log/messages
    ———————
    Jan 15 11:00:24 web01.int01.com systemd: pacemaker.service: Got notification message from PID 2996, but reception only permitted for PID 2749
    Jan 15 11:00:24 web01.int01.com systemd: pacemaker.service: Got notification message from PID 2996, but reception only permitted for PID 2749
    Jan 15 11:00:33 web01.int01.com pacemakerd[6329]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
    ——————

    ps -e | grep -E '2996|2749'
    ——————-
    2749 ? 00:00:00 pacemakerd
    2996 ? 00:00:00 httpd
    ——————-

    I have already configured the status-server in apache.

    I found these links about:
    https://access.redhat.com/solutions/1198023
    http://comments.gmane.org/gmane.linux.highavailability.pacemaker/19905

    also I have asked to redhat support about a bugzilla case (it looks like the same problem) and they told me it’s not solved yet.

    Do you know why it is happening? and if do you have some temporal solution?
    I got this message every 10 secs, I would like to remove them.

    Thanks in advance.

  12. Very good tutorial.

    In my CentOS 7 sandbox, I have Firewalld instead of iptables, which means that I couldn’t use the instructions in this article for setting the firewall. To enable the firewall for the required ports, I have run:

    sudo firewall-cmd --permanent --add-service=high-availability
    sudo firewall-cmd --add-service=high-availability

    These commands created the following configuration file:

    /usr/lib/firewalld/services/high-availability.xml

    Red Hat High Availability
    This allows you to use the Red Hat High Availability (previously named Red Hat Cluster Suite). Ports are opened for corosync, pcsd, pacemaker_remote and dlm.

    I also needed to run the following commands on both machines in order to have the cluster running after a reboot:

    sudo systemctl enable pcsd.service
    sudo systemctl enable pacemaker.service
    sudo systemctl enable corosync.service
    sudo systemctl enable corosync-notifyd.service

  13. Great tutorial Jens, really appreciate all the effort! As I’m on CentOS 7, I needed the firewall-cmd commands from Jorge to make it all work (thanks for that!) I also see the same weird messages Miranda mentioned, but apart from that the cluster seems to be working fine.

  14. Thanks for very useful post.
    But I have a question: how can I create resource and monitoring a daemon instead of agent list in (pcs resource agents ocf:heartbeat) as you did with apache
    sudo pcs resource create webserver ocf:heartbeat:apache configfile=/etc/httpd/conf/httpd.conf statusurl="http://localhost/server-status" op monitor interval=1min

    For example I want pacemaker monitoring an netxms daemon

    Thanks!

    • Hi,

      You can get a list of possible resources supported with: pcs resource list. If there’s no direct support for the type you want to monitor, you can monitor the systemd service if needed. It depends a little on the type of service you have.

      • Thanks so much,
        I have it run now with resource like this with “myinitd” is startup scrip in /etc/init.d
        sudo pcs resource create server lsb:myinitd op monitor interval="30" timeout="60" \
        op start interval="0" timeout="60" \
        op stop interval="0" timeout="60" \
        meta target-role="Started"
        But with this config, pacemaker doesn’t see when service stop. That’s it doesn’t know when I type
        /etc/init.d/myinitd stop/start
        here pacemaker just know some how at node level. How to config it at service level?

        • Hi, I am also facing similar issue with my custom “ARcluster” in /etc/init.d. service is not being failed over to another node when /etc/init.d/ARCluster stop/start.
          pcs resource show ARCluster
          Resource: ARCluster (class=lsb type=arcluster)
          Meta Attrs: target-role=Started
          Operations: monitor interval=60s (ARCluster-monitor-interval-60s)

          Please let me know what changes to be done If you were able to solve your issue

  15. Hi,

    I could success after first try, thanks for tutorial.
    I will go on with DRBD.

    But I have question;
    We got little webservre-webpages, where there are few crontabs which is due webpage content. Querying data from other source to show then in webpage.
    So is it possible inert these crontabs into cluster. So if node01 is down then node02 could go on with crons.

    Thanks.

    • Don’t know if it’s possible to control cron with corosync but it seems like a weird idea. In your place, I would let the cronjob check if it’s on the active node (for example something with the response code of curl http://localhost/server-status) and then execute or skip execution.

      • Hi and thanks for great tutorial. As an information I am thinking deploy two ESXi hypervisor (free :) ) and install 1 node vm in each Hypervisor and go on with your tutorials. Now I need DRBD :).

        Anyway, you meant I should check node status, let’s say if it is no active then some script should disable cron on node1 and enable node2, am I right?

        Thanks.

        • I wouldn’t disable cron because it’s used for other useful stuff too (like logrotate). Personally, I would add the cron script on both nodes and include the check for an active node in the beginning of the script. Both scripts would get executed by cron but only one actually makes some changes because it would pass the check.

          For example:
          #!/bin/bash

          # Only the active node runs Apache, so only there does the status page answer.
          # -s: silent, -f: fail on HTTP errors, -I: headers only
          if curl -s -f -I -o /dev/null http://localhost/server-status; then
            echo "active node: execute something here"
          fi
          exit 0

          • Hi,

            Oh great that is simple and useful idea.
            All crons will run, but will go on if localhost is the main node (I mean active).

            Well let me make some dynamic webpage which will be changed by cron task.
            Let’s see result.

            I will update you.

            THANK YOU.

          • One additional question:
            If I will have virtual hosts in apache will it affect cluster? I mean I have to just add vhost line in apache conf for both nodes, have not I?

  16. RovshanP, that’s correct. Just make sure that you keep the config equal on both nodes. With DRBD this is also possible.

    • Hi, again. Sorry seems I am replying to you too much.
      Firstly crontab option worked greatly – with curl check.

      I will check vhost option later on.
      Also I am sure I will be able integrate tomcat to pcs also.

    • You should check if there’s a resource for postfix or use a generic systemd or init-module that checks the status of the service.

  17. Hi, how can I run the command “pcs cluster start” at startup? Pacemaker and Corosync start well on CentOS 7, but the cluster only operates properly after running that command.

    Thanks.

    Regards.

    • Hi,

      When starting corosync automatically at boot, this shouldn’t be necessary. Did you read and apply the remark about the bug with systemd?

  18. Good guide, but I’m running both servers and all was fine until the step where I must start the cluster: “pcs cluster start --all”. Besides a long waiting time, the following error appears on the terminal:
    node01: Starting Cluster…
    Redirecting to /bin/systemctl start corosync.service
    Job for corosync.service failed. See ‘systemctl status corosync.service’ and ‘journalctl -xn’ for details.
    node02: Starting Cluster…
    Redirecting to /bin/systemctl start corosync.service
    Job for corosync.service failed. See ‘systemctl status corosync.service’ and ‘journalctl -xn’ for details.
    I had googled this error, but can’t find any fix or solution.
    I disabled firewall and SELinux, changed the corosync file in system folder and reload the daemons, but the errors persist.
    I appreciate any help to continue with the steps.
    Thanks in advance!

    • You should check the output of the daemon with systemctl status corosync -l ,check in /var/log/pacemaker.log or look for something related in /var/log/messages.

      • Hi,
        I resolved the problem. The error appeared why I hadn’t configured in nodes /etc/hosts its own ip-hostames, only I had configured the ip-hostaname of each node with respect to the other.
        So, I have another question. I wanna to know why disable stonith in a simple cluster (2 nodes)?
        Thanks in advance!

  19. Hi, your tutorial is great and it saved me. I use server status for crontab schedules too.
    So, my question is there any tutorial for mysql (mariadb), let’s say I will user NFS for datastore and manage mysql resource with pcs…

    Thanks you in advance.

  20. Hi! Firstly, thank a lot to your post and I would like to ask you some questions!
    Number 1. After the node 1 dead, the service will switch to node 2 for using –> It’s ok but after node 1 up again, the system switch back to node 1 for using. So how can I configure to keep node 2 for using?
    Number 2. I would like to add one more node to the cluster to make a judge. I mean the third node can judge who is the primary node and who can be using. How can I get it?
    Thanks sir so much! :)

  21. Thx for this great tutorial, it really well written.
    BTW people, what are your experiences with corosync?
    I must say it’s really disappointing.
    First thing I’ve noticed is that for some reason you can’t do a hb_takeover you have to play around with un/standby or cluster stop commands.
    And if you try to do that nothing happens, why? Because somewhere there is a default timeout of around 5 minutes, still have to figure it out where it is…..
    And after it finally dies, did the other node takeover, noooooo it just sits there playing stupid and saying it’s online. Does the log file say why it didn’t take over? No ;)

    So out of the box, you would expect a service that is rock solid, doesn’t freeze, and does a takeover in a second or so at best, but you get basically get another version of heartbeat with all its flaws….

  22. Hi.

    I would want the pacemaker to restart my apache server when my server goes down. Over here my node is active and has not undergone a failure, if my apache server goes down i want the pacemaker to restart it. How can i do it?

  23. Hi nice tutorial

    I have one question: what if one server goes down? For example, if node01 goes down, will node02 take over?

    Regards

    • Hi,

      If all is configured as it should, node02 should take over and clients shouldn’t notice any downtime.

  24. Hi,

    This might seem like a silly question, but I am new to Linux administration. I am trying to set up a high availability cluster on Red Hat. I have tried your tutorial and I am currently stuck at a particular point.
    I am trying to add a few custom OCF resource agents. These were not coded by me; they were provided as part of a suite by IBM. According to this link http://www.linux-ha.org/doc/dev-guides/_installing_and_packaging_resource_agents.html I have to place the OCF resource files in the following location: “/usr/lib/ocf/resource.d/”. I placed them under /usr/lib/ocf/resource.d/ibm .

    When I run the command pcs resource providers, it lists the ibm folder along with heartbeat, and when I run pcs resource agents ocf:ibm, it lists all the resource agents under that folder.

    However, when I try to add a resource to the cluster with pcs resource create, using the agents I installed under ibm, it gives me an error: Unable to create resource , it is not installed on this system (use --force to override)

  25. hi
    very good tutorial….thank you
    when i get
    [root@node1 /]# pcs status cluster
    Error: cluster is not currently running on this node
    and
    [root@node1 /]# systemctl status pacemaker.service
    pacemaker.service – Pacemaker High Availability Cluster Manager
    Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; enabled)
    Active: failed (Result: start-limit) since Sat 2015-09-19 03:56:42 EDT; 35s ago
    Process: 13395 ExecStart=/usr/sbin/pacemakerd -f (code=exited, status=127)
    Main PID: 13395 (code=exited, status=127)

    Sep 19 03:56:42 node1 systemd[1]: pacemaker.service: main process exited, code=exited, status=127/n/a
    Sep 19 03:56:42 node1: Unit pacemaker.service entered failed state.
    Sep 19 03:56:42 node1[1]: pacemaker.service holdoff time over, scheduling restart.
    Sep 19 03:56:42 node1: Stopping Pacemaker High Availability Cluster Manager…
    Sep 19 03:56:42 node1: Starting Pacemaker High Availability Cluster Manager…
    Sep 19 03:56:42 node1: pacemaker.service start request repeated too quickly, refusing to start.
    Sep 19 03:56:42 node1[1]: Failed to start Pacemaker High Availability Cluster Manager.
    Sep 19 03:56:42 node1[1]: Unit pacemaker.service entered failed state.

  26. [root@node2 ~]# systemctl start pacemaker.service

    but

    [root@node2 ~]# systemctl status pacemaker.service
    pacemaker.service – Pacemaker High Availability Cluster Manager
    Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; enabled)
    Active: failed (Result: start-limit) since Sat 2015-09-19 04:08:26 EDT; 5s ago
    Process: 11697 ExecStart=/usr/sbin/pacemakerd -f (code=exited, status=127)
    Main PID: 11697 (code=exited, status=127)

    Sep 19 04:08:26 node2[1]: pacemaker.service: main process exited, code=exited, status=127/n/a
    Sep 19 04:08:26 node2[1]: Unit pacemaker.service entered failed state.
    Sep 19 04:08:26 node2[1]: pacemaker.service holdoff time over, scheduling restart.
    Sep 19 04:08:26 node2[1]: Stopping Pacemaker High Availability Cluster Manager…
    Sep 19 04:08:26 node2[1]: Starting Pacemaker High Availability Cluster Manager…
    Sep 19 04:08:26 node2[1]: pacemaker.service start request repeated too quickly, refusing to start.
    Sep 19 04:08:26 node2[1]: Failed to start Pacemaker High Availability Cluster Manager.
    Sep 19 04:08:26 node2[1]: Unit pacemaker.service entered failed state.

  27. Thanks for this tutorial.
    Like Ashwin, I have the same problem. The error message:
    Failed Actions:
    * webserver_start_0 on node01 'unknown error' (1): call=11, status=Timed Out, exitreason='none',
    last-rc-change='Sat Sep 26 17:37:23 2015', queued=0ms, exec=40003ms
    * webserver_start_0 on node02 'unknown error' (1): call=13, status=Timed Out, exitreason='none',
    last-rc-change='Sat Sep 26 17:38:02 2015', queued=0ms, exec=40004ms
    In fact, everything worked when I finished the tutorial, but after I rebooted I have this problem. I added the Corosync delay, but the result is the same.
    Do you have any idea?
    Thanks for all,
    Ed.

    • Hi,

      Well, you will have to place both servers in the same network, meaning the same layer 2 segment. For example, say your servers have the IP addresses 10.64.38.68 and 10.64.38.69. If 10.64.38.70 is free, you can use it as the virtual IP.

  28. Hi Jen,

    What if my cluster nodes are located in different networks but can reach each other? How will the virtual IP work then?

    Thanks

    • The “real” IPs have to be in the same subnet. As far as I know (it’s been a while), the active node responds to the ARP request for the virtual IP with its MAC address. If the nodes were in different subnets/VLANs, they wouldn’t receive the ARP request for the virtual IP…

    • You just need to pick a free IP in the same subnet as the real IPs of both your hosts.

      So for example, if your hosts have the IPs 192.168.0.10/24 & 192.168.0.11/24, you could pick any free IP in that range (for example 192.168.0.12/24).
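
      The subnet rule above can be checked with a few lines of shell before creating the virtual IP resource. This is only a minimal sketch: the function names ip_to_int, network_of and same_subnet are made up for illustration, it handles IPv4 only, and it does not tell you whether the address is actually free.

```shell
#!/bin/sh
# Hypothetical helpers (not part of pcs or Pacemaker), IPv4 only.

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Print the network address (as an integer) of ADDRESS for a PREFIX length.
network_of() {
    mask=$(( (0xFFFFFFFF << (32 - $2)) & 0xFFFFFFFF ))
    echo $(( $(ip_to_int "$1") & mask ))
}

# Usage: same_subnet PREFIX ADDRESS...
# Succeeds only if all addresses fall into the same subnet.
same_subnet() {
    prefix=$1
    shift
    first=$(network_of "$1" "$prefix")
    for addr in "$@"; do
        [ "$(network_of "$addr" "$prefix")" -eq "$first" ] || return 1
    done
    return 0
}

# The two node addresses and the candidate virtual IP from the example above.
if same_subnet 24 192.168.0.10 192.168.0.11 192.168.0.12; then
    echo "virtual IP is in the same subnet as both nodes"
else
    echo "virtual IP is NOT in the same subnet as both nodes"
fi
```

      Passing the prefix length first keeps the call readable when you have more than two nodes: just append their addresses.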

  29. Hi,

    I’m following the tutorial but I get this after performing this command:
    sudo pcs status nodes:

    (on master1)
    Online: master1
    Standby:
    Offline: master2

    (on master2)
    Online: master2
    Standby:
    Offline: master1

    • Hi,

      It looks like master1 can’t communicate with master2. Can the hosts ping each other by hostname? Are you sure all necessary ports are open? (Maybe temporarily unload iptables just to be sure.)
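
      As a first step, you could verify on each host that both node names resolve at all. This is a rough sketch, assuming getent is available (it is on most Linux distributions); node_resolves and report_nodes are hypothetical helper names. A name that resolves is not necessarily reachable, so ping and the port check are still needed afterwards.

```shell
#!/bin/sh
# Hypothetical helper: succeed if NAME resolves through the system resolver
# (/etc/hosts or DNS). This does not test reachability.
node_resolves() {
    getent hosts "$1" > /dev/null 2>&1
}

# Print one status line for every node name given as an argument.
report_nodes() {
    for node in "$@"; do
        if node_resolves "$node"; then
            echo "$node: resolves"
        else
            echo "$node: does NOT resolve"
        fi
    done
}
```

      With the node names from the question, you would run report_nodes master1 master2 on both hosts and compare the results.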

  30. Hi,
    I need some help. My cluster is working perfectly for incoming traffic through the VIP. There is a requirement from the other side that my nodes respond to the requests they receive using the same VIP.
    Thanks

  31. Very clear, many thanks. But I am facing an issue with the webserver resource: I configured the DRBD resource first and then went back to add the webserver resource, but I can’t start it.

    [status=Timed Out, exitreason='Failed to access httpd status page.']

    The result is the same with or without the constraint.

  32. Works like a dream! My only issue was with the webserver resource not working – turned out I had a typo in the Apache config.

    So all good.

    Thank you for a well thought out and useful tutorial

  33. Can I use this approach for serial (RS232) inputs too?
    I have 2 machines getting serial inputs from a fan-out unit and I want to make one machine a hot standby. How can I go about it?

    Regards

  34. Pingback: Cluster Ativo/Passivo com PostgreSQL 9.3, Red Hat 6, pcs, Pacemaker, Corosync e DRBD – Aécio Pires

  35. Hi,
    These are the best instructions I have found, and it runs.
    But I have one question.
    Can I have 2 virtual IPs attached to the nodes?

  36. Having the error when starting apache. Please help.
    Failed Actions:
    * webserver_start_0 on node02 'unknown error' (1): call=38, status=Timed Out, exitreason='none',
    last-rc-change='Wed Apr 27 22:35:55 2016', queued=0ms, exec=40002ms
    * webserver_start_0 on node01 'unknown error' (1): call=44, status=complete, exitreason='Failed to access httpd status page.',
    last-rc-change='Wed Apr 27 22:36:41 2016', queued=0ms, exec=3235ms

  37. Pingback: Add Apache Web Server to HA Cluster – PaceMaker + CoroSnyc + CentOS 7 – Free Software Servers

  38. Hi,
    I’ve implemented an active/passive cluster with Apache on CentOS 7 using Pacemaker & Corosync, exactly as described in this guide (excellent material). Everything related to the cluster operation is working fine. But I have a requirement that I don’t know how to implement. I posted in the CentOS forum, but nobody answered; perhaps somebody here can solve this:
    The problem is that this Apache instance is also going to be our load balancer. We need high availability and the flexibility of Apache’s reload operation. Part of our daily work is adding sites, adding nodes to the balanced applications, etc. So we need a cluster operation that lets us invoke an Apache reload (graceful) whenever we need it without disrupting the service (established user connections, etc.).

    We are using the resource agent ocf::heartbeat:apache, and these operations aren’t available.

    Can anyone help us with that?

    Thanks in advance!
    José Pedro

  39. I’ve implemented an active/passive cluster with Apache on CentOS 7 using Pacemaker & Corosync, exactly as described in this guide (excellent material). Everything related to the cluster operation is working fine, until I need to reboot the primary node.

    Once this happens, the VIP fails over to the second node. However, once the primary node comes back online, the VIP doesn’t fail back and the cluster appears to have a communication issue.

    Running pcs cluster stop --force on both nodes, then pcs cluster start --all (from the first node), clears the issue.

    This is with /usr/lib/systemd/system/corosync.service edited to include:

    ExecStartPre=/usr/bin/sleep 10

    Am I missing something as regards recovery from failure? Do any commands need to be issued in order for node1 to come back (it used to be automatic)?

    pcsd seems to just go wrong on the second node and won’t work with the cluster.

    failover and failback
    el6 pacemaker with crm/pcsd/corosync
    el6 heartbeat with crm
    both work fine

  40. I had to specify the interface to add the virtual IP resource:
    pcs resource create virtual_ip ocf:heartbeat:IPaddr2 ip=192.168.1.3 cidr_netmask=32 nic=eno16777984 op monitor interval=30s

  41. I am facing the error below on CentOS 6.5.

    [root@eave ~]# sudo pcs status
    Cluster name: cluster_web
    Last updated: Mon Nov 7 15:13:26 2016 Last change: Mon Nov 7 15:04:57 2016 by root via cibadmin on eave
    Stack: classic openais (with plugin)
    Current DC: eave (version 1.1.14-8.el6_8.1-70404b0) – partition with quorum
    2 nodes and 1 resource configured, 2 expected votes

    Online: [ eave node1 ]

    Full list of resources:

    virtual_ip (ocf::heartbeat:IPaddr2): Stopped

    Failed Actions:
    * virtual_ip_start_0 on eave 'unknown error' (1): call=13, status=complete, exitreason='none',
    last-rc-change='Mon Nov 7 15:04:57 2016', queued=0ms, exec=59ms
    * virtual_ip_start_0 on node1 'unknown error' (1): call=47, status=complete, exitreason='none',
    last-rc-change='Mon Nov 7 15:05:38 2016', queued=0ms, exec=34ms

    PCSD Status:
    eave: Online
    node1: Online

    [root@eave ~]#

  42. Hi,
    Really nice article. I followed these procedures and I’m facing this issue:
    In the beginning, with one export, everything works fine. I cold reset the nodes one after another and the services are migrated successfully to the other node each time. But when I add another export directory (with a different fsid than the first one), after the first reboot of the active node, the NFS server does not start on either node. The error that I get is “rpcbind is not running”. While tailing /var/log/messages I see a repeating message:
    nfsserver: INFO: Start: rpcbind i:1
    nfsserver: INFO: Start: rpcbind i:2
    nfsserver: INFO: Start: rpcbind i:3
    nfsserver: INFO: Start: rpcbind i:4

    and so on.

    After this, the NFS service never starts again on either node.
    After a fresh restart of both nodes, when I try to add an NFS server resource again, the error that I get is:
    “Failed to start NFS server: /proc/fs/nfsd/threads”. In /var/log/messages I get: ERROR: nfs-mountd is not running.

    Thanks
    George

  43. Hi Guys

    I created a service for pcs, so if I restart the CentOS 7 servers I still have HA as long as one of them is on.

    vim /etc/init.d/pcsservice

    #!/bin/bash
    #
    # chkconfig: 2345 20 80
    # description: some startup script
    #
    # source function library

    #. /etc/rc.d/init.d/functions

    # Start the cluster services for node2.
    start() {
        pcs cluster start node2
        /usr/share/corosync/corosync start
    }

    # Stop the cluster services for node2.
    stop() {
        pcs cluster stop node2
        /usr/share/corosync/corosync stop
    }

    # Show the current cluster status.
    status() {
        pcs status
    }

    case "$1" in
        start)
            start
            ;;
        stop)
            stop
            ;;
        status)
            status
            ;;
        restart)
            stop
            start
            ;;
        *)
            echo "Usage: $0 {start|stop|status|restart}"
            exit 1
            ;;
    esac

    touch /var/lock/subsys/

    chmod 755 /etc/init.d/pcsservice

    chkconfig --add pcsservice
    chkconfig pcsservice on
    chkconfig --list

    Also ensure the ports are open. I had to use
    firewall-cmd --zone=public --add-port=2224/tcp --permanent
    firewall-cmd --zone=public --add-port=5405/udp --permanent
    firewall-cmd --zone=public --add-port=5404/udp --permanent
    firewall-cmd --reload

  44. Hi,

    Very useful tutorial. I have one request: could you please create and share a tutorial on spanning the cluster across locations? For example, I have one machine at a remote location and I want to add that machine to my existing cluster.

    Thanks,
    Jitendra
