HADR and TSA

Implement DB2 high availability disaster recovery in a Tivoli System Automation cluster domain

The step-by-step implementation process

Girish Sundaram (gisundar@in.ibm.com), Senior Database Consultant, IBM India Software Lab Services and Solutions
Girish Sundaram is a Senior Database Consultant with the IBM India Software Lab Services and Solutions team and works closely with various strategic ISV Accounts for implementing mission critical DB2 solutions across various domains. Skilled in multiple database technologies, he specializes in performance tuning, migration, application architecture, optimization and design.

Summary:  Learn step-by-step how to implement IBM® DB2® for Linux®, UNIX®, and Windows® high availability disaster recovery in an IBM Tivoli® System Automation for Multiplatforms cluster domain. You'll derive the twin benefits of a highly reliable and available cluster and a robust, seamless database failover in case of a database crash. This configuration provides the foolproof, 24/7 system availability essential for business critical installations. The example setup uses Red Hat Linux and DB2 Universal Database™ V8.2 but applies also to later versions of DB2.


Date:  12 Apr 2007
Level:  Advanced
Also available in:   Chinese


Introduction and overview

In today's world, where businesses serve customers around the globe 24 hours a day, 7 days a week, customers expect their computing systems to be 100% reliable. DB2 for Linux, UNIX, and Windows has always been at the forefront of databases in providing such industrial-strength reliability. In DB2 UDB V8.2, DB2 introduced two new features, high availability disaster recovery (HADR) and automatic client reroute, that give customers further options for implementing high availability. By duplicating the workload of the database to a separate site, these features protect users from production downtime in the event of a local hardware failure or a catastrophic site failure. These features ship as part of DB2 UDB Enterprise Server Edition or the DB2 Enterprise 9 standard package.

HADR as a technology has been available in Informix Dynamic Server (IDS) since the mid-1990s; with the acquisition of Informix, it made its way into DB2 in version 8.2. The easiest way to understand HADR is as a pair of servers that are kept in sync with each other at the database level. The primary server of the pair interacts with the end user's application and receives transactions, while the standby server keeps itself in sync with the primary by applying the transactions shipped directly from the primary server's log buffer. If the primary server fails, the standby can take over the workload very quickly (in most cases in under 30 seconds). HADR also supports rolling upgrades of the database or OS software, allowing you to apply fixes without significantly impacting your production system.

Tivoli System Automation (TSA) for Multiplatforms is designed to provide high availability for critical business applications and middleware through policy-based self-healing that is easily tailored to your individual application environment. It includes plug-and-play automation policy modules for many IBM® and non-IBM middleware and applications such as DB2, WebSphere®, Apache, and mySAP Business Suite. With TSA for Multiplatforms, you can establish one operation and automation team responsible for z/OS®, Linux, and AIX® applications, to help greatly simplify problem determination and resolution.


Figure 1. DB2 HADR in a TSA cluster domain topology

Software configuration

This is the actual software configuration used to set up the environment for this article:

  • Operating system: Red Hat Enterprise Linux Server, kernel 2.4.21-27 GNU/Linux
  • DB2: DB2 UDB Enterprise Server Edition (ESE) Version 8.1.0.96 at Fixpak 10
  • TSA: TSA 1.2.0 FP0005 Linux

Hardware configuration

Below is the actual hardware configuration used to set up the environment for this article.
Two IBM eServer pSeries® 690 server machines in the cluster domain, each with the following configuration:

  • Processors: Four Intel Xeon MP CPUs at 2.70 GHz
  • Memory: 8 GB
  • Network adapters: Two Intel PRO/1000 Ethernet Adapters


One IBM eServer pSeries 690 server machine at the disaster recovery site with the following configuration:

  • Processors: Intel Xeon CPU 3.0 GHz
  • Memory: 8 GB
  • Network adapters: Two Intel PRO/1000 Ethernet Adapters

External shared storage

There are four IBM FAStT600 Fibre Channel disks on the cluster side and four IBM DS4300 Fibre Channel disks at the disaster recovery site.

Installation and configuration instructions

The following section documents a three-node topology, as depicted in Figure 1. In this example, there is an Active-Passive TSA cluster domain consisting of two nodes (Node1 and Node2) that share common storage containing the actual DB2 database files and software. The third node (Node3) is the disaster recovery site: it resides in a remote location and hosts the standby database for the primary database mentioned earlier. The TSA cluster domain and the standby server are linked through leased lines. The primary and standby database name for the HADR setup is jsbmain.

NODE1: One of the machines of the Active-Passive TSA cluster domain setup. In the current setup, this node is the active one and owns the resources of the cluster.
NODE2: The second machine of the TSA cluster domain setup. In the current setup, this node is the passive node and acts as the standby node for the cluster.
NODE3: This machine is the HADR standby server for DB2 failover and does not fall under the TSA cluster domain setup.
Detailed below are the steps to successfully configure DB2 HADR on a TSA cluster domain. This setup assumes that the TSA cluster domain is already set up properly. For more information on how to set up a basic TSA cluster domain and the related commands, please refer to Appendix A.


Step 1: Basic network setup

1. Add the appropriate IP address-to-hostname mappings to the /etc/hosts file of each node. The hosts file on each node should look like this:
10.1.1.5 NODE1
10.1.1.6 NODE2
10.1.1.2 NODE3
2. Execute the ping <hostname or IP address> command on each node to make sure that all three nodes (Node1, Node2, and Node3) can communicate with each other over TCP/IP.
3. Make sure that the /etc/services file has identical entries for the ports on which the HADR services listen on both nodes of the cluster (Node1 and Node2) as well as on the standby machine (Node3).
Sample entries in the /etc/services file should look like this on all three machines:
DB2_HADR_15 55001/tcp
DB2_HADR_16 55005/tcp
In this case, DB2_HADR_15 is the name of the HADR service running on the primary node of the cluster, while DB2_HADR_16 is the name of the HADR service running on the standby server. A quick consistency check is sketched below.
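
A minimal sanity check for this step, run on each node in turn, might look like the following sketch; the host and service names come from the sample files above, so adjust them for your environment.

# Verify that every node can be resolved and reached over TCP/IP.
for host in NODE1 NODE2 NODE3; do
    ping -c 2 "$host" > /dev/null && echo "$host is reachable"
done
# Verify the HADR service entries; the same two lines must appear on all three machines.
grep DB2_HADR /etc/services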

Step 2: RSH setup

Note: Many of the TSA commands used in the setup require RSH to be set up on all three nodes. RSH allows a user from one node to run commands on another remote node. For more information on setting up RSH on Red Hat Linux, please refer to the Resources section of this article.

Configure RSH to allow the root user to issue remote commands on each node (Node1, Node2, and Node3) by adding the following lines to the /root/.rhosts file:
Node1 root
Node2 root
Node3 root
Log in as the root user and issue the following commands on each of the three nodes:
# rsh Node1 ls
# rsh Node2 ls
# rsh Node3 ls
You should see the directory listing of /root on Node1, Node2, and Node3. A loop that checks the full mesh in one pass is sketched below.
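
The sketch below simply repeats these manual checks in a loop; run it as root on each node in turn to cover all of the combinations.

for node in Node1 Node2 Node3; do
    echo "--- rsh to $node ---"
    rsh "$node" ls /root
done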

Step 3: TSA setup

Please refer to Appendix A for a basic two-node TSA cluster domain setup and for more information on the related TSA commands.


Step 4: HADR setup

Note: In this setup the database is stored on external shared storage (/jsbdata), which is a FAStT600 Fibre Channel disk array. The instances on the two cluster machines are distinct but have the same name (db2inst1), while the database itself is shared. The default TSA scripts that ship with DB2 do not allow the primary and the standby servers (at the TSA level) to have the same name, so these scripts need to be modified to support this configuration.
The following catalog command was used to register the database information at the two instances:
db2 CATALOG DATABASE jsbmain AS jsbmain ON /JSBDATA

Issue the following commands from the command line processor (CLP):

On the Primary database server (Node1):

db2 CONNECT RESET
db2 UPDATE DB CFG FOR jsbmain USING INDEXREC RESTART LOGINDEXBUILD ON LOGARCHMETH1 "DISK:/jsbmain/ARCHFILES" LOGPRIMARY 100 LOGSECOND 50 LOGFILSIZ 5000
db2 BACKUP DATABASE jsbmain TO "/jsbmain/JSBBAK" WITH 2 BUFFERS BUFFER 1024 PARALLELISM 4 WITHOUT PROMPTING

The directory where the backup of the primary server is stored (/jsbmain/JSBBAK) should be accessible from the standby server (Node3), or the backup image should be copied to a local drive on the standby server so that the restore process can complete.

Note: Doing a local restore by copying the backup file to a local drive on the standby server is recommended since a remote restore takes more time because the restore buffers have to be shipped through the network.
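
If the shared backup directory is not mounted on Node3, the copy-and-restore might look like the following sketch. The /localbak directory is an illustrative assumption, and rcp is used only because the .rhosts setup from Step 2 already permits it; any other transfer mechanism works just as well.

# On Node3, as root: pull the backup image from the primary to local disk.
mkdir -p /localbak
rcp Node1:/jsbmain/JSBBAK/* /localbak/
chown db2inst3 /localbak/*
# On Node3, as the db2inst3 instance owner: restore from the local copy.
db2 RESTORE DATABASE jsbmain FROM /localbak REPLACE HISTORY FILE WITHOUT PROMPTING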

On the standby server (Node3):

db2 RESTORE DATABASE jsbmain FROM "/jsbmain/JSBBAK" REPLACE HISTORY FILE WITHOUT PROMPTING

Step 5: Configure databases for automatic client reroute

On the primary server (Node1), execute the following command from the db2 CLP to enable the automatic client reroute feature of HADR:


db2 UPDATE ALTERNATE SERVER FOR DATABASE jsbmain USING HOSTNAME 10.1.1.2 PORT 45000
Where 10.1.1.2 is the IP address of the standby server (NODE3) and 45000 is the port number where the db2inst3 instance of the standby server is listening.
On the standby server (NODE3) execute the following command from the db2 prompt to enable the automatic client reroute feature of HADR.
db2 UPDATE ALTERNATE SERVER FOR DATABASE jsbmain USING HOSTNAME 10.1.1.1 PORT 50000
IMPORTANT: When specifying the hostname of the alternate server for the standby server, always make sure that you specify the virtual IP address of the TSA cluster domain (in this case, the virtual IP address is 10.1.1.1).
50000 is the port number on which the db2inst1 instance is listening. Make sure that db2inst1 on Node2 of the TSA cluster domain listens on the same port number as db2inst1 on Node1. Otherwise, in an HADR failover scenario, db2inst3 on Node3 would try to communicate with port 50000 of db2inst1, which would not be active in case of a disaster. Also, all clients should connect at least once to the primary server so that they pick up the alternate server information before a disaster occurs.
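
To confirm that the alternate server information has been recorded on each server, list the database directory and look at the entry for jsbmain; the output below is abridged and illustrative (taken from the primary, where the alternate points to the standby).

db2 LIST DATABASE DIRECTORY

Database alias                       = JSBMAIN
Database name                        = JSBMAIN
...
Alternate server hostname            = 10.1.1.2
Alternate server port number         = 45000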
To learn more about the automatic client reroute feature of HADR, please refer to the Resources section of this article.


Step 6: Update the HADR configuration parameters

Execute the following commands on the database of the active node of the cluster (Node1 in this case) to make this database the primary database for the HADR setup:

db2 UPDATE DB CFG FOR jsbmain USING HADR_LOCAL_HOST 10.1.1.1
db2 UPDATE DB CFG FOR jsbmain USING HADR_LOCAL_SVC DB2_HADR_15
db2 UPDATE DB CFG FOR jsbmain USING HADR_REMOTE_HOST 10.1.1.2
db2 UPDATE DB CFG FOR jsbmain USING HADR_REMOTE_SVC DB2_HADR_16
db2 UPDATE DB CFG FOR jsbmain USING HADR_REMOTE_INST DB2INST3
db2 UPDATE DB CFG FOR jsbmain USING HADR_SYNCMODE NEARSYNC
db2 UPDATE DB CFG FOR jsbmain USING HADR_TIMEOUT 120

Note: Take special care to always specify the virtual IP address of the TSA cluster domain as HADR_LOCAL_HOST on the primary server so that HADR functions normally in this environment.

Execute the following commands on the standby server (Node3) to make this database the standby database for the HADR setup:

db2 UPDATE DB CFG FOR jsbmain USING HADR_LOCAL_HOST 10.1.1.2
db2 UPDATE DB CFG FOR jsbmain USING HADR_LOCAL_SVC DB2_HADR_16
db2 UPDATE DB CFG FOR jsbmain USING HADR_REMOTE_HOST 10.1.1.1
db2 UPDATE DB CFG FOR jsbmain USING HADR_REMOTE_SVC DB2_HADR_15
db2 UPDATE DB CFG FOR jsbmain USING HADR_REMOTE_INST DB2INST1
db2 UPDATE DB CFG FOR jsbmain USING HADR_SYNCMODE NEARSYNC
db2 UPDATE DB CFG FOR jsbmain USING HADR_TIMEOUT 120

Ensure that both servers of the TSA domain and the standby server have the TCP/IP protocol enabled for DB2 communication by executing the db2set DB2COMM=TCPIP command on each of them.
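
A quick way to double-check the values you have just set is to filter the database configuration for the HADR parameters on each server; the local and remote host, service, and instance values on the primary and the standby should mirror each other.

db2 GET DB CFG FOR jsbmain | grep -i HADR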


Step 7: Starting HADR

As standby instance owner (db2inst3), start HADR on the standby node (Node3) as follows:
db2 DEACTIVATE DATABASE jsbmain
db2 START HADR ON DATABASE jsbmain AS STANDBY
As primary instance owner (db2inst1), start HADR on the primary node (Node1) as follows:
db2 DEACTIVATE DATABASE jsbmain
db2 START HADR ON DATABASE jsbmain AS PRIMARY
Note: When starting HADR, always start it on the standby first and then on the primary. Similarly, when stopping HADR, stop it on the primary first and then on the standby.

Step 8: Verifying the HADR setup

Now that the entire HADR setup is complete, verify that it is really working.

Execute the following command at the primary server (Node1):

db2 GET SNAPSHOT FOR DB ON jsbmain

The output should be similar to the one shown below:

HADR Status
Role = Primary
State = Peer
Synchronization mode = Nearsync
Connection status = Connected, 11/24/2006 03:43:39.044650
Heartbeats missed = 0
Local host = 10.1.1.1
Local service = DB2_HADR_15
Remote host = jsbdr
Remote service = DB2_HADR_16
Remote instance = db2inst3
timeout (seconds) = 120
Primary log position (file, page, LSN) = S0000139.LOG, 0, 000000003C8E0000
Standby log position (file, page, LSN) = S0000139.LOG, 0, 000000003C8E0000
Log gap running average (bytes) = 0

Execute the following command at the standby server (Node3):

db2 GET SNAPSHOT FOR DB ON jsbmain

The output on the standby server should be similar to the one shown below:

HADR Status
Role = Standby
State = Peer
Synchronization mode = Nearsync
Connection status = Connected, 11/24/2006 03:41:59.782744
Heartbeats missed = 0
Local host = jsbdr
Local service = DB2_HADR_16
Remote host = 10.1.1.1
Remote service = DB2_HADR_15
Remote instance = db2inst1
timeout (seconds) = 120
Primary log position (file, page, LSN) = S0000139.LOG, 0, 000000003C8E0000
Standby log position (file, page, LSN) = S0000139.LOG, 0, 000000003C8E0000
Log gap running average (bytes) = 0

To get more information on the various states of an HADR pair and on how HADR actually works, please refer to the Resources section of this article.
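
Depending on your fix pack level, the db2pd utility introduced in DB2 UDB V8.2 can also report the HADR role and state and is a lighter-weight alternative to a full snapshot; a minimal sketch, run as the instance owner on either server:

db2pd -db jsbmain -hadr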


Step 9: Testing failover of HADR

The last step of the setup procedure is to test the failover capability of HADR. Follow these steps:

a. Manually shut down the primary server using the db2_kill command.
b. Execute the takeover command at the standby server:
db2 TAKEOVER HADR ON DATABASE jsbmain
c. If the normal takeover does not work, specify the BY FORCE option to force the takeover on the standby server (see the sketch after this list).
d. Taking a snapshot as described earlier now shows the standby server performing the role of a primary server. It may take some time for the status to be reflected because of the network latency in applying the log buffers; during this period the standby server shows the state as Remote catchup pending.
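
For reference, the forced takeover and one possible failback sequence might look like the sketch below. The failback assumes that the old primary (Node1) can be reintegrated as a standby, which depends on the state of its logs; if reintegration fails, the standby database has to be re-established from a fresh backup.

# On the standby server (Node3), only if the normal takeover fails:
db2 TAKEOVER HADR ON DATABASE jsbmain BY FORCE
# Later, on the repaired old primary (Node1), rejoin the pair as a standby:
db2 START HADR ON DATABASE jsbmain AS STANDBY
# Once the pair is back in Peer state, switch the roles back from Node1:
db2 TAKEOVER HADR ON DATABASE jsbmain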

Now you are all set to leverage the power of DB2 HADR in a TSA cluster!



Appendix A


Commands used for setting up a two-node TSA cluster domain

The following commands were used to set up a two-node TSA cluster domain; a typical end-to-end sequence is sketched after the list:

preprpnode: This command prepares the security settings for the node to be included in a cluster. When issued, public keys are exchanged among the nodes, and the RMC access control list (ACL) is modified to enable access to cluster resources by all the nodes of the cluster.
mkrpdomain: This command creates a new cluster definition. It is used to specify the name of the cluster, and the list of nodes to be added to the cluster.
lsrpdomain: This command lists information about the cluster to which the node where the command runs belongs.
startrpdomain / stoprpdomain: These commands are used to bring the cluster online and offline, respectively.
addrpnode: Once a cluster has been defined and is operational, this command is used to add new nodes to the cluster.
startrpnode / stoprpnode: These commands are used to bring individual nodes online and offline to the cluster. They are often used when performing maintenance on a particular system: the node is stopped, repairs or maintenance are performed, and then the node is restarted, at which time it rejoins the cluster.
lsrpnode: This command is used to view the list of nodes defined to a cluster, as well as the operating state (OpState) of each node. Note that this command is useful only on nodes that are Online in the cluster; otherwise it will not display the list of nodes.
rmrpdomain: This command removes a defined cluster.
rmrpnode: This command removes one or more nodes from a cluster definition.
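
Pulling these commands together, a typical end-to-end sequence for a small cluster might look like the following sketch; the node and domain names are illustrative, and the next section walks through the same flow in more detail.

preprpnode node01 node02             # on every node: prepare the security settings
mkrpdomain SA_Domain node01 node02   # define the cluster
startrpdomain SA_Domain              # bring the cluster online
lsrpdomain                           # check the domain state
lsrpnode                             # check the node states
addrpnode node03                     # add another node later on
startrpnode node03                   # bring the new node online
stoprpdomain SA_Domain               # take the whole domain offline
rmrpdomain SA_Domain                 # remove the cluster definition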

For detailed descriptions of these commands, refer to these manuals, all of which you can find on the IBM TSA CD:

IBM Reliable Scalable Cluster Technology for Linux, Administration Guide, SA22-7892

IBM Reliable Scalable Cluster Technology for Linux, Technical Reference, SA22-7893

IBM Reliable Scalable Cluster Technology for AIX 5L: Administration Guide, SA22-7889

IBM Reliable Scalable Cluster Technology for AIX 5L: Technical Reference, SA22-7890

Please refer to the Resources section at the bottom of the article for TSA references.


Defining and administering a cluster

The following scenarios show how to create a cluster, add nodes to the cluster, and check the status of the IBM TSA daemon (IBM.RecoveryRM).
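
For the daemon check mentioned above, the RSCT lssrc command is typically used; a minimal sketch, run as root on a node that is online in the domain:

lssrc -ls IBM.RecoveryRM     # detailed status of the TSA recovery manager daemon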


Creating a two-node TSA cluster domain

To create this cluster, you need to:

1. Log in as root on each node in the cluster.

2. Set the environment variable CT_MANAGEMENT_SCOPE=2 on each node:

export CT_MANAGEMENT_SCOPE=2

3. Issue the preprpnode command on all nodes to allow communication between the cluster nodes.

preprpnode node01 node02

4. You can now create a cluster with the name "SA_Domain" running on Node1 and Node2. The following command can be issued from any node:

mkrpdomain SA_Domain node01 node02

Note: When creating RSCT peer domains (clusters) using the mkrpdomain command, the characters that can be used for the peer domain name are limited to the following ASCII characters: A-Z, a-z, 0-9, . (period), and _ (underscore).

5. To look up the status of SA_Domain, issue the lsrpdomain command: lsrpdomain

Output:
Name        OpState  RSCTActiveVersion  MixedVersions  TSPort  GSPort
SA_Domain   Offline  2.3.3.0            No             12347   12348

The cluster is defined but offline.

6. Issue the startrpdomain command to bring the cluster online:

startrpdomain SA_Domain

When you run the lsrpdomain command again, you see that the cluster is still in the process of starting up; the OpState is Pending Online.


Output:
Name        OpState         RSCTActiveVersion  MixedVersions  TSPort  GSPort
SA_Domain   Pending Online  2.3.3.0            No             12347   12348

Notes:
1. You may get an error message similar to this one:
"2632-044 the domain cannot be created due to the following errors that were detected while harvesting information from the target nodes:
node1: 2632-068 this node has the same internal identifier as node2 and cannot be included in the domain definition."
This error most often occurs if you have cloned Linux images, so the nodes carry the same RSCT node identifier and that node's configuration has to be reset. You can resolve the problem by running the /usr/sbin/rsct/install/bin/recfgct command on the node named in the error message to reset the node ID, and then continuing with the preprpnode command.
2. You may also get an error message like this:
"2632-044 The domain cannot be created due to the following errors that were detected while harvesting information from the target nodes:
node1: 2610-418 Permission is denied to access the resources or resource class specified in this command."
To resolve this issue, check your hostname resolution. Make sure that the entries for each cluster node are identical in the local /etc/hosts files on all nodes and in the nameserver.

Adding a node to an existing cluster

After creating a two-node cluster, you can add a third node to SA_Domain in this way:

1. Issue the lsrpdomain command as a root user to see if your cluster is online:

Output:
Name        OpState  RSCTActiveVersion  MixedVersions  TSPort  GSPort
SA_Domain   Online   2.3.3.0            No             12347   12348

2. Issue the lsrpnode command to see which nodes are online:

Name OpState RSCT Version
node02 Online 2.3.3.0
node03 Offline 2.3.3.0
node01 Online 2.3.3.0

3. Issue the following preprpnode commands as a root user to allow communication between the existing nodes and the new node.

Log on to Node3 as a root user and enter:

preprpnode node01 node02

Log on to Node2 as a root user and enter:

preprpnode node03

Log on to Node1 as a root user and enter:

preprpnode node03

It is strongly recommended that you execute the preprpnode command on each node.

4. In order to add Node3 to the cluster definition, issue the addrpnode command as a root user on Node1 or Node2, which are already online on the cluster:

addrpnode node03
Issue the lsrpnode command as a root user to see the status of all nodes:
Name OpState RSCT Version
node02 Online 2.3.3.0
node03 Offline 2.3.3.0
node01 Online 2.3.3.0

5. As a root user, start Node3 from an online node:

startrpnode node03

After a short time Node3 should be online, too.


Acknowledgements:

I would like to acknowledge Priti Desai, Database Consultant, IBM Silicon Valley Lab, for her valuable input and for proofreading, which greatly helped in bringing this article to successful completion.
