Backup of RAC instance causing interruption of service

Last post 11-12-2010, 6:57 AM by Dannie. 6 replies.
  • Backup of RAC instance causing interruption of service
    Posted: 11-06-2010, 8:45 PM

    Hello my fellow backup admins.

    For the past couple of days I've been having a problem that completely baffles me.

    I have 2 machines running 5 RAC databases. Each machine runs one instance of every database, so server1 is running db1instance1, db2instance1, db3instance1, db4instance1 and db5instance1, and server2 is running db1instance2, db2instance2, etc.

    These servers (or actually the RAC pseudo clients) have been running happily for months. But since last week, a backup running on either server interrupts service. All databases have been configured to use only 1 data stream and 1 log stream.

    When I start a backup of a single database on 1 of the servers, the application server connecting to it complains that it has lost its connection. When I start 2 backups simultaneously, the application server just gives up and craps out completely.

    I used to start all 5 databases at once without any problem. Something seems to have changed, but I'm unsure what.

    We looked at the obvious culprits: network load (interface and switch), CPU load, memory, SAN load, etc., but everything seems to be running within "safe" parameters.

    I can therefore only conclude that "something" in the database is causing this DoS. More specifically, RMAN seems to be blocking other queries / modifications to the database.

    The errors always occur 3-4 minutes after starting the backup. They never happen during the "logs backup", but always during the "database backup", specifically at the start of it (allocate channel / backup command?). Mind you, these errors are only visible to the end user. Neither the Oracle alert log nor the log for the backup job shows anything out of the ordinary.

    Have any of you ever seen a RAC database refuse queries / modifications during a backup while there were ample resources available? Does anyone have any idea how to fix this or, at least, how to debug it further?


    If it jams, force it.
    If it breaks, it needed replacing anyway.
  • Re: Backup of RAC instance causing interruption of service
    Posted: 11-08-2010, 9:28 AM
    • efg is not online. Last active: 08-06-2020, 1:41 PM efg
    • Top 10 Contributor
    • Joined on 02-02-2010
    • CommVault Tinton Falls NJ
    • Master
    • Points 1,732

    What happens if you switch the order of the nodes in the subclient?  For instance, with a 5 node RAC iDA (assuming that all 5 nodes are configured in the iDA), you can change which node will "host" the RMAN session under the "Storage Device" tab of the subclient.  If you select (highlight) one of the nodes in the window of the Storage Device tab, you can then click the up/down arrows to the right of that window to change the order of the nodes.  That order dictates which node gets the RMAN session when the backup is triggered through the CommServe.  Try placing one of the other nodes at the top of the list to see if that makes a difference.  If the backups run OK after switching the nodes, that could point to a problem with just that one node.

    You can also try running a "backup validate" of the database to see whether the problem occurs while RMAN is scanning/traversing the database for the backup, or only while the data is being transferred to the CV back-end.  There is an option for this in the subclient under the "Backup Arguments" tab.  Underneath that is another tab for "Options".  Near the middle of the pop-up window is a check box for "validate".  This runs the RMAN command "backup validate", which causes RMAN to perform a "dry-run" backup that scans all the data in the DB without actually transferring any data.  (Be sure to clear the check box when you are done!)
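
    For anyone who wants to try the same dry run by hand, outside the CommCell console, the sketch below builds the text of a small RMAN command file in Python. It is only an illustration: the function name `build_validate_script`, the channel names `ch1`, `ch2`, ... and the use of a disk channel are assumptions made here, and this is not the script that the "validate" check box generates.

```python
def build_validate_script(channel_count: int = 1) -> str:
    """Return RMAN command text for a dry-run ("validate") backup.

    Channel names and the disk device type are illustrative assumptions;
    BACKUP VALIDATE reads the datafiles but writes no backup pieces.
    """
    if channel_count < 1:
        raise ValueError("channel_count must be at least 1")

    lines = ["run {"]
    # One channel per requested stream (the thread uses a single stream).
    for i in range(1, channel_count + 1):
        lines.append(f"  allocate channel ch{i} device type disk;")
    # Scan every datafile without producing any backup output.
    lines.append("  backup validate database;")
    for i in range(1, channel_count + 1):
        lines.append(f"  release channel ch{i};")
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_validate_script())
```

    If the application errors still show up 3-4 minutes into a run like this, the read/scan phase is the more likely suspect than the transfer to the MediaAgent.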

    Do you run incremental backups?  If the problem occurs only during incremental backups and not during full backups, you may want to enable "block change tracking".  This will help lighten the load and increase performance when running incremental RMAN backups.

    http://download.oracle.com/docs/cd/B19306_01/backup.102/b14192/bkup004.htm#i1032148

    Let us know how you make out.

    Ernie


    Ernst F. Graeler
    Senior Engineer III
    Development
  • Re: Backup of RAC instance causing interruption of service
    Posted: 11-10-2010, 4:32 PM

    Thank you for this insightful and useful reply.

    I have configured 5 separate RAC pseudo clients for these 5 databases. However, I never noticed the arrows on the right side of the screen.

    Is it *always* the top one which hosts the RMAN session?

    To try and resolve this, I've done the following (on the "Data Storage Policy" and the "Logs backup" tabs)

    Log / Data threshold streams: 1
    Instance 1: OPEN: 0
    Instance 2: OPEN: 1

    I assumed all data would be backed up via instance 2. But if I read your reply correctly, the RMAN session will be hosted on the node with instance 1 on it, and it will pull all data over the LAN from the instance 2 node?

    Could you please confirm? This would explain a lot of the strange LAN issues we've been having over the last couple of days.


    If it jams, force it.
    If it breaks, it needed replacing anyway.
  • Re: Backup of RAC instance causing interruption of service
    Posted: 11-10-2010, 5:03 PM

    In your current configuration the RMAN session will be hosted on node 1, but the RMAN session issues a connect (through the allocate command) to node 2, so the data transfer will occur directly between node 2 and the MediaAgent.  Looking at the CV logs on the clients, you will see ClOraAgent.log being written on node 1 (where the RMAN session is running) and ORASBT.log being written on node 2 (where the API is connecting and moving data to the MediaAgent).  You probably won't see any logging in ORASBT.log on node 1.  Node 2 will show some ClOraAgent.log activity when the backup starts, because the discovery phase of the backup connects to ALL the nodes to determine the status of the RAC.
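
    As a quick checklist of where to look, the log behaviour described above can be written down as a small lookup. This is only a sketch of the description in this post; the function name `expected_log_activity` and the node labels are hypothetical and nothing here queries Commvault itself.

```python
def expected_log_activity(nodes, rman_host, channel_nodes):
    """Map each RAC node to the log activity described in this thread.

    nodes         -- all node names in the RAC pseudo client (illustrative)
    rman_host     -- the node hosting the RMAN session (top of the list)
    channel_nodes -- nodes that RMAN connects to via "allocate channel"
    """
    if rman_host not in nodes:
        raise ValueError("rman_host must be one of nodes")

    activity = {}
    for node in nodes:
        activity[node] = {
            # Every node is touched by discovery; only the host logs the job.
            "ClOraAgent.log": "job" if node == rman_host else "discovery only",
            # ORASBT.log is written where data actually moves to the MediaAgent.
            "ORASBT.log": node in channel_nodes,
        }
    return activity


if __name__ == "__main__":
    # The configuration from this thread: session on node 1, stream on node 2.
    for node, logs in expected_log_activity(
        ["node1", "node2"], "node1", {"node2"}
    ).items():
        print(node, logs)
```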

     

    Hope this helps explain a little more.  :)


    Ernst F. Graeler
    Senior Engineer III
    Development
  • Re: Backup of RAC instance causing interruption of service
    Posted: 11-10-2010, 7:31 PM

    Nice. So only a bit of control traffic is going through the cluster interconnect. The bulk of data is just going straight from node2 to the MA.

    Good to know :)

    Thank you for your reply.

    Today I got some networking stats, and there seems to be a giant peak in network traffic whenever a backup is running. So much so that the trunk seems to be unable to handle it all. Filesystem backups also seem to cause this (to a lesser extent).

    Is there anything I can do about this in Commvault, except for rate limiting the MAs / clients?


    If it jams, force it.
    If it breaks, it needed replacing anyway.
  • Re: Backup of RAC instance causing interruption of service
    Posted: 11-11-2010, 9:50 AM

    Not that I can think of.  A CV backup will try to pump as much through the network as possible and will consume the entire pipe if it can.  You may want to consider creating a dedicated backup subnet and then configuring DIPs (Data Interface Pairs) to route the backup data through that subnet.  Another alternative (one that a number of customers choose) is to go lanless: connect the library directly to the Oracle server (via SAN or iSCSI) and install an MA directly on the Oracle server.  We have customers achieving throughputs (multi-stream RAC) of over 1.6 TB/hr with this type of configuration.
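
    To put a figure like 1.6 TB/hr next to a network link, the conversion can be sketched in a few lines of Python. The assumptions are decimal units (1 TB = 10^12 bytes) and the 1 and 10 Gbit/s link speeds, which are illustrative only; the capacity of the trunk in this thread was not stated.

```python
def tb_per_hour_to_gbit_per_s(tb_per_hour: float) -> float:
    """Convert a throughput in TB/hr to Gbit/s (decimal units assumed)."""
    bytes_per_second = tb_per_hour * 1e12 / 3600
    return bytes_per_second * 8 / 1e9


def link_utilisation(tb_per_hour: float, link_gbit_per_s: float) -> float:
    """Fraction of a link that a given backup rate would occupy."""
    if link_gbit_per_s <= 0:
        raise ValueError("link speed must be positive")
    return tb_per_hour_to_gbit_per_s(tb_per_hour) / link_gbit_per_s


if __name__ == "__main__":
    rate = tb_per_hour_to_gbit_per_s(1.6)
    print(f"1.6 TB/hr is about {rate:.2f} Gbit/s")
    for link in (1.0, 10.0):  # hypothetical link speeds
        print(f"{link:>4.0f} Gbit/s link: {link_utilisation(1.6, link):.0%}")
```

    At roughly 3.56 Gbit/s, a rate like that is several times what a single 1 Gbit/s link can carry, which is why a shared trunk fills up quickly and why a separate backup subnet or a lanless setup helps.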

     

    Ernie


    Ernst F. Graeler
    Senior Engineer III
    Development
  • Re: Backup of RAC instance causing interruption of service
    Posted: 11-12-2010, 6:57 AM

    We are already using a number of Windows servers with this "lanless" configuration.

    Thank you for your help efg.


    If it jams, force it.
    If it breaks, it needed replacing anyway.