Hello my fellow backup admins.
For the past couple of days, I've been having a problem that completely baffles me.
I have 2 machines, running 5 RAC databases. Each of those machines is running the 5 instances, so server1 is running db1instance1, db2instance1, db3instance1, db4instance1 and db5instance1, and server2 is running db1instance2, db2instance2, etc.
These servers (or actually the RAC pseudo clients) have been running happily for months. But since last week, a backup running on either server interrupts service. All databases have been configured to use only 1 data stream and 1 log stream.
When I start a backup of a single database on 1 of the servers, the application server connecting to it complains that it has lost its connection. When I start 2 backups simultaneously, the application server just gives up and craps out completely.
I used to start backups of all 5 databases at once, without any problem. Something seems to have changed, but I'm unsure what.
We looked at the obvious culprits: network load (interface and switch), CPU load, memory, SAN load, etc., but everything seems to be running within "safe" parameters.
I can therefore only conclude that it's "something" in the database that is causing this DoS. More specifically, RMAN seems to be blocking other queries / modifications to the database.
The errors always occur 3-4 minutes after starting the backup. It never happens during the "logs backup", but always during the "database backup", specifically at the start of it (allocate channel / backup command?). Mind you, these errors are only visible to the end user. Neither the Oracle alert log nor the log for the backup job shows anything out of the ordinary.
Have any of you ever seen a RAC database refuse queries / modifications during a backup while there were ample resources available? Does anyone have any idea how to fix this, or at least how to debug it further?
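
To test the "RMAN is blocking other sessions" theory, one thing that could be tried is polling GV$SESSION for blocked sessions during those first 3-4 minutes of the database backup. Below is a minimal sketch of that, not something I have running. It assumes the python-oracledb driver is installed and that the account used can read GV$SESSION; the function names, the DSN and the credentials in the usage comment are made up for illustration.

```python
import time

# GV$SESSION is the cluster-wide session view, so this covers both RAC instances.
# BLOCKING_SESSION is non-null only for sessions waiting on another session.
BLOCKED_SESSIONS_SQL = """
SELECT inst_id, sid, username, program, event, wait_class,
       blocking_instance, blocking_session, seconds_in_wait
  FROM gv$session
 WHERE blocking_session IS NOT NULL
 ORDER BY seconds_in_wait DESC
"""


def format_row(row):
    """Turn one result row into a readable one-line summary."""
    inst_id, sid, username, program, event, wait_class, b_inst, b_sid, secs = row
    return (
        f"inst {inst_id} sid {sid} ({username or '-'}, {program or '-'}) "
        f"waits {secs}s on '{event}' [{wait_class}], "
        f"blocked by inst {b_inst} sid {b_sid}"
    )


def poll_blocked_sessions(dsn, user, password, interval_s=10, rounds=30):
    """Print blocked sessions every interval_s seconds, for rounds iterations."""
    import oracledb  # assumption: python-oracledb is installed on the admin host

    with oracledb.connect(user=user, password=password, dsn=dsn) as conn:
        with conn.cursor() as cur:
            for _ in range(rounds):
                cur.execute(BLOCKED_SESSIONS_SQL)
                stamp = time.strftime("%H:%M:%S")
                for row in cur:
                    print(stamp, format_row(row))
                time.sleep(interval_s)


# Hypothetical usage: start this just before kicking off the database backup.
# poll_blocked_sessions("server1/db1", "monitor_user", "secret")
```

If the blocker's SID turns out to belong to an RMAN channel (PROGRAM starting with "rman"), that would at least confirm where to dig; if nothing shows up as blocked, the dropped connections probably have another cause.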