Improvements to recovery functionality on large Dedupe partitions

Last post 02-01-2017, 5:30 PM by AUBackupGuy. 5 replies.
  • Improvements to recovery functionality on large Dedupe partitions
    Posted: 11-28-2016, 9:49 PM

We've had a fair number of issues with our DDBs, resulting in multiple DDB reconstructions, not all of which succeeded. Three times in the last year we've had to run a full DDB reconstruction reading from disk.

    Our latest case is 161128-545 if any CV guys want to check our account history.

Chucking out some ideas that would make things better from the customer-experience side.

    Failure domain is too large

We've got a ~1PB application size running on a 2x-partitioned DDB GDSP. When a DDB issue happens on one of those, there's either a whole bunch of pending deletes to process before it can come back online or, in the worst case, ~500TB of data to process on a full DDB reconstruction from disk.

These were put in place before 4-partitioned DDBs were released. Ideally we could "expand" the partition count so that all new jobs would round-robin between 4 partitions at a minimum (6- or 8-partitioned MAs would be better), evening out the application data per MA over time as previous jobs aged out. A 250TB or 125TB full recon is a lot better than 500TB.
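To make the failure-domain arithmetic concrete, here's a back-of-envelope sketch. The 1000TB figure is the ~1PB application size from above; everything else is just division:

```shell
# Rough per-partition full-recon read as the DDB partition count grows.
# app_tb is the ~1PB application size quoted above (illustrative only).
app_tb=1000
for partitions in 2 4 8; do
  echo "${partitions} partitions -> ~$(( app_tb / partitions ))TB full recon per partition"
done
```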

    Allow multiple DDB backups over time to be used for DDB restores

We had an issue where our DDB backups aged off (160815-3). In the same case we discovered DDB backups resulting in corruption (see the case email of 19 September 2016, 11:41 AM). Given these things have definitely happened in our environment already, it'd be nice to be able to go back to an earlier DDB backup. At present only the most recent DDB backup is used, and if that's corrupted you're stuck with a full rebuild.

From the customer's point of view, if I have the choice between reconstructing the DDB from 500TB of data on disk, or recovering a DDB backup from an extra day back and copping an extra 20TB of backups to process in the DDB recovery "add records" phase, I'd take the latter as the much faster option.

I'd be quite happy to throw a couple of extra TB at multi-day DDB backups rather than risk full DDB recons.
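As a quick sanity check on that trade-off, using the figures above (~20TB of new backups per day to replay, versus a ~500TB recon read; purely illustrative):

```shell
# Compare replaying N days of new backups into an older DDB copy
# (the "add records" phase) against a full recon read from disk.
# Figures are the ones quoted in the post, not measurements.
recon_tb=500
extra_per_day_tb=20
days_old=3
replay_tb=$(( days_old * extra_per_day_tb ))
echo "replay ~${replay_tb}TB of new backups vs read ${recon_tb}TB from disk"
echo "that's roughly 1/$(( recon_tb / replay_tb )) of the full-recon workload"
```

Even a three-day-old DDB backup only costs ~60TB of replay, an order of magnitude less data than the full recon read.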

    Data verification on a per mount path basis rather than per GDSP

A couple of times, the latest example being 161027-17, we've had to run data verification. For the same reason as above, verifying 1PB of data in a single chunk is very bad from a customer-experience perspective. In CV11 you can run jobs during verification, but not data aging. This is a big problem.

You're running data verification because there's a mismatch between your DDB and disk media, and while it runs, aging is blocked for, say, 5 days. "Pending deletes" build up to 300 million records (~40TB). Aside from the disk-space issues, the DDB that already has issues, and is thus more likely to need a rebuild, will then have to spend days clearing out those 300 million pending deletes in the recon "prune" phase before it comes back online. Customers will be much more likely to complete requested data verification activities if they can be done in a more granular fashion.

    Don't process pending deletes in the DDB recon

The priority is to get the MA back online as quickly as possible, and spending 3+ days processing pending deletes does not accomplish this. Why can't these pending deletes just be transferred into the usual pending-deletes queue to be worked down while backups are ongoing? I can understand needing to process new data before coming online, but not the need to truncate old data here.

    Minor thing:
    Add the size of pending deletes to the DDB stats view under dedupe engines

Saves me logging onto the MA to grep it out of SIDBEngine.log. Add a column with a human-readable size.

If anyone else finds working out the size a pain, run this in your Linux MA's log directory:

cat SIDBEngine*.log | grep -i "Total]" | grep -Eo "Pending Deletes.{30}" | tr "[]-" " " | numfmt --to=iec --field 4

    Pending Deletes  85491472              17T
The first number is the count of outstanding deletes; the second is the human-readable size.
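If you want a single total across all engines, an awk variant of the same pipeline can sum the fields. This assumes the bracketed `Pending Deletes[<count>]-[<bytes>]` field layout implied by the one-liner above; treat it as a sketch, not gospel:

```shell
# Sum pending-delete counts and bytes across all SIDBEngine logs.
# Field layout assumed from the one-liner above:
#   after tr, $3 = outstanding delete count, $4 = size in bytes.
cat SIDBEngine*.log 2>/dev/null \
  | grep -i "Total]" \
  | grep -Eo "Pending Deletes.{30}" \
  | tr "[]-" " " \
  | awk '{ count += $3; bytes += $4 }
         END { printf "Pending deletes: %d (%.1f TB)\n", count, bytes / 1e12 }'
```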



  • Re: Improvements to recovery functionality on large Dedupe partitions
    Posted: 12-15-2016, 12:42 AM

    Another thing for the list:

    Multithreaded scan:

See case 161107-1. We've got a GlusterFS file system that every night spends ~7 hours scanning (this is with optimised scan) and then 10 minutes actually backing up files.

My current workaround is multiple subclients based on wildcard directory paths. This works better than the base method, but it's annoying to keep track of, and if people move files out of the carefully defined paths it unbalances the 10 subclients.

If you can multithread the scan using multiple subclients, you can multithread it using one; it just needs some logic to define the scope of each scan process. We've done some testing of block-level backups, but so far it doesn't look like something you'd want to put on a box where you don't control the reboots.
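The per-directory split that the subclient workaround approximates can be sketched generically with standard tools. This is just an illustration of fanning one scan out over top-level directories, not Commvault's scanner; the parallelism of 8 and the `SCAN_ROOT` variable are arbitrary choices:

```shell
# Illustration: run one "scan" worker per top-level directory,
# up to 8 in parallel. A real scanner would emit file metadata;
# this one just counts files under each directory.
SCAN_ROOT="${SCAN_ROOT:-.}"
find "$SCAN_ROOT" -mindepth 1 -maxdepth 1 -type d -print0 \
  | xargs -0 -r -P 8 -I{} sh -c 'echo "{}: $(find "{}" -type f | wc -l) files"'
```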

  • Re: Improvements to recovery functionality on large Dedupe partitions
    Posted: 01-02-2017, 7:10 PM

Adding another case to the list for the suggestion:

"Data verification on a per mount path basis rather than per GDSP"


170102-141: a partitioned DDB MA rebooted, and pruning got stuck on an index file on each MA. Suspect it's going to come down again to renaming dodgy index files as corrupt (à la 161115-34) or again running a full verify across the whole 1PB+ application, 300TB+ on disk.


  • Re: Improvements to recovery functionality on large Dedupe partitions
    Posted: 01-24-2017, 10:15 PM

An update and some good news for the suggestion:

    "Allow multiple DDB backups over time to be used for DDB restores"

A CMR has been accepted by CV dev to add this functionality. It will take a while, but it's in the pipeline.

  • Re: Improvements to recovery functionality on large Dedupe partitions
    Posted: 02-01-2017, 8:45 AM
    • Ali

    Hello, thanks for the posts and tracking these suggestions.

If you could send me your CommCell ID offline, we can tag you in these MRs and update you on their progress as well. We will be submitting another one on your behalf as soon as we have this info.

  • Re: Improvements to recovery functionality on large Dedupe partitions
    Posted: 02-01-2017, 5:30 PM

    Hi Ali,

PM'd the CommServe details; cheers for the update. Marking as resolved.

