HyperScale on HP reference hardware - problems after upgrade to v1.5

Last post 03-24-2020, 10:31 AM by Ken_H. 20 replies.
  • HyperScale on HP reference hardware - problems after upgrade to v1.5
    Posted: 01-30-2020, 11:17 AM

    Hello everyone,

    We've been running a HyperScale media agent on HP reference hardware for about 14 months now.  Last month we added two drives for extra capacity and upgraded to HyperScale 1.5.  I now have 8 large form factor drives on the front of each server, which holds a maximum of 12 drives, so I've still got room for another 50% increase in capacity.

    The problem:  Since adding the new drives and upgrading the version of HyperScale, I've had multiple outages due to high internal temperature faults.  Each node has had this issue at various times, but no two nodes have failed at the same time.  During the day when no backups are running the fans used to run at 12% of maximum, so we've tweaked the BIOS for increased cooling and now see the fans running at 32% of maximum.  I'm not sure what things are like when backups run as that's the middle of the night.  Also, I've checked with my facilities people and all data center AC units are running normally and there's been no change to the temperature setting.

    We're an HP shop and have several DL380 servers and none have ever had a shutdown due to heat in the 15 years I've been here (Disclaimer:  As far as I know).

    So my question is:  For anyone else running HyperScale on HP reference hardware... did you notice any issues following the upgrade to HyperScale 1.5?

    At this time my next step is to change the BIOS to run at max cooling where all fans run at 100% all the time.  This is fine, I just thought I'd check to see if anyone else had any similar issues.

    Ken

  • Re: HyperScale on HP reference hardware - problems after upgrade to v1.5
    Posted: 01-31-2020, 7:08 AM

    Hi Ken_H

    The purpose of HyperScale 1.5 is to upgrade the kernel to the latest Red Hat 7.7, to leverage various fixes and enhancements at both the kernel and Gluster layers.

    The kernel upgrade should not cause any specific behavioral changes in how the hardware operates.

    Have you engaged HPE and checked whether there could be any compatibility issue with any of the new firmware and drivers introduced as part of Red Hat 7.7?

    I suspect that specific firmware introduced as part of Red Hat 7.7 may not be playing well with the BIOS/hardware. HPE could advise more on this area if this is the case.

    Regards

    Winston 

  • Re: HyperScale on HP reference hardware - problems after upgrade to v1.5
    Posted: 02-14-2020, 10:54 AM

    We've upgraded the firmware in all three servers AND set all fans to run at their maximum speed yet we still have a problem with shutdowns due to high internal temperature.  My sysadmin is on vacation right now and when he returns we'll consider our options for dealing with this.

    On the bright side, the Commvault HyperScale software appears to be very resilient.  The loss of an entire server does not stop backups from running and within 45 minutes of coming back online, everything within Commvault appears to be back to normal.

    Ken

  • Re: HyperScale on HP reference hardware - problems after upgrade to v1.5
    Posted: 02-16-2020, 3:27 AM

    Hi Ken_H

    I think HP engagement will be required to understand what is causing the high internal temperature.

    That said, the Commvault HyperScale solution is built with resiliency and redundancy, so even if a Node goes down the Data Blocks will be distributed to the active Nodes.

    Once the offline Node comes back online, Gluster will trigger a healing process to rebuild the blocks required on that Node.

    Regards

    Winston

  • Re: HyperScale on HP reference hardware - problems after upgrade to v1.5
    Posted: 02-17-2020, 10:09 AM

    Hmmm.


    Shouldn't Commvault have verified that their 1.5 upgrade doesn't cause issues with its approved reference hardware?

    I have Cisco C240M5 reference architecture.  How do I know the same thing won't happen to mine if I go to 1.5?


    Regards

    Guy
  • Re: HyperScale on HP reference hardware - problems after upgrade to v1.5
    Posted: 02-17-2020, 6:26 PM

    Hi Guy 

    For all HyperScale Reference Architectures, the design specification is already listed on Books Online (http://documentation.commvault.com/commvault/v11_sp18/article?p=102061.htm) and provides detailed information on the configuration and the recommended firmware versions.

    According to the current documentation for the C240 M5, there are no additional firmware requirements for the upgrade; however, please do reach out to Support to vet further details before proceeding.

    Regards

    Winston 

  • Re: HyperScale on HP reference hardware - problems after upgrade to v1.5
    Posted: 02-18-2020, 1:20 PM

    My HyperScale servers are the bottom three servers in the rack and the air coming out of the perforated tile directly in front of them measures 18.9 C / 66 F (behind the rack is obviously much warmer).  For anyone following this thread, please post your server room temperature so I can get a feel for whether I should keep my data center cooler.

    Ken

  • Re: HyperScale on HP reference hardware - problems after upgrade to v1.5
    Posted: 02-18-2020, 5:59 PM

    FYI:  We now have a ticket open with HP about this.  I'll update this thread as I learn more.

    Ken

  • Re: HyperScale on HP reference hardware - problems after upgrade to v1.5
    Posted: 02-18-2020, 10:52 PM

    Hi Ken 

    Thanks for the update 

    Would be interesting to see what HP identifies here.

    Regards

    Winston 

  • Re: HyperScale on HP reference hardware - problems after upgrade to v1.5
    Posted: 02-19-2020, 10:32 AM

    The most recent outage was caused by one of the internal 960GB SSD drives getting to 107 C / 225 F.  I believe this drive is used by the deduplication databases but am not 100% sure.  The drive will be replaced under warranty.

  • Re: HyperScale on HP reference hardware - problems after upgrade to v1.5
    Posted: 02-19-2020, 6:55 PM

    OK, that is definitely very concerning if the heat is affecting the disks, especially the DDB/Index disk, which is usually configured as an LVM volume on the physical SSD or NVMe.

    While HPE is doing the hardware replacement, could they provide any analysis of why the Nodes are overheating?

    Regards

    Winston 

  • Re: HyperScale on HP reference hardware - problems after upgrade to v1.5
    Posted: 02-20-2020, 10:07 AM

    Yesterday I had a problem and had to do a full rebuild of my deduplication database which took just over two hours.  After this process completed, the temperature of the one SSD was 65 C (149 F).  What's odd is that later in the afternoon I checked and found nothing had been running on that server for the past 20 or so minutes, as shown by the load average from the Linux uptime command:

    [root]# uptime

    17:55:13 up  8:22,  2 users,  load average: 0.01, 0.04, 0.18

    Despite being idle for quite a while [Edit] and the fans running at 100% [/Edit], the temperature reading on the SSD has only dropped by 1 degree to 64 C (147 F).  I've shared this with HPE support.  I'll let you know what happens.
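
    For anyone who wants to log the same two readings over time, here is a minimal Python sketch that pulls the three load averages out of an uptime line like the one above and converts a Celsius reading to Fahrenheit.  The function names and the 0.2 idle threshold are assumptions made for illustration; none of this comes from HPE or Commvault tooling.

```python
import re


def parse_load_average(uptime_line):
    """Return the 1-, 5- and 15-minute load averages from an uptime line."""
    match = re.search(
        r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", uptime_line
    )
    if match is None:
        raise ValueError("no load average found in: %r" % uptime_line)
    return tuple(float(value) for value in match.groups())


def is_idle(load_averages, threshold=0.2):
    # Hypothetical rule: treat the node as idle when every window is below
    # the threshold. The 0.2 value is an assumption, not a vendor figure.
    return all(value < threshold for value in load_averages)


def celsius_to_fahrenheit(celsius):
    return celsius * 9.0 / 5.0 + 32.0


if __name__ == "__main__":
    line = "17:55:13 up  8:22,  2 users,  load average: 0.01, 0.04, 0.18"
    loads = parse_load_average(line)
    print(loads, is_idle(loads))
    print(round(celsius_to_fahrenheit(64)))
```

    Recording the load averages next to each temperature reading makes it easier to show support that a drive stays hot even when the node is idle.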

    Ken

  • Re: HyperScale on HP reference hardware - problems after upgrade to v1.5
    Posted: 02-23-2020, 3:18 AM

    Hi Ken

    Thank you for the update 

    It definitely sounds like something is not playing well from the hardware perspective, which has resulted in this weird phenomenon where the hardware is overheating.

    Would be great to see what HPE comes back with. 

    Keep us posted

    Regards

    Winston 

  • Re: HyperScale on HP reference hardware - problems after upgrade to v1.5
    Posted: 02-26-2020, 4:53 PM

    At HP's request, we have opened three support tickets, one for each node of the HyperScale.  Each ticket is assigned to a different analyst and each analyst came up with a different solution.  One node had a dedupe SSD replaced, one node had a NIC replaced, and one node got a new motherboard.  Within 24 hours the node with the new motherboard encountered an overheat condition.

    I am now of the opinion that the HyperScale 1.5 upgrade is not the source of the problem.  I now suspect that the configuration of the HP DL380 servers with 4 internal small form factor drives for OS and DDB plus up to 12 large form factor drives on the front panel (I'm running 8) simply generates more heat than can be dealt with effectively.  I suspect we were just under the threshold for overheating and the HyperScale 1.5 update pushes the system just a bit harder, leading to the problems I'm seeing now.  Unfortunately the iLO record only goes back 60 days so I'm not able to see if we had any high temperature warnings before the 1.5 upgrade.

    The tickets with HP remain open at this time.  I'll update if I learn anything new.

    Ken

  • Re: HyperScale on HP reference hardware - problems after upgrade to v1.5
    Posted: 02-28-2020, 2:03 PM

    Hey Ken, I have the same drive configuration in a Dell R740XD 3 node solution. It was configured just yesterday. I am interested in seeing how your SSDs and HDDs perform regarding any more temperature issues.

  • Re: HyperScale on HP reference hardware - problems after upgrade to v1.5
    Posted: 02-28-2020, 4:16 PM

    Here's a heat map obtained through the iLO on the servers.  The back corners are the internal SSDs and show temperatures in the 63 to 65 C range during the day while no backups are running.  (Recall that I've set my fans to run at 100% max speed.)  It was this discovery that led me to believe that the HyperScale 1.5 upgrade was not the true cause of the problem but rather just a contributing factor.

    It is the very rear, leftmost HDD temperature sensor that caused the most recent shutdown when it hit 103 C.

    Attachment: Server_temps.jpg
  • Re: HyperScale on HP reference hardware - problems after upgrade to v1.5
    Posted: 03-01-2020, 1:31 AM

    Hi Ken 

    Thanks for sharing the details

    Definitely a very interesting scenario.  Did HP ever mention whether the physical location of the Nodes could play a factor?

    Would moving the Nodes further up the Rack help, as this will allow for better air flow?

    Regards

    Winston 

  • Re: HyperScale on HP reference hardware - problems after upgrade to v1.5
    Posted: 03-02-2020, 10:38 AM

    There's been no discussion about moving the servers farther up the rack.  Personally, since warm air rises because it is less dense, I had assumed being lower in the rack would be better.  In any event, HP has seen the physical servers and made no comment about moving them. 

    Ken

  • Re: HyperScale on HP reference hardware - problems after upgrade to v1.5
    Posted: 03-03-2020, 4:16 PM

    We are focussing on the server with the most recent high-temperature issue and HP has decided to replace the four internal SSDs.  They are configured as RAID 5 so we will replace one then let the array rebuild, replace another and let the array rebuild, etc.  Since we're not sure how long the rebuild process will take, we'll only do one drive per day.  Unfortunately, due to staff availability, this process will end up taking more than four work days.
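
    The one-at-a-time approach follows from RAID 5 tolerating the loss of only a single member: pulling a second drive before the rebuild finishes would take the array down.  A minimal Python sketch of that ordering, assuming hypothetical drive labels and a caller-supplied check for rebuild completion (neither reflects actual HPE tooling):

```python
def rolling_replace(drives, rebuild_complete):
    """Replace RAID 5 members one at a time, waiting for each rebuild.

    drives: labels of the array members (hypothetical names).
    rebuild_complete: callable(label) -> bool, standing in for whatever
    status check the real controller offers.
    Returns the ordered list of actions taken.
    """
    log = []
    for label in drives:
        log.append("replace " + label)
        if not rebuild_complete(label):
            # Never touch the next drive while the array is still degraded.
            log.append("stop: rebuild of " + label + " not finished")
            break
        log.append("rebuilt " + label)
    return log


if __name__ == "__main__":
    ssds = ["ssd1", "ssd2", "ssd3", "ssd4"]
    for step in rolling_replace(ssds, lambda label: True):
        print(step)
```

    The early stop is the important part: the sequence only moves on to the next drive once the previous rebuild has been confirmed.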

    After all drives are replaced, our plan is to let the server run with the fans at 100% for two weeks.  If we don't see any high temperature issues during that time, we'll reduce the fan setting from High to Enhanced.  If the servers behave, I think we'll just leave the fans at the Enhanced setting, as previous experience found the fans run at 30 to 40% of maximum (during the day when no backups are running) which is fine.  The small amount of extra power and wear from Enhanced over Normal (fan speed at 15 to 30% of maximum) is negligible.

    The next update will be some time next week.

    Ken

  • Re: HyperScale on HP reference hardware - problems after upgrade to v1.5
    Posted: 03-19-2020, 8:20 AM

    Hi Ken

    Any further update on the investigation with HP?

    Would be great to hear some good news after the drive replacement.

    Regards

    Winston 

  • Re: HyperScale on HP reference hardware - problems after upgrade to v1.5
    Posted: 03-24-2020, 10:31 AM

    The server in which we replaced all four 960GB SSDs again shut down due to a high temperature situation on Mar 21 (Saturday).  Note:  The fans are still configured to run at 100% of max speed.  We are escalating the issue with HP support.  *Sigh*

    Ken
