The error message that you will see in the session log is unpredictable, but it will often look something like this:
[Major] From: BSM@cellmgr.ifost.org.au "IFOST backup" Time: 9/11/2015 1:18:44 PM
[61:3003] Lost connection to B2D gateway named "DataCentrePrimary"
on host storeonce.ifost.org.au.
Ipc subsystem reports: "IPC Read Error
System error: [10054] Connection reset by peer
"
One way to detect the problem was that the command "StoreOnceSoftware --list_stores" would hang instead of returning.
I created the following three batch files and scheduled CheckStoreOnceStatus.cmd to run once per hour:
CheckStoreOnceStatus.cmd
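REM Launch the controller and the child in the background
start /b CheckStoreOnceStatusController.cmd
start /b CheckStoreOnceStatusChild.cmd
REM Nothing sends the fiveminutes signal, so this just pauses for 600 seconds (ten minutes)
waitfor /t 600 fiveminutes
exit /b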
CheckStoreOnceStatusChild.cmd
REM If this command hangs, the signal on the next line is never sent
StoreOnceSoftware --list_stores
WAITFOR /SI StoreOnceOK
CheckStoreOnceStatusController.cmd
WAITFOR /T 30 StoreOnceOK && (
REM StoreOnce OK
exit /b
)
REM StoreOnce failure
net stop StoreOnceSoftware
waitfor /t 120 GiveItTime
net start StoreOnceSoftware
exit /b
Actually, I also added a call out to blat to send an email after the net start command.
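So, CheckStoreOnceStatus spawns off *Controller and *Child. *Child sends the StoreOnceOK signal as soon as it has been able to finish StoreOnceSoftware --list_stores. *Controller waits up to 30 seconds for that signal; if it never arrives, *Controller stops the StoreOnceSoftware service, waits two minutes, and starts it again.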
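For comparison, here is a minimal Python sketch of the same watchdog idea. It is not what the batch files above do internally: instead of WAITFOR signals it relies on the timeout argument of subprocess.run. The function names command_hangs and check_store_once are made up for illustration, and the 30 and 120 second defaults simply mirror the values in the batch files.

```python
import subprocess
import time


def command_hangs(cmd, timeout_seconds):
    """Return True if cmd has not finished within timeout_seconds."""
    try:
        subprocess.run(
            cmd,
            timeout=timeout_seconds,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this.
        return True
    return False


def check_store_once(probe, stop, start, timeout_seconds=30,
                     settle_seconds=120, run=subprocess.run, sleep=time.sleep):
    """Restart the service if the probe command hangs.

    run and sleep are parameters only so they can be swapped out in a dry run.
    """
    if not command_hangs(probe, timeout_seconds):
        return "ok"
    run(stop)              # e.g. ["net", "stop", "StoreOnceSoftware"]
    sleep(settle_seconds)  # give the service time to shut down
    run(start)             # e.g. ["net", "start", "StoreOnceSoftware"]
    return "restarted"
```

Calling check_store_once(["StoreOnceSoftware", "--list_stores"], ["net", "stop", "StoreOnceSoftware"], ["net", "start", "StoreOnceSoftware"]) from an hourly scheduled task would roughly mirror the three batch files, assuming Python is installed on the host and the task runs with enough privileges to stop and start the service.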
Greg Baker is an independent consultant who happens to do a lot of work on HP DataProtector. He is the author of the only published books on HP Data Protector (http://www.ifost.org.au/books/#dp). He works with HP and HP partner companies to solve the hardest big-data problems (especially around backup). See more at IFOST's DataProtector pages at http://www.ifost.org.au/dataprotector