Monday 9 November 2015

Checking StoreOnce stores on Windows

In Data Protector 9.04, I've occasionally encountered a problem where the StoreOnce software store on Windows becomes completely unresponsive.

The error message in the session log varies, but it will often look something like this:

[Major] From: "IFOST backup"  Time: 9/11/2015 1:18:44 PM
[61:3003]      Lost connection to B2D gateway named "DataCentrePrimary"
    on host
    Ipc subsystem reports: "IPC Read Error
    System error: [10054] Connection reset by peer

One way to detect the problem is that the command "StoreOnceSoftware --list_stores" hangs instead of returning.

I created the following three batch files and scheduled CheckStoreOnceStatus.cmd to run once per hour:

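CheckStoreOnceStatus.cmd:

REM Launch the watchdog and the probe side by side.
start /b CheckStoreOnceStatusController.cmd
start /b CheckStoreOnceStatusChild.cmd
REM Wait up to 600 seconds before exiting.
waitfor /t 600 fiveminutes
exit /b

CheckStoreOnceStatusChild.cmd:

REM This hangs if the store is unresponsive.
StoreOnceSoftware --list_stores
REM Send the StoreOnceOK signal that the controller waits for.
waitfor /si StoreOnceOK

CheckStoreOnceStatusController.cmd:

WAITFOR /T 30 StoreOnceOK && (
  REM StoreOnce OK
  exit /b
)
REM StoreOnce failure
net stop StoreOnceSoftware
waitfor /t 120 GiveItTime
net start StoreOnceSoftware
exit /b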

Actually, I also added a call out to blat to send an email after the net start command.

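So, CheckStoreOnceStatus spawns *Controller and *Child. *Controller waits up to 30 seconds for a signal, which *Child sends as soon as it has been able to finish StoreOnceSoftware --list_stores. If the signal never arrives, *Controller stops the StoreOnceSoftware service, waits two minutes and starts it again.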
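
The hourly schedule could be set up from the command line rather than through the Task Scheduler GUI. This is only a sketch: the task name and the C:\Scripts folder are made up for illustration, and running as SYSTEM is an assumption (the account needs enough rights for net stop and net start).

```batch
REM Hypothetical task name and path; adjust to wherever the three .cmd files live.
REM Assumes C:\Scripts and StoreOnceSoftware are both on the system PATH.
schtasks /Create /TN "CheckStoreOnceStatus" /TR "C:\Scripts\CheckStoreOnceStatus.cmd" /SC HOURLY /RU SYSTEM
```

Because the scripts start each other by bare file name, the folder holding them needs to be on the PATH or be the task's working directory, otherwise the start /b lines won't find *Controller and *Child.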

