Monday 23 June 2014

Unknown error 1053 starting hpdp-idb-cp

Today I was installing HP Data Protector on a Linux cell server. During the installation, I saw an error message as it tried to install the internal database connection pooling process. This is what it said:

ERROR: Unable to Start IDB CP (Return code = 1)
For more detail please refer to /var/opt/omni/server/log/DPIDBsetup_5216.log
error: %post(OB2-CS-A.08.10-1.x86_64) scriptlet failed, exit status 3

(The 5216 is a process ID; it changes on each invocation.)
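Because the PID suffix changes on every run, it can be handy to pick out the newest setup log rather than guess the number. This is a minimal sketch, assuming the logs follow the DPIDBsetup_<pid>.log naming seen in the error message; the function name latest_setup_log is hypothetical and not part of Data Protector.

```shell
#!/bin/sh
# Hypothetical helper: print the most recently modified DPIDBsetup_*.log
# in the given directory (defaults to the directory named in the error).
latest_setup_log() {
    logdir="${1:-/var/opt/omni/server/log}"
    # ls -t sorts newest first; the error for "no matching files" is discarded.
    ls -t "$logdir"/DPIDBsetup_*.log 2>/dev/null | head -n 1
}
```

Something like tail -n 50 "$(latest_setup_log)" would then show the end of the latest log.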

Running omnisv -start produces the delightfully unhelpful

Cannot start "hpdp-idb-cp" service, system error:[1053] Unknown error 1053

Error 1053 seems to be a Windows error message that someone has decided a Linux-based cell-manager needs to be compatible with!

Digging a bit deeper, running the SYSV / upstart / init start-up script directly with "/etc/rc.d/init.d/hpdp-idb-cp start" was slightly more helpful:

FATAL Cannot load config file
hpdp-idb-cp started

It hadn't actually started, of course; the init script just blindly reports that it has, without checking $? for an error code.
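For comparison, here is a sketch of what a more careful start-up wrapper could look like. It is not what the shipped init script does. It assumes the start command exits non-zero when it fails, and the function name start_checked is made up for illustration.

```shell
#!/bin/sh
# Hypothetical wrapper: run a start command and only claim success
# if the command actually exited with status 0.
start_checked() {
    name="$1"
    shift
    if "$@"; then
        echo "$name started"
        return 0
    else
        rc=$?   # exit status of the failed start command
        echo "$name failed to start (exit status $rc)" >&2
        return "$rc"
    fi
}
```

It would be called as start_checked hpdp-idb-cp followed by the real start command and its arguments.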

Walking through the init script, there's a line

su hpdp -c "LD_LIBRARY_PATH=/opt/omni/idb/lib:$LD_LIBRARY_PATH /opt/omni/idb/bin/pgbouncer -d /etc/opt/omni/server/idb//hpdp-idb-cp.cfg"

That makes sense: the IDB connection pooler runs as hpdp, and hpdp-idb-cp.cfg is the configuration file which says what port number to connect on, and various other useful parameters.

The file itself was readable, but for some reason the installer had failed to set the right permissions on /etc/opt/omni/server (it was unreadable to anyone but root), so the hpdp user could not reach the configuration file underneath it.
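A rough way to find which directory on the way to a file is the blocker is to walk up the path and look at the permission bits. The sketch below is a simplification: it only checks the "other" bits reported by ls, so it ignores owner/group membership and ACLs. The function name first_blocked_dir and its optional second argument are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: starting at the directory holding PATH, walk up to TOP
# (default /) and print the highest directory that "other" users cannot enter.
# Simplification: only the "other" permission bits are examined.
first_blocked_dir() {
    dir=$(dirname "$1")
    top="${2:-/}"
    blocked=""
    while :; do
        # First 10 characters of the ls -ld output, e.g. drwxr-x---
        perms=$(ls -ldL "$dir" 2>/dev/null | cut -c1-10)
        case "$perms" in
            d????????[xt]) ;;        # others have the execute (search) bit
            *) blocked="$dir" ;;     # remember it; a higher directory may override
        esac
        if [ "$dir" = "$top" ] || [ "$dir" = "/" ] || [ "$dir" = "." ]; then
            break
        fi
        dir=$(dirname "$dir")
    done
    if [ -n "$blocked" ]; then
        echo "$blocked"
    fi
}
```

Running first_blocked_dir /etc/opt/omni/server/idb/hpdp-idb-cp.cfg prints nothing when every directory on the path is open to other users.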

So a quick
chmod a+rx /etc/opt/omni/server
fixed the permissions, and then I could restart the installation...
./omnisetup.sh ... -IS

See also: unknown error 1053 starting hpdp-as.

Greg Baker is an independent consultant who happens to do a lot of work on HP DataProtector. He is the author of the only published book on HP Data Protector (http://x.ifost.org.au/dp-book). He works with HP and HP partner companies to solve the hardest big-data problems (especially around backup). See more at IFOST's DataProtector pages at http://www.ifost.org.au/dataprotector

Thursday 19 June 2014

VMware, Data Protector and virtual machines which won't consolidate

Working on a customer's systems recently, I found a large number of virtual machines with the following error message:

Configuration Issues
Virtual machine disks consolidation is needed.

But if I tried to right-click in vCenter and select Snapshots -> Consolidate, what I got was "unable to access file <unspecified filename> since it is locked".

This was also causing error messages in the backup log, because HP Data Protector attempts to consolidate disks at the start of a full backup.

The VMware KB articles suggested various things to identify the lock. I ssh'ed in and ran
tail -f vmware.log | grep lock
to identify what the lock could be. As it turned out, it wasn't quite a lock. The file that couldn't be opened was a .vmdk file - no surprises there. So I ran
lsof | grep the-vmdk-file
This showed that two different processes had it open.
ps | grep process-id-from-the-previous-step
showed that the two processes were both /bin/vmx, but it was possible to distinguish them by their child vmx-vthread processes.

One of them was the process running the virtual machine (no surprises there), and the other was a process belonging to the hostname of the computer that runs their HP DataProtector VEPA agent.
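The "which processes have this file open" step can be wrapped in a small filter. This is only a sketch: the column that holds the process ID differs between lsof builds, so the column number is a parameter and is an assumption to check against the actual output. The function name holders_of is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read an lsof-style listing on stdin and print the
# unique values of one column for every line that mentions the given file.
# The column holding the process ID varies between lsof builds, so it is
# passed in (default: column 1) rather than hard-coded.
holders_of() {
    awk -v f="$1" -v col="${2:-1}" 'index($0, f) > 0 { print $col }' | sort -u
}
```

Usage would be along the lines of lsof | holders_of the-vmdk-file 1. More than one line of output means more than one process has the file open.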

This customer has a virtual machine inside their VMware environment which runs their VMware backups. They don't have to worry about correctly presenting LUNs or having an extra device attached to their SAN fabric. They do source-side deduplicated backups from this virtual machine, so it doesn't generate as much network traffic as it otherwise would.

What had happened was that some backup had failed spectacularly, leaving the snapshots mounted on the VEPA agent virtual machine. Looking at the settings for the agent virtual machine, it proudly said that it had 13 virtual disks, when it should only have had one: its boot disk.

Naturally, VMware couldn't consolidate the snapshots because, as far as it was concerned, those snapshots were still in use. VMware couldn't delete the virtual disks off the agent machine either, because there were snapshots depending on them.

So the solution was:

  • Remove the snapshots on the agent machine.
  • Remove the extraneous disks from the agent machine.
  • Run the snapshot consolidation from the vCenter GUI.


Friday 13 June 2014

What a server-less retailer looks like

I've been helping a retail company with a few stores around Sydney. They don't have dedicated IT staff, so every support call is expensive, and having their own servers is very hard.

I got involved because they needed an internal ordering system, which I implemented as a collection of Google Spreadsheets. This was only a temporary fix, as I'm not a great fan of long-term data or business processes sitting in spreadsheets. But one thing led to another and they brought their email over to Google Apps.

To access the spreadsheets and their mail there is a Chromebook in each store. These have been very, very reliable. Months go past without any need for any IT support to fix anything.

They switched over to Saasu for their accounting mainly because they had multiple locations and needed staff to have access to invoicing at the same time. Xero would have been a possibility, but Saasu was (and still is) probably the easier product to use.

Recently they deployed Kounta as a point-of-sale system using iPads together with some Epson POS printers. This was mostly because Kounta supported Saasu; the alternative would have been Vend. In retrospect, Kounta isn't really designed for retailers: their home turf is restaurants and cafes.
Having the sales report directly into Saasu automatically means that a whole chunk of head-office book-keeping has been eliminated. It's hard to imagine how this could have been done as efficiently if they were still using an on-premises copy of MYOB.
Reporting on what has been sold when has been helpful. For example, they've discovered that some of their frivolous accessory items are actually their best sellers, and they have been able to tweak their pricing as a result.
Internet access is crucial; most of the stores have an ADSL service which can fall back to a 4G service if the ADSL is not working. The original plan was to use Fritz! Boxes to do this, but the 4G modems that were available didn't actually work. Something more up-market (e.g. a Huawei or Cisco) would probably have worked, might have provided better diagnostic tools, and would have let a third party support the network, but they couldn't justify the extra cost.
The procedure now is "in the event of an ADSL failure, turn off the router, and take the 4G modem out of the cupboard". Each store only has three important devices: the Chromebook (mentioned before), the iPad (which is the point-of-sale system) and the receipt printer. The receipt printer is not wireless, so during ADSL outages they can't print receipts. Very few of their customers need receipts, so this isn't a big deal. Everything else will pick up on the other router.
Finally, they use Shopify for their internet ordering, again mostly because of its integration with Saasu. They use a combination of Shopify's iPhone app and email to notify their staff about the order. This is then manually recorded in their point-of-sale system. Shopify also have a point-of-sale system and this might have been the best option, but Shopify charge a percentage of sales on their low-end plans and this ruled it out. No problems about this for internet sales, but for walk-in retail this seemed a bit greedy.
While there are perhaps more recurring costs (Google Apps, Kounta, Shopify, Saasu and the double internet connection at each store) than with a more traditional solution, their savings do appear to more than make up for it. The end result is not enterprise-grade equipment by any means. There has been more cost-cutting, and more compromise, than I am happy with. But the up-front costs are extremely low: lower than they would have been even to put a single good-quality server in at each store, let alone having fail-over server pairs.
Since backup is one of my main interests, it's worth noting that they only have four data repositories of any importance:
  • Accounting data in Saasu. Xero have an ecosystem of backup providers; Saasu doesn't seem to. However, if Saasu failed for the long term for any reason, they would have a very reasonable excuse to the tax office for their lack of records. (There would in fact be thousands of businesses affected, so the tax office would have to make some declaration about how to handle it.)
  • Email and files in Google Apps. Spanning have a backup product for this with a remarkable recovery guarantee.
  • Shopify sales. They are duplicated into the point-of-sale system and into Saasu, so if Shopify disappeared there is no financial information lost. There would be customer contact information lost, so perhaps they should use the integration with Mailchimp or feed it into another CRM.
  • Sales records from Kounta. This is a bit of an area of weakness; the only option is manual exports. Vend seems to have better options here.
There are a few Windows desktops in head-office, but as the retail arm is just one part of the larger group, it's fair enough to say that the retailer is essentially server-less. While the transition away from mainframes was momentous in its time ("no tier 1 company can operate without a mainframe") and filled with multi-million dollar replacement projects, the transition away from Windows servers and maintaining internal Active Directory systems seems to be more of a slow slide. It won't be a big occasion: one day we'll wake up to discover that we've turned off the last Windows server and that nobody noticed.

Wednesday 11 June 2014

An odd thought about Michael O. Church's writings

I've never met Michael O. Church. The only things we have in common (as far as I know) are that we both left Google under less-than-happy circumstances and both do big-data / large-scale IT / artificial intelligence work. But I find myself reading every one of his blog posts. For the most part, they are some of the most existentially depressing posts of any author I've ever read. And yet, I'm morbidly fascinated.

For a quick summary, most of his postings can be grouped into one of two major themes:
  • The way technology companies are run is fundamentally unsound, inefficient and unjust.
  • Silicon Valley culture has many, many problems because of this.
He does write about other matters as well, but these are the themes which seem to be the most divisive. The response from the tech community on these kinds of posts is one of two polar extremes:
  • "That is exactly the problem; Michael has expressed something that I was aware of but was never able to put into words".
  •  "What rubbish. Michael is an idiot."
As I'm getting older, I keep thinking about how future generations will see this current time -- on the cusp of the computer intelligence era -- and how future historians will understand our responses to it. It occurred to me today that Michael O. Church is probably going to be one of "those references" that future students will be forced to read.

I say that because modern-day historians are expected to be at least passingly familiar with Marx.

I don't mean that Michael O. Church is a Marxist. (Perhaps he is; I don't know. I'm not a Marxist either and haven't been bothered to learn all that much about it.) What I mean is that Marx was writing incitefully and insightfully about the interactions between capital and labour at a time of vast technological and social change, desperately trying to say "does it really have to be this way?" and "are there any alternatives?"

Step back to 1844, when Marx was writing the early manuscripts that would eventually lead to Das Kapital. Railways were being built across Europe. These were completely over-priced investments to move people from one end of a line to another that should never have been profitable. But incredibly, one minor innovation (putting train stations along the journey so that people and goods could get on and off and complete a short fraction of the journey) turned the over-priced bubble into very successful investments and an engine for innovation, communication and economic development.

The closest analogy in modern times would be the dotcom boom of the late 1990s which -- despite wasting vast quantities of money -- turned out OK for enough investors, and brought us the Google search engine, a large amount of Linux infrastructure, a semi-pervasive internet and various other goodies.

The second railway mania (in the 1850-1860s) didn't end so well, but that was still in the future. 

The second mania was in fact dominated by many problems that would fit into a Michael O. Church blog: colonisation of the engineering investment by the lowest of low-life hucksters, exploitative working conditions for the engineers, cushy positions for management. (It was considered impossible that any company could run something so complicated as a long railway track. It had to be privately owned by the wealthy initially. And company management should only ever be a minor, part-time task.) Financial sanity had long since been left behind. It was in fact the bloggers of the day (journalists writing at the scrap-ends of the newspaper) that started doing financial analysis. Their writings led to the invention of audited financial statements and the rise of the big accounting firms.

Leading into this era, Karl Marx wrote (in his spare time while holding down a job):
  • Vast numbers of words on topics to do with the way companies run that were essentially being ignored by everyone else. (So does Michael O. Church).
  • A model for how the excess of productive capacity gets distributed. (So does Michael O. Church).
  • Understandings of the inner workings of capitalism. (So does MOC).
  • Some ideas that might work on how to resolve some obvious problems. (MOC writes about open allocation, for example).
  • A lot of stuff that a lot of people disagree with and take issue with. (MOC likewise). 
I keep finding more analogies, but I wonder whether I'm stretching it at this point.

Try as I might, though, I can't think of any other blog author who would be deeply historically interesting to anyone living in the 2150s. But MOC's blog (if it is still accessible by then) would surely end up as a footnote or two in a thesis on the end of the Silicon Valley era.

And depressingly, almost nothing I've ever written so far in my life will be of any interest to anyone. At best (or perhaps worst), some future English literature students might be footnoting "geek poetry of the early cyber era".

Greg Baker is an independent consultant who writes, programs, thinks and fixes things to do with computers, IT and all things technical for customers who don't want to pay for expensive consulting firms. Contact him (gregb@ifost.org.au) if you have challenging problems you need solved.

Tuesday 10 June 2014

When VMware NBD and NBDSSL backups fail

I was working on some VMware backups when I ran into this strange sequence of messages: a backup session showing that it is quite possible to back up a VMX file, but not the VMDK files. And the error messages in the session aren't very informative!


[Normal] From: BSM@cell-manager.ifost.org.au "NBD test backup"  Time: 19/05/2014 10:30:16 AM
        Backup session 2014/05/19-5 started.

[Normal] From: BSM@cell-manager.ifost.org.au "NBD test backup"  Time: 19/05/2014 10:30:16 AM
        OB2BAR application on "vepa-agent.ifost.org.au" successfully started.

[Normal] From: VEPALIB_VMWARE@vepa-agent.ifost.org.au "/DC"  Time: 19/05/2014 10:30:17 AM
        Resolving objects for backup on vCenter 'vcenter.ifost.org.au' ... 

[Normal] From: VEPALIB_VMWARE@vepa-agent.ifost.org.au "/DC"  Time: 19/05/2014 10:30:33 AM
        Add Virtual Machine to the backup ... 
                Name: VM1
                Path: /DC/Discovered virtual machine/VM1
                InstanceUUID: 52dbf234-252e-c5dd-9df5-51c304bcf312

[Normal] From: VEPALIB_VMWARE@vepa-agent.ifost.org.au "/DC"  Time: 19/05/2014 10:30:35 AM
        Virtual Machine 'VM1': Locking vMotion ... 

[Warning] From: VEPALIB_VMWARE@vepa-agent.ifost.org.au "/DC"  Time: 19/05/2014 10:30:35 AM
         Virtual Machine 'VM1': vMotion is in Progress.

[Warning] From: VEPALIB_VMWARE@vepa-agent.ifost.org.au "/DC"  Time: 19/05/2014 10:31:08 AM
        Virtual Machine 'VM1': Could not lock vMotion.


Everything's pretty much fine. There are lots of reasons for a vMotion lock to fail.

[Normal] From: VEPALIB_VMWARE@vepa-agent.ifost.org.au "/DC"  Time: 19/05/2014 10:31:13 AM
        Creating folder /var/opt/omni/tmp/55fd5af8-f853-403e-bedf-2d1e60e418dd ... 

[Normal] From: VEPALIB_VMWARE@vepa-agent.ifost.org.au "/DC"  Time: 19/05/2014 10:31:19 AM  
        Virtual Machine 'VM1': Backing up configuration file VM1.vmx ... 

[Normal] From: VEPALIB_VMWARE@vepa-agent.ifost.org.au "/DC"  Time: 19/05/2014 10:31:20 AM
        Virtual Machine 'VM1': Creating snapshot ... 

[Normal] From: VEPALIB_VMWARE@vepa-agent.ifost.org.au "/DC"  Time: 19/05/2014 10:31:59 AM
        Virtual Machine 'VM1': Optimizing disk scsi0:0 ... 

[Normal] From: VEPALIB_VMWARE@vepa-agent.ifost.org.au "/DC"  Time: 19/05/2014 10:32:00 AM
        Virtual Machine 'VM1': Backing up VSS manifest  VM1/VM1-vss_manifests11.zip.

And now for the interesting part:


[Major] From: VEPALIB_VMWARE@vepa-agent.ifost.org.au "/DC"  Time: 19/05/2014 10:33:15 AM
        Virtual Machine 'VM1': Could not backup disk scsi0:0 ... 

[Major] From: VEPALIB_VMWARE@vepa-agent.ifost.org.au "/DC"  Time: 19/05/2014 10:33:15 AM
[172:162]       Virtual Machine 'VM1': No disk backed up ... 

[Critical] From: VEPALIB_VMWARE@vepa-agent.ifost.org.au "/DC"  Time: 19/05/2014 10:33:15 AM
        Backup of object failed.
                Name: VM1
                Path: /DC/Test_VMs/VM1
                InstanceUUID: 52dbf234-252e-c5dd-9df5-51c304bcf312

[Normal] From: VEPALIB_VMWARE@vepa-agent.ifost.org.au "/DC"  Time: 19/05/2014 10:33:16 AM
        Virtual Machine 'VM1': Removing snapshot ... 

[Normal] From: VEPALIB_VMWARE@vepa-agent.ifost.org.au "/DC"  Time: 19/05/2014 10:33:24 AM
        Virtual Machine 'VM1': Unlocking vMotion ... 

Deleted directory /var/opt/omni/tmp/564d01b7-7910-97a0-d54d-85c11ff8becd-vm-58/nbd
Deleted directory /var/opt/omni/tmp/564d01b7-7910-97a0-d54d-85c11ff8becd-vm-58/nbdssl
Deleted directory /var/opt/omni/tmp/564d01b7-7910-97a0-d54d-85c11ff8becd-vm-58/hotadd

[Normal] From: BSM@cell-manager.ifost.org.au "NBD test backup"  Time: 19/05/2014 10:33:53 AM
        OB2BAR application on "vepa-agent.ifost.org.au" disconnected.

I've truncated the rest of the messages.
Turning up the debugging level, the debug logs showed this:
[110] [VddkUtil::diskLibLog] NBD_ClientOpen: attempting to create connection to vpxa-nfcssl://[ESXi-MGMT-VMFS-1] VM1/VM1.vmdk@esxi1.ifost.org.au:902

[110] [VddkUtil::diskLibLog] Started up WSA

[110] [VddkUtil::diskLibLog] CnxOpenTCPSocket: Cannot connect to server esxi1.ifost.org.au:902: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

[110] [VddkUtil::diskLibLog] CnxAuthdConnect: Returning false because CnxAuthdConnectTCP failed

[110] [VddkUtil::diskLibLog] CnxConnectAuthd: Returning false because CnxAuthdConnect failed

[110] [VddkUtil::diskLibLog] Cnx_Connect: Returning false because CnxConnectAuthd failed

[110] [VddkUtil::diskLibLog] Cnx_Connect: Error message: Failed to connect to server esxi1.ifost.org.au:902

[ 20] [VddkUtil::diskLibWarning] [NFC ERROR] NfcNewAuthdConnectionEx: Failed to connect to peer. Error: Failed to connect to server esxi1.ifost.org.au:902

[110] [VddkUtil::diskLibLog] NBD_ClientOpen: Couldn't connect to esxi1.ifost.org.au:902 Failed to connect to server esxi1.ifost.org.au:902
The clue is the failed connection to esxi1.ifost.org.au. The VEPA backup agent obviously has to connect to the vCenter server in order to start a backup, but because there was no SAN connectivity between the VEPA agent and the LUNs supporting the VM1 virtual machine's VMDK files, the VEPA agent ends up having to talk to the ESX server directly as well.

There can be many reasons for this connection to fail: a firewall could be blocking the connection between the VEPA agent and the ESX server, for example. In this case, there was no DNS entry for esxi1.ifost.org.au.
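Both failure modes (name resolution and a blocked port) can be told apart with a quick check from a machine that sees the network the same way the agent does. The sketch below assumes a Linux shell with getent, bash and coreutils timeout available; the function name check_nbd_target is hypothetical, and port 902 is simply the port that appears in the debug log above.

```shell
#!/bin/sh
# Hypothetical helper: check that a host name resolves and that a TCP
# connection to the given port (default 902, as in the debug log) succeeds.
check_nbd_target() {
    host="$1"
    port="${2:-902}"
    # Step 1: name resolution through the system resolver.
    # (getent is an assumption; it exists on most glibc-based Linux systems.)
    if ! getent hosts "$host" >/dev/null 2>&1; then
        echo "cannot resolve $host"
        return 1
    fi
    # Step 2: TCP connect via bash's /dev/tcp pseudo-device, limited to
    # 5 seconds (assumes bash and coreutils timeout are installed).
    if ! timeout 5 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null; then
        echo "resolved $host but cannot connect to port $port"
        return 2
    fi
    echo "$host:$port reachable"
}
```

A return status of 1 points at DNS (or the hosts file), and 2 points at a firewall or a service that isn't listening.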


Greg Baker is an independent consultant working on HP DataProtector, LiveVault and many other technologies. He is the author of the only published book on HP Data Protector (http://x.ifost.org.au/dp-book). See more at IFOST's DataProtector pages at http://www.ifost.org.au/dataprotector