One of the problems with virtualised infrastructure -- especially cloud servers -- is that your programs don't always get the CPU time that they need. The physical CPUs your application runs on are shared with other virtual machines on the same hardware. This is good for cost savings, but it would be nice to know how badly you are being affected by it.
I had a customer who had some really interesting VMware scheduling problems. They had a large number of multi-CPU virtual machines. VMware can't schedule a 4-CPU virtual machine unless there are 4 physical CPUs free. So even if you only have one tiny job to run on one CPU, VMware can't just schedule that one CPU -- it has to wait until at least 4 are available. (Incidentally, those other 3 CPUs that are scheduled but do nothing count towards co-stop%, which is another interesting performance-and-tuning metric.)
As it turns out, almost all their virtual machines had some small tasks going on in the background (e.g. cluster heartbeats), so VMware tried very hard to schedule them all as best as it could, but the result wasn't pretty. How bad was it?
I wrote a little program that slept for a second, woke up, recorded the time, and then went back to sleep again. It kept statistics about how late each wake-up was. There were some horror stories -- some virtual machines occasionally received no CPU time for more than 90 seconds! No wonder their clusters kept crashing -- the cluster heartbeat timeout was only 30 seconds, so there was no way the cluster could stay up with VMware starving it of CPU time like that.
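To make the idea concrete, here is a minimal Python sketch of that sleep-and-measure loop. It is not the am-i-scheduled source, just an illustration of the technique; the function names (`measure_scheduling_delay`, `summarise`) and the injectable `sleep` and `clock` parameters are my own assumptions, added so the loop can be exercised without waiting in real time.

```python
import time


def measure_scheduling_delay(iterations=10, interval=1.0,
                             sleep=time.sleep, clock=time.monotonic):
    """Sleep for `interval` seconds repeatedly and record how late each wake-up is.

    `sleep` and `clock` are injectable (a hypothetical convenience, not part
    of the original tool) so the loop can be tested with a fake clock.
    """
    delays = []
    for _ in range(iterations):
        before = clock()
        sleep(interval)
        elapsed = clock() - before
        # Anything beyond the requested interval is time we spent waiting
        # to be scheduled (plus ordinary timer jitter).
        delays.append(max(0.0, elapsed - interval))
    return delays


def summarise(delays):
    """Return simple statistics over the recorded delays (in seconds)."""
    if not delays:
        raise ValueError("no samples recorded")
    return {
        "samples": len(delays),
        "max_delay": max(delays),
        "mean_delay": sum(delays) / len(delays),
    }


if __name__ == "__main__":
    print(summarise(measure_scheduling_delay()))
```

On a healthy machine the maximum delay should be a tiny fraction of a second; a maximum that approaches your cluster's heartbeat timeout is the kind of horror story described above.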
Anyway, I tidied up that program ("am-i-scheduled") and packaged it for RHEL7; the binary is so simple that it will also run on Ubuntu unchanged. I suspect it will run almost anywhere.
If you run AWS EC2 servers or other cloud-hosted servers, you really want to install this. It is a convenient way to find out how much CPU time you aren't getting. If your virtual machine is co-located with another virtual machine that is being used for Bitcoin mining, password cracking, or a deep learning problem, you might want to terminate the instance and launch a new one. With am-i-scheduled, you can detect this easily and measure the impact you are experiencing.
Here's the source: https://bitbucket.org/solresol/am-i-scheduled and the binaries (including RPMs) can be downloaded from here: https://bitbucket.org/solresol/am-i-scheduled/downloads/