Wednesday, March 27, 2013

xen_guest_start.pl

We've got a Xen cluster (well, more than one, but the same issues exist on all of them) that has been problematic over the years.  If the cluster started having issues, we'd do whatever we could to prevent it going down.  Sometimes that worked, sometimes it just bought us more time...and sometimes, well, it would go down anyway.  Recovery time kept increasing, to the point where we'd start blocking out a day, then often two days, before we'd turn the environment back over for testing and further recovery (on the application side).  Some servers, services and applications could be recovered in relatively short periods of time, but the way things worked, everything more or less needed to be up ASAP.

In the QA and Dev environments, that'd be bad enough.  If we stopped developers from developing code for half a day to two days (or more), things could slip noticeably.  Same if we stopped QA from being able to do their jobs.  It wasn't as if they could just switch to something else, the way they could when individual servers or services had issues...cluster issues meant everything was affected.

We knew why the clusters would go down - multicast traffic was in heavy use in our apps and was also what kept the clusters talking...but sooner or later, something would misbehave and the bullets would start flying.

Cluster recovery was, then, a huge pain point for Ops.  Without going into the details, this wasn't something that was easily solved via the cluster software, hardware, apps or the like.  As an admin, there really wasn't too much that could be done to prevent the issues (without replacing the clusters/etc.), but I figured I could take a whack at the recovery side.

What made recovery take so long?  Well, it was a very manual process, for one.  Of course, no one was ever left alone, either, so in recovering many dozens of nodes, you'd likely get interrupted quite a bit (no small number of requests relating to the recovery, of course, but also the usual interruptions as well).  There were classes of servers and services that were to be managed so as not to occupy the same physical box, too, so if you were sufficiently interrupted, you might have to backtrack to see where you were...

I started by querying each of the physical servers in the cluster, asking each one how many CPUs it had, what its total RAM was, etc.  The first line below uses a script we have that allows us to run arbitrary commands on classes of servers (one option is --subenv (logical environment), another is --loc (physical location)).


# Gather "hostname" plus the xentop Mem line from every physical host in the cluster
@xenhosts=`/opt/prodsa/bin/ssh-hosts --command 'hostname; xentop -i 1 -b|grep Mem' --subenv $virt`;
# clustat from any one cluster member reports the state of every VM service
@xenstate=`ssh $xenhost clustat`;

sub setup {
 foreach $line (@xenhosts) {
  chomp ($line);
  if ($line=~/Mem/) {
   # Pull total/used/free RAM and the CPU count out of the xentop Mem line
   $line=~s/\D+(\d+)\w\stotal, (\d+)\w\sused, (\d+)\w\sfree\s+CPUs:\s(\d+).*/$1 $2 $3 $4/;
   $mem{$hostname}=$1;
   $free{$hostname}=$3;
   $cpu{$hostname}=$4;


Yeah, I grab the info from remote xentop commands and then parse out the data.  If that format changes, the code needs to change.  Remember, though, this is all a hack ;)  Ideally, I shouldn't even need to do any of this...

I skip $2 (used RAM) because $1 minus $3 is $2.  I was initially displaying it, just because...but rather than go through and change the regex, I just didn't reference $2.  Welcome to my code ;)
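To see what that substitution is actually capturing, here's the same regex in isolation, run against a made-up xentop Mem line (the sample line is an assumption -- exact spacing and units can vary by Xen version):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical xentop output line -- the real format may differ by Xen version
my $line = "Mem: 33554432k total, 25165824k used, 8388608k free    CPUs: 8";

my ($total, $used, $free, $cpus) =
    $line =~ /(\d+)\w\stotal, (\d+)\w\sused, (\d+)\w\sfree\s+CPUs:\s(\d+)/;

# $used ($2) really is redundant: total minus free gets you the same number
print "total=$total free=$free cpus=$cpus\n";
```

Same four captures as the script's s/// version, just matched instead of substituted.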



  } else {
   # Not a Mem line, so it's the hostname that precedes one
   $hostname=$line;
   # Grab name/maxmem/vcpus from every VM config file -- only needs doing once
   @xenguests=`ssh $hostname 'egrep "(name|maxmem|vcpu)" /xenconfig/*'` unless (@xenguests);
  }
 }



This portion above attempts to go through the local /xenconfig/ directory on each physical host and grab the VM name, requested amount of RAM and requested number of VCPUs for each virtual server.  If it already has compiled that list (@xenguests), it skips doing it again.




 @sorted = sort { $mem{$b} <=> $mem{$a} } keys %mem; 



Simple sort of the physical-host hash, from most total RAM to least.



 # Parse clustat output: service lines look like "vm:<name>  <host>  <state>"
 foreach $vm (@xenstate) {
  chomp ($vm);
  $vm=~s/.*:(\S+)\s+(\S+)\s+(\S+)/$1 $2 $3/;
  ($xvm,$xhost,$state)=($1,$2,$3);
  $xenstate{$xvm}=$state;
  print "VM $xvm on $xhost is $state\n" if $xvm;
 }
}


Another kludgey hash, capturing the state of every VM on every physical host (migrating, running, disabled, etc.). This script will also recover random nodes that are down, so it's not just limited to recovering a completely down cluster...that's just how we wound up using it most of the time ;)


sub sort_vms {
 # @xenguests arrives as repeating triplets: name line, maxmem line, vcpus line
 while (@xenguests) {
  $name=shift(@xenguests); chomp $name;
  $mem=shift(@xenguests); chomp $mem;
  $cpu=shift(@xenguests); chomp $cpu;
  $name=~s/.*:name = "(\w+)".*/$1/;
  $mem=~s/.*:maxmem = (\d+).*/$1/; $mem=$mem*1000; # maxmem is MB; xentop reports KB
  $cpu=~s/.*:vcpus = (\d+).*/$1/;
  # Only try to place VMs that the cluster reports as down
  if (($xenstate{$name}=~/disabled/)||($xenstate{$name}=~/failed/)) {
I don't know why I didn't shift ($name,$mem,$cpu) in one whack.  Hell, I don't know why I didn't rip everything out and structure the data differently to begin with.  Again, welcome to my code.  It's sort of stream-of-consciousness.  Forget "sort of"...it is stream-of-consciousness programming.

This is why I'm doing this blog as an exercise to revisit my thought processes and try and improve...because holy hell, can I use it ;)  I try to tackle problems on the fly, here.  Once things work, any intention to go back and pretty things up or do things right...yeah, if it's not in the first iteration, there's rarely a second.  Elsewhere, I had time and support to properly address issues.  Here?  It just had to work...

  # Uncomment the next line to weigh host choice by free memory - leave commented out for round robin
  #@sorted = sort { $free{$b} <=> $free{$a} } keys %free;

I had wanted numerous options in doling out the VMs to the physical hosts.  Realistically, though, spreading them out via round-robin physical server selection was the cheap way of ensuring that conflicting classes of servers (those that used a lot of multicast) would automatically not land on the same host as others within the same class.

...so long as there were X physical hosts and less-than-X conflicting VMs ;)  Otherwise, they'd wrap around in the assignment queue.

   $target=shift(@sorted);
   print "Trying to fit $name on $target...";
   $count=1;
   # Keep cycling hosts until one has enough free RAM; numeric compare, not "gt"
   until ($free{$target} > $mem) {
    push(@sorted,$target);
    print "$count/".scalar(@sorted).": Not enough RAM for $name ($mem) on $target ($free{$target}), trying again...\n";
    $target=shift(@sorted);
    $count=$count+1;
    die "\nNeed SA intervention: cannot find a home for $name due to lack of free space on physical hosts\n" if ($count > scalar(@sorted));
   }

This is where the selected VM is being tested for fit (with the VM's requested RAM being compared to the server's available RAM), physical host by physical host.  I haven't made it clear, yet, but no action is actually being taken to start the VMs...this is all just trying to make those assignments ahead of actually doing the work.  If a VM can't fit on a host, it moves to the next...if it has checked all available physical hosts, the script dies and asks for manual intervention.

   $free{$target}=$free{$target}-$mem;
   $guests{$target}=$guests{$target}." $name";
  # Comment the next line if the sort-by-free-memory line above is uncommented (commented out = sort by free memory; uncommented = round robin)
   push(@sorted,$target);
   print "...fitted\n";
  } else {
   $skipped=$skipped."$name ";
  }
 }
} 

Subtract the VM's RAM from the target physical host's available RAM, since the VM fit above.  Add the VM name to a list of what's assigned to the target host (in a hash).  If a VM was skipped for whatever reason (already running, etc.), add it to a string tracking skipped VMs.
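That shift-off-the-front, push-to-the-back pattern is the whole round-robin mechanism, wrap-around included.  Here it is in miniature, with made-up host and VM names (three hosts, four conflicting VMs, so the fourth wraps back onto the first host):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @hosts = qw(xen01 xen02 xen03);               # hypothetical physical hosts
my @vms   = qw(mcast-a mcast-b mcast-c mcast-d); # hypothetical conflicting VMs

my %placement;
foreach my $vm (@vms) {
    my $target = shift @hosts;   # take the host at the front of the queue...
    push @hosts, $target;        # ...and rotate it to the back (round robin)
    $placement{$vm} = $target;
}

# mcast-d lands back on xen01 -- the wrap-around in action
print "$_ => $placement{$_}\n" for @vms;
```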

sub show_layout {
 foreach $key (sort keys %guests) {
  print "$key => $guests{$key}\n";
  @vms=split(/ /,$guests{$key});
  foreach $vm (@vms) {
   print "Starting $vm...\n";
   print "ssh $master clusvcadm -e vm:$vm -m $key\n";# if ($vm);
   print "\nHit <ENTER>:\n";
#   $userinput =  <STDIN>;
  }
 }
 print "Skipped: $skipped\n";
 print "I would've used $master\n";
}

This subroutine is labeled "show layout"...it goes through the list of VMs and their assigned target hosts, then prints the ssh command that'll start it up on the host.  The idea was that you'd have a list of commands that you could manually walk through or run as a batch.


Then I changed the print statement to just be backticks with the ssh command, followed by the <STDIN> input, waiting for the user to hit <ENTER> to cycle through VMs, one by one.


Then I commented out the input line, so the ssh commands would just run sequentially, as fast as the VMs would start up.
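So the loop's final shape ended up something like the sketch below -- with the real `ssh $master clusvcadm` call left as a comment (and made-up host/VM names) so it's safe to run outside the cluster:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $master = "xen01";   # hypothetical cluster master
my %guests = (          # hypothetical output of the fitting pass
    xen01 => " vm-a vm-b",
    xen02 => " vm-c",
);

my @started;
foreach my $key (sort keys %guests) {
    # Leading space in the guest string means split leaves an empty field; drop it
    foreach my $vm (grep { length } split / /, $guests{$key}) {
        print "Starting $vm on $key...\n";
        # The real script runs:  `ssh $master clusvcadm -e vm:$vm -m $key`;
        push @started, $vm;
    }
}
print scalar(@started)." VMs started\n";
```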


And that's how I got the recovery process automated down to about 2 hours ;)  VMs sorted out about as well as could be expected, with no host attempting to fit more VMs than it could handle (RAM-wise, anyway...VCPUs we over-committed on, often egregiously).


It's not quite no-nonsense, but there's a whole lot that I'm not doing, too.  Gathering the data could be done differently, starting up the VMs could be done differently, there could be a hell of a lot more code wrapped around everything to make it more resilient and safer...but I was coding for an average use case that has so far been exactly what we needed.

Fast and loose.

