Friday, March 29, 2013

Another script to address efficiency...

We have a sort-of-unfunded-mandate archive server that started out small and quickly grew once people saw it existed.  Product Owners started telling customers that we could keep data (both raw input and processed output), which could prove useful, especially in determining whether what they (and their customers) saw matched what we were pushing to them, and, if it differed, where along the chain (source -> us -> them -> their customers) it did.

It quickly grew from about 1TB to 18TB.  We go through a lot of data every day.  Few projects were actually willing to pay for the storage, though, let alone for proper supporting infrastructure (including any redundancy or the like) and any sort of client interface to the data or administrative tools to manage everything.

As things grew (rather quickly), we were keeping pace on the archiving side of things.  We do staggered archival rsyncs from a few dozen hosts to the (single) archive server, spread out across the hour and recurring every hour.  Easy enough.  System load never really climbs above 5, and that's during an average heavy period with numerous rsyncs going at once (even with the syncs spread across the hour, 6+ will overlap at any given time).
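
To make the stagger concrete, here's a sketch of one way to drive it (the paths and the hostname-derived offset are made up for illustration; this isn't our actual wrapper):

 # Each source host's hourly pull sleeps to a per-host minute of the hour before
 # rsyncing, so a few dozen pulls don't all start at the top of the hour.
 my $host = $ARGV[0] or die "usage: $0 <source-host>\n";
 my $offset = (unpack("%32C*", $host) % 60) * 60;   # hostname checksum -> 0-59 minutes, in seconds
 sleep $offset;
 system("rsync", "-a", "$host:/data/outbound/", "/archive/data/$host/");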

One big problem, though, was in cleaning up.  At first, someone was culling data via stages of find commands, intending to cut some amount of the data over each iteration (which was 7 days or something similar).

Well, there were problems with that.

If I remember correctly, the cleanup worked on timestamps.  The first round would cut anything matching even-numbered minutes, the second round would cut anything with a timestamp ending in 0 or 5...
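
In other words, something like this (a reconstruction of the idea, not the original commands; the filename layout, with names ending in an HHMM timestamp, is assumed):

 system(q{find /archive -type f -name '*[02468].dat' -delete});   # round 1: even-numbered minutes
 system(q{find /archive -type f -name '*[05].dat' -delete});      # round 2: minutes ending in 0 or 5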

One problem was that the files weren't exactly evenly distributed that way (we might have 10 files in the 03 minute and 1 file in the 04 minute).  Another issue was that a lot more data was being cut than intended, since half of the "0 or 5" data was already gone (the *0 minutes having been cut in the first round).  Often, clients would wonder why they had a lot less to work with...

So, when I was working on the archive server, I went to pure timestamp purges.  We would retain as much as we could (space allowing), then tweak down from there to allow for some growth, weighing retention levels against data size, criticality and so forth.  Some feeds had 7-day retentions, others could get 14 days or 40 days (which sort of just wound up being a magical number that worked, rather than 30 or 45).
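
That stage boiled down to per-feed find commands, along these lines (feed names, paths and retention values are illustrative):

 # Illustrative per-feed purge: delete anything older than the feed's retention window.
 my %retention = ( feed_a => 7, feed_b => 14, feed_c => 40 );   # days
 foreach my $feed (sort keys %retention) {
     system("find /archive/$feed -type f -mtime +$retention{$feed} -delete");
 }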

Except things kept growing.  If you've ever tried to do finds on large numbers of files in arbitrary numbers of directories across, say, 18TB of storage...

In recent months, it was taking the better part of a *day* for the cleanup scripts to run.  These scripts ran once per day as it was...so taking that long was just untenable.  It also pointed to a future issue where, with data (and storage) likely to increase, those 14-hour finds would start taking even longer, pretty quickly.  Not to mention that system load was starting to average in the 10-30 range for most of the day.

So I came up with an idea.  We're already rsyncing the data...why not take the rsync output and use it as a manifest?  Then come around every day and _just check the timestamp on those manifests_.  If a feed has a retention policy of 7 days, then it'd just pick up any manifest for that feed that is over 7 days old, iterate over the _list of files in the manifest_ and delete those.  The find command wouldn't need to start digging down into 5 or more levels of directories, looking for old data.  It'd only have to check a single directory that houses all of the manifests.
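
A minimal sketch of that cleanup pass, assuming each rsync's transferred-file list (captured with something like --out-format='%n') gets written to a per-run manifest.  The manifest location, naming and retention table below are stand-ins, not our actual layout:

 # Cleanup only ever looks at the manifest directory; it never walks the archive itself.
 use strict;
 use warnings;
 use File::stat;

 my %retention    = ( feed_a => 7, feed_b => 14, feed_c => 40 );   # days per feed
 my $manifest_dir = '/archive/manifests';
 my $archive_root = '/archive/data';

 foreach my $feed (sort keys %retention) {
     my $cutoff = time() - $retention{$feed} * 86400;
     foreach my $manifest (glob "$manifest_dir/$feed/*.manifest") {
         my $st = stat($manifest) or next;
         next if $st->mtime >= $cutoff;                 # manifest still within retention
         if (open my $fh, '<', $manifest) {
             while (my $path = <$fh>) {
                 chomp $path;
                 unlink "$archive_root/$path";          # delete the file named in the manifest
             }
             close $fh;
             unlink $manifest;                          # and retire the manifest itself
         } else {
             warn "Can't read $manifest: $!";
         }
     }
 }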

The 14-hour find commands purging old data became mere seconds of work.  System load is down to under 1, steady state.

Wednesday, March 27, 2013

xen_guest_start.pl

We've got a Xen cluster (well, more than one, but the same issues exist on all of them) that has been problematic over the years.  If the cluster started having issues, we'd do whatever we could to prevent it going down.  Sometimes that worked, sometimes it just bought us more time...and sometimes, well, it would go down anyway.  Recovery time kept increasing, to the point where we'd start blocking out a day, then often two days, before we'd turn the environment back over for testing and further recovery (on the application side).  Some servers, services and applications could be recovered in relatively short periods of time, but the way things worked, everything more or less needed to be up ASAP.

In the QA and Dev environments, that'd be bad enough.  If we stopped developers from developing code for half a day to two days (or more), things could slip noticeably.  Same if we stopped QA from being able to do their jobs.  It wasn't as if they could just switch to something else, the way they might when individual servers or services had issues...cluster issues meant everything was affected.

We knew why the clusters would go down - multicast traffic is in heavy use in our apps and is also what keeps the cluster nodes talking to each other...but sooner or later, something would misbehave and the bullets would start flying.

Cluster recovery was, then, a huge pain point for Ops.  Without going into the details, this wasn't something that was easily solved via the cluster software, hardware, apps or the like.  As an admin, there really wasn't too much that could be done to prevent the issues (without replacing the clusters/etc.) but I figured I could take a whack on the recovery side.

What made recovery take so long?  Well, it was a very manual process, for one.  No one was ever left alone, either, so in recovering many dozens of nodes, you'd likely get interrupted quite a bit (no small number of requests relating to the recovery itself, plus the usual interruptions).  There were also classes of servers and services that had to be kept off the same physical box as one another, so if you were interrupted enough, you might have to backtrack to see where you were...

I started by querying each of the physical servers in the cluster, asking each one how many CPUs it had, what its total RAM was, etc.  The first line below uses a script we have that lets us run arbitrary commands on classes of servers (one option is --subenv (logical environment), another is --loc (physical location)).


@xenhosts=`/opt/prodsa/bin/ssh-hosts --command 'hostname; xentop -i 1 -b|grep Mem' --subenv $virt`;
@xenstate=`ssh $xenhost clustat`;
sub setup {
 foreach $line (@xenhosts) {
  chomp ($line);
  if ($line=~/Mem/) {
   # xentop's Mem line: "<n>k total, <n>k used, <n>k free    CPUs: <n>"
   $line=~s/\D+(\d+)\w\stotal, (\d+)\w\sused, (\d+)\w\sfree\s+CPUs:\s(\d+).*/$1 $2 $3 $4/;
   $mem{$hostname}=$1;    # total RAM on this physical host
   $free{$hostname}=$3;   # free RAM
   $cpu{$hostname}=$4;    # CPU count


Yeah, I grab the info from remote xentop commands and then parse out the data.  If that format changes, the code needs to change.  Remember, though, this is all a hack ;)  Ideally, I shouldn't even need to do any of this...

I skip $2 (used RAM) because $1 minus $3 is $2.  I was initially displaying it, just because...but rather than go through and change the regex, I just didn't reference $2.  Welcome to my code ;)



  } else {
   $hostname=$line;
   @xenguests=`ssh $hostname 'egrep "(name|maxmem|vcpu)" /xenconfig/*'` unless (@xenguests);
  }
 }



This portion above attempts to go through the local /xenconfig/ directory on each physical host and grab the VM name, requested amount of RAM and requested number of VCPUs for each virtual server.  If it already has compiled that list (@xenguests), it skips doing it again.




 @sorted = sort { $mem{$b} <=> $mem{$a} } keys %mem; 



Simple sort of the physical host hash, from most total RAM to least.



 foreach $vm (@xenstate) {
  chomp ($vm);
  # clustat service lines look roughly like: " vm:<name>   <owner host>   <state>"
  $vm=~s/.*:(\S+)\s+(\S+)\s+(\S+)/$1 $2 $3/;
  ($xvm,$xhost,$state)=($1,$2,$3);
  $xenstate{$xvm}=$state;
  print "VM $xvm on $xhost is $state\n" if $xvm;
 }
}


Another kludgey hash, capturing the state of every VM on every physical host (migrating, running, disabled, etc.). This script will also recover random nodes that are down, so it's not just limited to recovering a completely down cluster...that's just how we wound up using it most of the time ;)


sub sort_vms {
 while (@xenguests) {
  $name=shift(@xenguests); chomp $name;
  $mem=shift(@xenguests); chomp $mem;
  $cpu=shift(@xenguests); chomp $cpu;
  $name=~s/.*:name = "(\w+)".*/$1/;
  $mem=~s/.*:maxmem = (\d+).*/$1/; $mem=$mem*1000;   # scale maxmem up to roughly match xentop's units
  $cpu=~s/.*:vcpus = (\d+).*/$1/;
  if (($xenstate{$name}=~/disabled/)||($xenstate{$name}=~/failed/)) {

I don't know why I didn't shift ($name,$mem,$cpu) in one whack.  Hell, I don't know why I didn't rip everything out and structure the data differently to begin with.  Again, welcome to my code.  It's sort of stream-of-consciousness.  Forget "sort of"...it is stream-of-consciousness programming.

This is why I'm doing this blog as an exercise to revisit my thought processes and try to improve...because holy hell, can I use it ;)  I try to tackle problems on the fly here.  Once things work, any intention to go back and pretty things up or do things right...yeah, if it didn't make it into the first iteration, there's rarely a second.  Elsewhere, I had time and support to properly address issues.  Here?  It just had to work...

   # Uncomment the next line to weigh host choice by free memory - leave commented out for round robin
   #@sorted = sort { $free{$b} <=> $free{$a} } keys %free;

I had wanted numerous options in doling out the VMs to the physical hosts.  Realistically, though, spreading them out via round-robin physical server selection was the cheap way of ensuring that conflicting classes of servers (those that used a lot of multicast) would automatically not land on the same host as others within the same class.

...so long as there were X physical hosts and less-than-X conflicting VMs ;)  Otherwise, they'd wrap around in the assignment queue.

   $target=shift(@sorted);
   print "Trying to fit $name on $target...";
   $count=1;
   until ($free{$target} > $mem) {    # keep trying hosts until one has enough free RAM
    push(@sorted,$target);
    print "$count/".scalar(@sorted).": Not enough RAM for $name ($mem) on $target ($free{$target}), trying again...\n";
    $target=shift(@sorted);
    $count=$count+1;
    die "\nNeed SA intervention: cannot find a home for $name due to lack of free space on physical hosts\n" if ($count > scalar(@sorted));
   }

This is where the selected VM is being tested for fit (with the VM's requested RAM being compared to the server's available RAM), physical host by physical host.  I haven't made it clear, yet, but no action is actually being taken to start the VMs...this is all just trying to make those assignments ahead of actually doing the work.  If a VM can't fit on a host, it moves to the next...if it has checked all available physical hosts, the script dies and asks for manual intervention.

   $free{$target}=$free{$target}-$mem;           # deduct the VM's RAM from the host's free pool
   $guests{$target}=$guests{$target}." $name";   # record the assignment
   # Comment out the next line if the re-sort above is uncommented (push back = round robin; re-sort instead = weigh by free memory)
   push(@sorted,$target);
   print "...fitted\n";
  } else {
   $skipped=$skipped."$name ";
  }
 }
}

Subtract the VM's requested RAM from the target physical host's available RAM, since the VM fit above.  Add the VM name to a list of what will run on the target host (in a hash).  If a VM was skipped for whatever reason (already running, etc.), add it to a string tracking skipped VMs.

sub show_layout {
 foreach $key (sort keys %guests) {
  print "$key => $guests{$key}\n";
  @vms=split(/ /,$guests{$key});
  foreach $vm (@vms) {
   print "Starting $vm...\n";
   print "ssh $master clusvcadm -e vm:$vm -m $key\n";# if ($vm);
   print "\nHit <ENTER>:\n";
#   $userinput =  <STDIN>;
  }
 }
 print "Skipped: $skipped\n";
 print "I would've used $master\n";
}

This subroutine is labeled "show layout"...it goes through the list of VMs and their assigned target hosts, then prints the ssh command that'll start each one up on its host.  The idea was that you'd have a list of commands that you could manually walk through or run as a batch.


Then I changed the print statement to just be backticks with the ssh command, followed by the <STDIN> input, waiting for the user to hit <ENTER> to cycle through VMs, one by one.
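
Roughly, that interim version of the inner loop looked like this (reconstructed from memory, using the same $master, $vm and $key variables as above):

    `ssh $master clusvcadm -e vm:$vm -m $key`;   # actually start the VM on its assigned host
    print "\nHit <ENTER>:\n";
    $userinput = <STDIN>;                        # pause before moving on to the next VM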


Then I commented out the input line, so the ssh commands would just run sequentially, as fast as the VMs would start up.


And that's how I got the recovery process automated down to about 2 hours ;)  VMs sorted out about as well as could be expected, with no host attempting to fit more VMs than it could handle (RAM-wise, anyway...VCPUs we over-committed, often egregiously).


It's not quite no-nonsense, but there's a whole lot that I'm not doing, too.  Gathering the data could be done differently, starting up the VMs could be done differently, there could be a hell of a lot more code wrapped around everything to make it more resilient and safer...but I was coding for an average use case that has so far been exactly what we needed.

Fast and loose.


Err...

I thought, earlier today, that I should dig into some of the stuff I've coded, not only to remember what I've done, but to also revisit and consider improvements.

Then I remembered that I had this blog.

Then I realized I had one post on it.  That was almost 2 years ago.

Oops.

So, yeah, look at those good intentions, eh?


Saturday, April 9, 2011

Kicking it off...

So, yeah...in addition to Twitter, Facebook, LJ, DW and whatever else is out there reflecting my thoughts or activities online, I now have this.

This blog, though, will just be about my forays into programming beyond what I already know...for now, likely Ruby.  I saw some nice API examples on my Motorola Xoom running Honeycomb.

We'll see ;)