Friday, March 29, 2013

Another script to address efficiency...

We have a sort-of-unfunded-mandate archive server that started out small and quickly grew once people saw it existed.  Product Owners started telling customers that we could keep data (both raw input and processed output), which could prove useful, especially for determining whether what they (and their customers) saw matched what we were pushing to them, and where along the chain (source -> us -> them -> their customers) it differed, if it did.

It quickly grew from about 1TB to 18TB.  We go through a lot of data every day.  Few projects were actually willing to pay for the storage, though, let alone for proper supporting infrastructure (including any redundancy or the like) and any sort of client interface to the data or administrative tools to manage everything.

As things grew (rather quickly), we were keeping pace on the archiving side of things.  We do staggered archival rsyncs from a few dozen hosts to the (single) archive server, spread out across the hour and recurring every hour.  Easy enough.  System load never really climbs above 5, and that's during an average heavy period with numerous rsyncs going at once (even with the syncs spread out across the hour, 6+ will overlap at any given time).
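The staggering itself is just hourly cron entries offset across the hour; something along these lines (the hostnames and paths here are made up for illustration, not the real feeds):

    # Staggered hourly pulls: one feed host per slot, offsets spread across the hour.
    # Hostnames and destination paths are placeholders.
    5  * * * * rsync -a feedhost01:/data/outbound/ /archive/feedhost01/
    10 * * * * rsync -a feedhost02:/data/outbound/ /archive/feedhost02/
    15 * * * * rsync -a feedhost03:/data/outbound/ /archive/feedhost03/
    # ...and so on around the hour for the remaining hosts.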

One big problem, though, was in cleaning up.  At first, someone was culling data via stages of find commands, intending to cut some amount of the data over each iteration (which was 7 days or something similar).

Well, there were problems with that.

If I remember correctly, the cleanup worked on timestamps.  First round would cut anything matching even-numbered minutes, second round would cut anything with a timestamp ending in 0 or 5...

One problem was that the files weren't exactly evenly distributed that way (we might have 10 files in the 03 minute and 1 file in the 04 minute).  Another issue was that a lot more data was being cut than intended, since half of the "0 or 5" data was already gone (the *0 minutes, being even-numbered, were cut in the first round).  Often, clients would wonder why they had a lot less to work with...
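Roughly reconstructed, the stages looked something like this, assuming the minute was encoded at the end of each filename (the real naming scheme and paths are a guess, so treat this purely as illustration):

    # Round one: drop anything from an even-numbered minute.
    find /archive -type f -name '*[02468].dat' -delete
    # Round two: drop anything from a minute ending in 0 or 5.  The 0s are
    # already gone from round one, so the rounds overlap rather than each
    # working through a fresh slice of the data.
    find /archive -type f -name '*[05].dat' -delete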

So, when I was working on the archive server, I went to pure timestamp purges.  We would retain as much as we could (space allowing) and then tweak down from there to allow for some growth, weighing retention levels against data size, criticality and so forth.  Some feeds had 7-day retentions; others could get 14 days or 40 days (which sort of just wound up being a magical number that worked rather than 30 or 45).
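That purge presumably boils down to one find per feed, keyed on modification time; a rough sketch, with the feed names, paths, and retention map all invented for the example:

    #!/bin/bash
    # Per-feed retention purge, driven by mtime.  Feed names and paths are
    # placeholders; the day counts mirror the 7/14/40-day tiers above.
    declare -A retention=(
        [feed_alpha]=7
        [feed_bravo]=14
        [feed_charlie]=40
    )

    for feed in "${!retention[@]}"; do
        # This walks the entire feed tree on every run, which is the part
        # that eventually stretched to most of a day across ~18TB.
        find "/archive/$feed" -type f -mtime +"${retention[$feed]}" -delete
    done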

Except things kept growing.  If you've ever tried to run finds over huge numbers of files in arbitrary numbers of directories across, say, 18TB of storage...

In recent months, it was taking the better part of a *day* for the cleanup scripts to run.  These scripts ran once per day as it was...so taking that long was just untenable.  It also pointed to a future issue: with data (and storage) likely to increase, those 14-hour finds would start taking even longer, pretty quickly.  Not to mention that system load was starting to average in the 10-30 range for most of the day.

So I came up with an idea.  We're already rsyncing the data...why not capture the rsync output and use it as a manifest?  Then come around every day and _just check the timestamp on those manifests_.  If a feed has a retention policy of 7 days, the cleanup just picks up any manifest for that feed that is over 7 days old, iterates over the _list of files in the manifest_, and deletes those.  The find command never needs to dig down into 5 or more levels of directories looking for old data; it only has to check the single directory that houses all of the manifests.
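A minimal sketch of the two halves, assuming a reasonably recent rsync for --out-format; the paths, feed name, and manifest naming scheme are all invented for illustration:

    #!/bin/bash
    # Archive side: capture the list of files each rsync actually transferred
    # as a dated manifest alongside the data.
    stamp=$(date +%Y%m%d%H)
    rsync -a --out-format='%n' feedhost01:/data/outbound/ /archive/feed_alpha/ \
        > "/archive/manifests/feed_alpha.${stamp}.manifest"

    # Cleanup side: only the (single) manifest directory gets scanned.  Any
    # manifest older than the feed's retention has its listed files removed,
    # then the manifest itself is dropped.
    retention_days=7
    find /archive/manifests -name 'feed_alpha.*.manifest' -mtime +"$retention_days" |
    while IFS= read -r manifest; do
        while IFS= read -r relpath; do
            # rsync also lists directories it creates; only remove regular files.
            [ -f "/archive/feed_alpha/$relpath" ] && rm -f "/archive/feed_alpha/$relpath"
        done < "$manifest"
        rm -f "$manifest"
    done

(Any empty directories left behind can be swept up separately with a cheap find -type d -empty -delete pass, if it matters.)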

14-hour find commands purging old data became mere seconds.  System load is down to under 1, steady state.
