|
April 14th, 2011
10:00 am - Monitoring
If it's not worth getting up for a page, it's not worth paging for that host or service. If it's not worth paging for something, it's not worth monitoring it. If it's not worth monitoring, it's not worth running.
|
December 17th, 2010
03:10 pm - Monitoring in dynamic environments
I've been thinking about monitoring in dynamic ("cloud") contexts. Increasingly, we're seeing enterprise environments where server instances are spun up during busy times and then shut down once traffic to the site they serve has settled down. Some white papers I've seen even describe powering node controllers (hypervisors) up and down dynamically using Wake-on-LAN.
These new-style environments are more flexible and efficient than the old-style ones, but the open-source system administration world seems to be slow to catch up with supporting software. Specifically, I'm talking about service discovery, monitoring, and configuration management.
Service Discovery
Service Discovery is the term I'm using to describe the process that happens between the time a server is fully configured and the time the load balancer or other application starts sending requests its way. In the old world, service discovery was completely manual: the system administrator would modify a configuration file (or files) and restart the service using that configuration file, and then the service would be live. In the new, dynamic world, the network addresses of a particular service could change depending on site load and other factors. I've seen some stuff on IEEE about using dynamic DNS for service discovery, and there's UPnP, and there's Avahi, but I haven't seen system administrators integrating any of these tools into their daily lives. Instead, the slow march of top-down administration continues.
Monitoring
The situation gets even more dire with monitoring. In a top-down configuration model like the one you see with Nagios, a central server must be configured to monitor each application that is running by network IP address and port. With dynamic environments, machines could be booting up and shutting down throughout the day and night, and a single IP address could be used by many different instances serving different purposes over time. While it is possible to write custom software that reconfigures the monitoring server(s) on every instance startup and shutdown, such software would be non-trivial and would add months to a dynamic environment's roll-out.
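One way to picture the bottom-up alternative is to have each instance announce itself rather than waiting for a central server to be reconfigured. The following is only a minimal in-memory sketch in Python, not something I have seen any existing monitoring tool ship: the `Registry` class, its method names, the instance ids and the 30-second timeout are all made up for illustration. Instances are keyed by an instance id instead of an IP address, so a recycled address does not confuse anything.

```python
import time


class Registry:
    """Hypothetical registry: tracks live instances by role, keyed by instance id."""

    def __init__(self, ttl=30.0):
        self.ttl = ttl      # seconds without a heartbeat before an instance expires
        self._seen = {}     # instance_id -> (role, address, last_seen)

    def heartbeat(self, instance_id, role, address, now=None):
        """Called by an instance at startup and periodically afterwards."""
        now = time.time() if now is None else now
        self._seen[instance_id] = (role, address, now)

    def deregister(self, instance_id):
        """Called by an instance on clean shutdown."""
        self._seen.pop(instance_id, None)

    def live(self, role, now=None):
        """Return the addresses of instances in `role` that checked in recently."""
        now = time.time() if now is None else now
        return sorted(
            address
            for r, address, last_seen in self._seen.values()
            if r == role and now - last_seen <= self.ttl
        )
```

A load balancer or a monitoring poller could then ask `live("web")` for the current set of targets, and an instance that dies without deregistering simply ages out once its heartbeats stop.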
Configuration Management
This is the toughest nut to crack of all of them, and resolving this single issue satisfactorily could simultaneously resolve the first two. There are several hurdles to this problem: the configuration management server itself could be an instance, so nodes would need to know how to look up its current address; the servers being configured are unknown to the configuration management server at the time of instance creation, and most configuration management systems prefer client authentication via certificates as a security measure; and the configuration management system itself needs to know how to assign the clients a role so that they can be configured appropriately.
All of these problems are solvable, but none of them is easy, and I haven't seen the open-source community embrace a solution to them yet. Canonical (the Ubuntu people) sells a service they call Landscape, which includes monitoring and some configuration management, for about $1200 per year, per server.
On a more general level, what we're witnessing is a transition from top-down computing, where multi-server environments have a clearly defined structure and servers have rarely-changing roles, to bottom-up computing, where servers join and leave roles dynamically according to the changing requirements of the network's users. My catch-phrase for this is "Not the foo server, a foo server!"
|
September 30th, 2010
04:40 pm - Tricky iCuke!
I'm writing a set of feature specifications for iPhone integration testing using Cucumber on iPhone, aka iCuke. Today I lost about two hours debugging a strange error. My test:
When I type "errorval" in "Login"
Not much to it. The error I got was somewhat misleading:
No element labeled "e" found in: <screen>
What? Of course there isn't an element labeled "e"! You're supposed to find an element labeled "Login"!
As it turns out, "type 'errorval' in 'Login'" means "find the field labeled 'Login', tap it, wait a second, then tap out 'errorval' on the virtual keyboard that should have come up, character by character."
iCuke was finding a field named "Login" and sending it a tap, but the tap wasn't triggering the virtual keyboard. As a result, it raised an exception only when it went to tap the first letter, "e", and found no key with that label on the screen.
|
August 20th, 2010
12:37 pm This post about old games made me very happy. I don't really care for his Super Mario clone, but the discussion of game evolution and programming reminded me of some familiar experiences.
There really was an entire universe inside those games (on the 2600).
|
July 4th, 2010
12:03 pm - Cooperative Distributed Request Processing in shared-cache situations
Wow, that title seems so formal. I've been thinking about ways to avoid generating an item for cache more than once in situations where traffic rate and item generation time are both high.
Imagine, for a second, that you're LiveJournal. Someone posts a new entry to their blog, let's call it "Someone is wrong on the Internet." It takes a single dispatcher or thread on a single server one second to generate this page. The blog is very popular, however, so 50 requests for the page come during that time. Because the item has not yet completed generation and thus is not in cache, the page is generated 50 more times, tying up 50 dispatchers or threads for one second each. If LiveJournal only had capacity to serve 50 concurrent requests, the site would be down until the page was generated and served 51 times.
The above scenario is gloomy, but it happens all the time. I've been thinking about how to reduce it. It would be far better for the request to cause cache generation only once.
Initially, I was imagining that a lock server of some sort would do it. Request 1 comes in and dispatcher 1 locks the URL for writing. Request 2 comes in and dispatcher 2 tries to get a write lock, fails, and falls back to waiting for a read lock. When dispatcher 1 completes writing the page to cache, dispatcher 2 reads it and serves the cached data. That way, each page could be written only once to cache.
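To make the lock idea concrete, here is a rough single-process sketch in Python. It uses threads and in-memory locks as a stand-in for dispatchers and a real lock server, so treat it as an analogy rather than a distributed implementation; `SingleFlightCache` and the `generate` callback are names invented for this example.

```python
import threading


class SingleFlightCache:
    """Hypothetical cache in which each URL is generated by one caller only."""

    def __init__(self):
        self._cache = {}                 # url -> generated page
        self._locks = {}                 # url -> lock held while generating
        self._guard = threading.Lock()   # protects the two dicts above

    def get(self, url, generate):
        with self._guard:
            if url in self._cache:       # fast path: already generated
                return self._cache[url]
            lock = self._locks.setdefault(url, threading.Lock())
        # The first caller takes the per-URL lock and generates the page.
        # Everyone else blocks here, then finds the cache already filled.
        with lock:
            with self._guard:
                if url in self._cache:
                    return self._cache[url]
            page = generate(url)
            with self._guard:
                self._cache[url] = page
            return page
```

With this in place, 50 simultaneous calls to `get` for the same URL run `generate` once, and the other 49 callers wait and then read the cached page. The per-URL locks are never cleaned up here, which a real version would need to handle.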
Though the first method eliminates cache write duplication, it irks me to think of all those dispatchers sitting around doing nothing while dispatcher 1 is working its butt off. If we added caching of intermediate results to the mix, dispatchers 2 through 51 could divvy up the remainder of the work and all generate the page cooperatively. It's really just the above method with finer-grained locking. Here's how that would work:
Request 1 comes in, dispatcher 1 begins page generation. The page generation process is organized into a series of 10 steps. Dispatcher 1 locks request 1, step 1. Request 2 comes in, so dispatcher 2 requests a write lock on step 1, fails, and requests a write lock on step 2. In that way, dispatchers (or at least the first ten of them) could cooperatively generate the page. There are some details I'm avoiding discussing here, but it could be made to work.
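The step-claiming scheme can be sketched the same way. This Python version again uses threads in one process in place of dispatchers, and it assumes the steps are independent of each other, which is one of the details glossed over above. `StepBoard` and its methods are hypothetical names.

```python
import threading


class StepBoard:
    """Hypothetical claim board for one request: each step can be claimed once."""

    def __init__(self, steps):
        self._steps = steps                            # zero-argument callables
        self._claims = [threading.Lock() for _ in steps]
        self.results = [None] * len(steps)             # cached intermediate results

    def work(self):
        """Run every step nobody else has claimed; return the indices we ran."""
        ran = []
        for i, step in enumerate(self._steps):
            # A non-blocking acquire plays the role of "request a write lock,
            # fail, and move on to the next step". Claims are never released.
            if self._claims[i].acquire(blocking=False):
                self.results[i] = step()
                ran.append(i)
        return ran
```

Each dispatcher calls `work()` and picks up whatever steps are still unclaimed. What this sketch leaves out is how a dispatcher learns that the steps claimed by others have finished, so that it can assemble and serve the page.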
The next thing to consider is how to distribute the request processing load between machines. If we only have one core, it doesn't make much sense to have ten dispatchers compete for that core in their cooperative shared-cache page generation. It would be much faster to have one dispatcher per core across several machines perform the page generation. In that case, one could have a number of tokens given out per dispatcher-url. The token server would say "Thanks for trying, but we already have enough dispatchers on your machine working on this request. Please stand by and wait for us to finish."
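A token server along those lines might look roughly like the following Python sketch. It is in-memory and single-process, and the `TokenServer` name, the machine labels and the per-machine limit are assumptions made for illustration; a real one would sit behind a network interface and would need to cope with dispatchers that die while holding a token.

```python
import threading
from collections import defaultdict


class TokenServer:
    """Hypothetical token server: caps dispatchers per machine per URL."""

    def __init__(self, tokens_per_machine):
        self._limit = tokens_per_machine   # e.g. the number of cores per machine
        self._held = defaultdict(int)      # (machine, url) -> tokens handed out
        self._guard = threading.Lock()

    def request(self, machine, url):
        """Return True if this dispatcher may work on url, False to stand by."""
        with self._guard:
            if self._held[(machine, url)] >= self._limit:
                return False               # "we already have enough dispatchers"
            self._held[(machine, url)] += 1
            return True

    def release(self, machine, url):
        """Give a token back once the dispatcher has finished its share."""
        with self._guard:
            if self._held[(machine, url)] > 0:
                self._held[(machine, url)] -= 1
```

A dispatcher that gets `False` back would fall through to waiting for the finished page in cache instead of competing for a core.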
That's all for now, but if you know something about this stuff, or if you have pointers to information about it, please post in the comments.
|
May 31st, 2010
11:39 am - My Comcast online viewing experience
The goal: watch something via Comcast online. I'm paying for the service, so I might as well be able to use it.
Comcast, people steal content because you have created a situation where the following is the "right" way to watch content they paid for online:
First, go to comcast.com. Click on "watch online". So far, so good. This takes us to fancast.net. Who is that? Is it a different company or just some subsidiary? Is my information safe with them? Ugh, I'll just be a lemming and sign in with my comcast.net e-mail and password. Wait, what? What is comcast.net? Oh yeah, they gave me an ISP email address when I signed up. I have never used it. And it's different than my comcast.com sign-in, the one I use every month to pay my bill online. Great. Let's see if I can guess my username and password...nope. Let's get tech support on the line. Okay, got my password reset. Now I have to change my password. (ten clicks and two captchas later) Done. Now let's see... The Pacific. There it is. (click) Oh, I have to install some software called Comcast Access. Yuck. I wonder what that does (besides auth my computer to comcast). Download the 18MB software, install, 10 click authorization process... done. This computer is now authorized. Back to Fancast. Launch Comcast Access. Oh, I have to authorize the computer again? Great, now I'm using 2 of 3 slots with the same damn computer. I don't use this anywhere else though, I can just handle that later. (at this point, I'm 1 hour and 15 minutes into the ordeal). Launch Comcast Access. Hey, it wants me to authorize this computer again! Oh, I know! Maybe it wants me to use Internet Explorer (I was using Chrome before). Okay, back to Fancast on IE8, 64-bit. Flash doesn't work. Over to Adobe's website. Oh, Flash doesn't work on 64-bit browsers! Good to know. I bet I have a 32-bit version somewhere though. Ah, there it is. Back to Fancast. Sign in. Browse to The Pacific. Click. Oh, I have to upgrade Flash to Player 10. (It's running Flash Player 10, but apparently not a _new enough_ version of Flash Player 10.) Upgrade Flash. Gotta restart the browser. Done, and it helpfully re-navigates me to the page I was on when I had to close the browser. Thank you, Adobe.
Fancast is now asking me to launch Comcast Access (which is running already), and the button doesn't do anything when I click it. Damnation, it's never going to give me this content, and my TV viewing window has closed (the wife is up and making want-coffee noises). Thank you, Comcast; I will never get that time back.
Versus the "bad" way: 1. go to one of the hundreds of sites out there offering shows for download. 2. type in "the pacific". 3. Click on the episode I want. 4. Watch it.
tl;dr: I wasted 90 minutes trying to watch a TV show online.
|
May 28th, 2010
04:40 pm - Some fun Ruby benchmarks
I was writing some code that needs to be fast, so I'm back at the benchmarks. I'm using Ruby 1.8.6 (latest) on Linux.
One thing that has come up a lot recently is the question, "Is item X in array Y?" It's a simple membership test. There are several ways to do it: you can use a case statement, another array, or a hash. Which is fastest? I wrote a quick benchmark to find out and was surprised by the results.
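For anyone who wants to try the same kind of comparison, here is a rough equivalent in Python rather than Ruby. This is not the benchmark from this post and says nothing about its results: the item names, the container size and the `benchmark` helper are made up, a list stands in for the array, a set and a dict stand in for the hash, and the case statement has no counterpart here.

```python
import timeit

# Hypothetical data set: 1000 string items held in three container types.
ITEMS = ["item%d" % i for i in range(1000)]
AS_LIST = list(ITEMS)
AS_SET = set(ITEMS)
AS_DICT = dict.fromkeys(ITEMS, True)


def in_list(x):
    return x in AS_LIST     # linear scan


def in_set(x):
    return x in AS_SET      # hash lookup


def in_dict(x):
    return x in AS_DICT     # hash lookup on the keys


CANDIDATES = {"list": in_list, "set": in_set, "dict": in_dict}


def benchmark(needle, number=10000):
    """Return seconds taken by each candidate to test `needle` `number` times."""
    return {
        name: timeit.timeit(lambda f=f: f(needle), number=number)
        for name, f in CANDIDATES.items()
    }


if __name__ == "__main__":
    for name, seconds in sorted(benchmark("item999").items()):
        print("%-5s %.4fs" % (name, seconds))
```

It is worth timing both a needle near the end of the list and one that is absent, since those are the worst cases for the linear scan.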
|
April 13th, 2010
09:56 am - Bits of sysadmin usefulness
We'll see how long this lasts, but I regularly remember best practice behaviors associated with system administration. Typically, I call it out to whoever is listening around my desk: "Sysadmin best practice #5267: blah blah blah" after which it disappears. Now, I'm going to try writing them down as posts to my journal.
Here's the first (because I dealt with it at 4am today):
When temporarily commenting lines out in a configuration file or script, comment out the entire block, including the comments. For example, let's say you need to disable a set of cron jobs for a few hours. Here's what I would recommend for the fictional situation where you want to stop something from updating for a while (for example, if the thing it is updating from is down and you don't want your mail server to waste cycles on sending error mail):
Before:
MAILTO="cronmail@example.com"
# update stuff at 7 minutes past the hour and 43 minutes past the hour
7,43 * * * * /usr/local/sbin/update-stuff.sh
# remove old stuff in /tmp every 4 hours at 30 minutes past the hour
30 */4 * * * /usr/local/sbin/tmp-cleanup.sh
After:
MAILTO="cronmail@example.com"
# errorval commented 13 April 2010 due to frobnicator outage
# # update stuff at 7 minutes past the hour and 43 minutes past the hour
# 7,43 * * * * /usr/local/sbin/update-stuff.sh
# end errorval 13 April 2010 comment block
# remove old stuff in /tmp every 4 hours at 30 minutes past the hour
30 */4 * * * /usr/local/sbin/tmp-cleanup.sh
It's verbose, but when someone else brings back the frobnicator, they won't have any question about what to un-comment, even if they don't understand crontab.
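If you find yourself doing this often, the convention is mechanical enough to script. Below is a small Python sketch of one way to do it; `comment_out` is a hypothetical helper, not an existing tool, and it works on a list of lines rather than editing the crontab in place.

```python
def comment_out(lines, first, last, who, when, reason):
    """Comment out lines[first..last] (inclusive), wrapped in marker comments.

    Existing comments inside the block are commented out too, so the whole
    block can later be restored as one unit.
    """
    opener = "# %s commented %s due to %s" % (who, when, reason)
    closer = "# end %s %s comment block" % (who, when)
    block = ["# " + line for line in lines[first:last + 1]]
    return lines[:first] + [opener] + block + [closer] + lines[last + 1:]
```

Called on the "Before" crontab with `first=1` and `last=2`, it should produce the "After" version shown above.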
|
March 9th, 2010
10:51 am
16:44:10 hazzard ~ $ sudo tw_cli
Password:
//hazzard> info c0
Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy
------------------------------------------------------------------------------
u0 RAID-1 DEGRADED - - - 232.82 ON OFF
Port Status Unit Size Blocks Serial
---------------------------------------------------------------
p0 OK u0 233.76 GB 490234752 WD-WCANY2080094
p1 DEVICE-ERROR u0 233.76 GB 490234752 WD-WCANY2069612
p2 NOT-PRESENT - - - -
p3 NOT-PRESENT - - - -
//hazzard> maint remove u0 p1
Error: (CLI:003) Specified controller does not exist.
//hazzard> maint remove c0 u0 p1
Removing /c0/u0 ... Error: Could not open file /var/log/tw_mgmt.log
Done.
//hazzard> info c0
Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy
------------------------------------------------------------------------------
Port Status Unit Size Blocks Serial
---------------------------------------------------------------
p0 NOT-PRESENT - - - -
p1 NOT-PRESENT - - - -
p2 NOT-PRESENT - - - -
p3 NOT-PRESENT - - - -
//hazzard> maint rescan c0
Connection to hazzard-m closed by remote host.
Connection to hazzard-m closed.
FUUUUUUUUUUUUUUUUUUUUUUU
|
September 30th, 2009
08:16 am - Tricky Rails POST error
Hi all,
I ran into a tricky Rails issue yesterday and wanted to share the solution.
I had a form along the lines of the following:
<form action="/example/myaction" method="POST">
<input type="hidden" name="object" value="123">
<input type="text" name="object[name]">
<input type="submit">
</form>
Clicking the "submit" button generated the following unhelpful exception:
Conflicting types for parameter containers. Expected an instance of Hash, but found an instance of String. This can be caused by passing Array and Hash based paramters qs[]=value&qs[key]=value.
What does THAT mean? As it turns out, Rails iterates through the posted variables and puts them into the params hash. With the example above, that means that it would try to create a hash named params[:object] (so my text input value would be addressable at params[:object]['name']) and also a string named params[:object] (for the hidden input). Since params[:object] cannot be both a hash and a string, the exception occurs. The fix is simple: change the name on one or the other. The following form works:
<form action="/example/myaction" method="POST">
<input type="hidden" name="object_id" value="123">
<input type="text" name="object[name]">
<input type="submit">
</form>
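To see why the two names collide, here is a toy parameter parser in Python. It is only an illustration of the idea and not how Rails actually implements it: `parse_params` is a made-up function, it handles a single level of `name[key]` nesting, and it raises a plain `TypeError` instead of the Rails exception.

```python
import re

# Matches names of the form outer[inner], e.g. object[name].
NESTED = re.compile(r"([^\[\]]+)\[([^\[\]]+)\]")


def parse_params(pairs):
    """Fold (name, value) pairs from a form post into a nested dict."""
    params = {}
    for name, value in pairs:
        match = NESTED.fullmatch(name)
        if match:
            outer, inner = match.groups()
            container = params.setdefault(outer, {})
            if not isinstance(container, dict):
                # outer was already stored as a plain string
                raise TypeError("expected a dict for %r, found a string" % outer)
            container[inner] = value
        else:
            if isinstance(params.get(name), dict):
                # name was already stored as a nested container
                raise TypeError("expected a string for %r, found a dict" % name)
            params[name] = value
    return params
```

Feeding it the first form's fields, `("object", "123")` followed by `("object[name]", ...)`, trips the first check, while the second form's `object_id` and `object[name]` parse into separate keys.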
The Google was pretty weak in explaining the source of the exception, so maybe this will clear things up for someone, somewhere.
|