Oct 26 2009

Chef Shawn by day, Puppet master by night

After my first post about GoGrid and the lack of server images at the time, I was kindly advised that real server admins use tools like Puppet, cfengine and Chef to handle this.  Now, if I were a real server admin this would have been quite the blow to my ego; luckily I’m not, and I just play one on the internet.  That said, this advice did cause me to look into Puppet, Chef and cfengine, and I liked what I saw.

I read through the project pages quickly and decided to start with Puppet for some reason.  So off I went setting everything up.  My first plan was to test with one “class” of server that I need to scale up and down often.  Creating my scripts was somewhat of a challenge; a lot of a challenge, actually.  I ended up buying a book on Puppet, which helped quite a bit, and after a few days (yes, days!) I finally had everything working to load and update a server using Puppet.  This wasn’t encouraging, since I still didn’t feel like I knew Puppet well enough to fly through the other server types.  In fact, as sad as it sounds, I was still pretty confused most of the time.

Puppet did run well, though, and interfaced nicely with our Nagios monitoring, so all was good, but I was convinced I could do better.  Well, not me exactly, but rather the people who spend time building this great software; and not better exactly, but more suitable for my skills and mastery of point and click.

So along comes Chef.  Chef was EASY.  Really easy.  I felt like I flew through setting up my cookbooks, and the interface made me very happy, as I could show off not only my pointing and clicking skills to my colleagues but also my dragging and dropping capabilities.  Chef soon replaced Puppet in our network and things were smooth sailing until the day I call Slow Thursday.

Slow Thursday started out normal enough, until about 3pm, when suddenly the alerts started coming in: “Server 1 slow response”, “Server 10 slow response”, etc.  All my servers under Chef were responding dreadfully slowly or not at all.  Now, not being a real admin, I didn’t exactly put two and two together and get 4.  No, instead I spent until about 4 trying to figure it all out.  It was only when I killed Chef on one of the servers (with the intention of restarting it) and got an immediate speedup that I realized the culprit.  Chef was killing my servers.  I restarted the client on the server and it went slow again shortly after.  So I started looking at the server.  After much Googling, hmmm-ing, thinking and restarting of all Chef-related servers, I finally just restarted the whole server and started Chef on it again.  Everything on the clients went back to normal.

That was weird, I thought, but no big deal, until it happened again, and again.  Now, Chef is still new software, so I should and did expect some glitches, and this was fine.  It just meant I couldn’t use it at the present time to manage my servers.  I still use it for the most important task: the initial configuration.

Scorecard time:

Puppet

[CON] Hard for a newbie like me to configure and manage (I’ve heard good things from more experienced users though)

[PRO] Seemed solid while we ran it

[PRO] Easy to monitor with Nagios thanks to existing scripts

[PRO] Book exists!

Chef

[PRO] Cool interface

[PRO] Lots of easily found cookbooks

[PRO] Easy for a newbie to create cookbooks

[CON] Still has some stability issues

End result: Tie!

Seriously though, they are both great pieces of fairly young software.  Many thanks to all the developers on both projects for taking the time to make my life a heck of a lot easier.  I look forward to them both maturing further over the years and getting to use them time and time again.


Oct 22 2009

Balancing with Zeus ZXTM

Recently I started to run into issues with my load balancing solution.  I was running HAProxy on a GoGrid instance and it was working pretty well.  Eventually, though, as our traffic went up, problems started to appear.  After completing as many networking optimizations as I could, it was clear that I needed to find another solution.  I was debating between HAProxy on a dedicated server at ServePath, a self-managed hardware load balancer, or a managed load balancer from ServePath when another contender entered the ring.

The nice people at Zeus set me up with a trial of their ZXTM load balancing solution, so I figured it was worth a shot.  Setup on a fresh GoGrid instance was pretty easy (well, OK, very easy).  I made some networking configuration optimizations, then went about setting up my pools.  To ensure it wouldn’t just completely die with our traffic, I set it up as a server under our existing HAProxy setup, then gradually ramped up the percentage of requests going to it.  Once it was almost at 100%, I made preparations and swapped it out with our HAProxy system.
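For anyone curious what that gradual ramp-up looks like in practice, here is a rough Python sketch.  It is not my actual configuration; it assumes the balancer in front splits requests in proportion to per-server weights, and it treats all the existing backends as one “old” pool.  The function names (ramp_schedule, observed_share) and the percentages are made up for illustration.

```python
import random


def ramp_schedule(percentages, total=100):
    """Yield (new_weight, old_weight) pairs, one per target percentage.

    Assumes traffic is split in proportion to the two weights, so a new
    server with weight 5 out of a total of 100 sees roughly 5% of requests.
    """
    for pct in percentages:
        if not 0 <= pct <= 100:
            raise ValueError(f"percentage out of range: {pct}")
        new_weight = round(total * pct / 100)
        yield new_weight, total - new_weight


def observed_share(new_weight, old_weight, requests=10_000, seed=0):
    """Simulate weighted picks and return the fraction sent to the new server."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same answer
    picks = rng.choices(["new", "old"], weights=[new_weight, old_weight], k=requests)
    return picks.count("new") / requests


if __name__ == "__main__":
    # Hypothetical ramp: start tiny, watch the graphs, then step up.
    for new_w, old_w in ramp_schedule([1, 5, 25, 50, 95]):
        share = observed_share(new_w, old_w)
        print(f"new={new_w:3d} old={old_w:3d} simulated share={share:.3f}")
```

The point of stepping through a schedule like this rather than flipping straight to 100% is that each step is small enough to back out of if the new box starts to struggle.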

There was immediately a noticeable improvement in latency, which, if you’ve read my blog before, you’ll know is very important to me.  The interface was a pleasure to work with as well, enabling me to easily monitor traffic levels and issues on our operations screen.  After setup it was pretty much left on its own while I continued investigating the other solutions.

Performance-wise it went very well.  During the testing period we peaked at, I believe, around 4000 requests per second and commonly ran at over 2000 per second for hours at a time.  While we had some slowdowns during this time, they weren’t anything dramatic and probably had more to do with running it on a virtual server than with Zeus itself.

ZXTM also offers a pretty cool ability to move some application logic forward into the load balancer.  Though it wasn’t suitable for my needs, I could certainly think of a lot of uses in other situations.

In the end, my month with ZXTM was certainly a good experience and I can strongly recommend their software.  As for my setup, I ended up going back to HAProxy, only this time running it on dedicated hardware, and it is still going strong.