
Monitoring Varnish for random crashes

I'm using Varnish to cache the frontend of a site that a client is busy promoting. It does a great job of reducing requests to my backend, but it is prone to random crashes. I normally get about two weeks of uptime on this particular server, which is significantly less than on the other servers where I've deployed Varnish.

I just don't have enough information to work out why the crashes occur. The system log shows that a child process stops responding to the CLI and is killed. After that, the child never seems to come back up successfully.

My /var/log/messages file looks like this:

 08:31:45 varnishd[7888]: Child (16669) not responding to CLI, killing it.  
 08:31:45 varnishd[7888]: Child (16669) died signal=3  
 08:31:45 varnishd[7888]: child (25675) Started  
 08:31:45 varnishd[7888]: Child (25675) said Child starts  
 08:31:45 varnishd[7888]: Child (25675) said SMF.s0 mmap'ed 1073741824 bytes of 1073741824  
 08:32:19 varnishd[7888]: Child (25675) not responding to CLI, killing it.  
 08:32:21 varnishd[7888]: Child (25675) not responding to CLI, killing it.  
 08:32:21 varnishd[7888]: Child (25675) died signal=3  

That doesn't give me a lot to work with, and I couldn't find anything in the documentation about this sort of problem. I don't want to uninstall Varnish, so I decided to look for a way to monitor the process instead.
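To get at least a rough picture of how often this happens, the kill and death events in the log excerpt above can be counted with grep. This is only a minimal sketch: the function names, the `LOG` variable and the `--run` switch are my own assumptions, and the patterns are taken from the log lines shown above.

```shell
#!/bin/sh
# Sketch: summarise Varnish child kills recorded in a syslog file.
# The default log path, function names and --run switch are assumptions.

LOG="${LOG:-/var/log/messages}"

# Number of times the manager killed a child that stopped responding to CLI.
count_cli_kills() {
    grep -c 'not responding to CLI, killing it' "$1"
}

# Number of recorded child deaths, whatever the signal.
count_child_deaths() {
    grep -c 'Child ([0-9]*) died signal=' "$1"
}

# Only print a report when called with --run, so the file can be sourced safely.
if [ "${1:-}" = "--run" ]; then
    echo "CLI kills:    $(count_cli_kills "$LOG")"
    echo "child deaths: $(count_child_deaths "$LOG")"
fi
```

Run against the excerpt above, this would report three kills and two child deaths.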

I first tried Monit, but after about two weeks my site was down. After SSHing into the box and restarting Varnish, I checked the Monit logs. Although Monit recognized that Varnish had crashed, it was not able to bring it back up.

My Monit log looked like this:

 [BST Apr 23 09:07:24] error  : 'varnish' process is not running  
 [BST Apr 23 09:07:24] info   : 'varnish' trying to restart  
 [BST Apr 23 09:07:24] info   : 'varnish' start: /etc/init.d/varnish  
 [BST Apr 23 09:07:54] error  : 'varnish' failed to start  
 [BST Apr 23 09:08:54] error  : 'varnish' process is not running  
 [BST Apr 23 09:08:54] info   : 'varnish' trying to restart  
 [BST Apr 23 09:08:54] info   : 'varnish' start: /etc/init.d/varnish  
 [BST Apr 23 09:09:24] error  : 'varnish' failed to start  
 [BST Apr 23 09:10:24] error  : 'varnish' process is not running  
 [BST Apr 23 09:10:24] info   : 'varnish' trying to restart  
 [BST Apr 23 09:10:24] info   : 'varnish' start: /etc/init.d/varnish  
 [BST Apr 23 09:10:54] error  : 'varnish' failed to start  
 [BST Apr 23 09:11:54] error  : 'varnish' service restarted 3 times within 3 cycles(s) - unmonitor  

My problem sounded a lot like this one on ServerFault, so I looked for a way to monitor the process without Monit.

Instead of using daemonize, supervisord, or a similar program, I'm trying out a simple shell script that I found at http://blog.unixy.net/2010/05/dirty-varnish-monitoring-script/. The author says it's dirty, and I suppose it is, but it has the advantage of being dead simple and easy to control. I've set it up as a cron job to run every five minutes. Hopefully this will be a more effective way to make sure that Varnish doesn't stay dead for very long.
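For illustration, a cron-driven check of this kind could look roughly like the sketch below. This is not the unixy.net script, just a minimal version of the same idea. The HTTP check with curl, the listen address, the watchdog log path and the `--run` switch are all assumptions; the init script path is the one that appears in the Monit log above.

```shell
#!/bin/sh
# Sketch of a cron-driven Varnish watchdog. NOT the unixy.net script.
# Every default below is an assumption; adjust for your own system.

# Succeeds if Varnish answers an HTTP request at all (address and port assumed).
CHECK_CMD="${CHECK_CMD:-curl -s -o /dev/null --max-time 10 http://127.0.0.1:80/}"
# Init script path as seen in the Monit log.
RESTART_CMD="${RESTART_CMD:-/etc/init.d/varnish restart}"
# Hypothetical location for the watchdog's own log.
LOG_FILE="${LOG_FILE:-/var/log/varnish-watchdog.log}"

ensure_running() {
    # Nothing to do while the check succeeds.
    if $CHECK_CMD >/dev/null 2>&1; then
        return 0
    fi
    # Record the failure, then attempt a restart and keep its output.
    echo "$(date '+%Y-%m-%d %H:%M:%S') Varnish check failed, restarting" >> "$LOG_FILE"
    $RESTART_CMD >> "$LOG_FILE" 2>&1
}

# Only act when called with --run, so the file can be sourced safely.
if [ "${1:-}" = "--run" ]; then
    ensure_running
fi
```

Assuming the script were saved at a hypothetical path such as `/usr/local/bin/varnish-watchdog.sh`, a crontab entry of `*/5 * * * * /usr/local/bin/varnish-watchdog.sh --run` would run it every five minutes. Checking over HTTP rather than looking for a process is a deliberate choice here: in the log above the manager process stays alive while the child keeps being killed, so a process check alone might not notice the outage.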

In case the source file goes down I saved a copy as a Gist:
