An Interim Solution for Elastic Beanstalk Deployment Failures - First Cloud Consulting

I ran into a problem recently that stumped both me and the AWS Support team – in fact, I still have two open support tickets that AWS is investigating internally. It concerns a large CMS application that has been running on AWS for a few years now without issue and without any recent major changes.

We run multiple production migrations daily – mostly minor enhancements, performance improvements, and the like. The framework is Django, and the application uses numerous AWS services: EC2, RDS, ElastiCache, S3, etc.

What was happening is that the Elastic Beanstalk deployment was “failing”. I say “failing” because the symptoms varied – all of the situations below occurred despite no changes to the application or environment:

1. The migration successfully deploys to “current”, but the application serves HTTP 500 (Internal Server Error) responses with various failures in the log.

2. The migrations fail and are stuck in “ondeck” without ever rolling over to “current”.

3. The migrations are reported as failures in Elastic Beanstalk but eventually succeed without error.

Initially, I did find some problems (worth noting, as they are good things to check):

Problem 1

I have 30-40 dependencies loaded from requirements.txt – one of these was failing because of a version upgrade, and the previous version was no longer supported.

Resolution 1

This was an easy fix – upgrade to the latest version. Fortunately for me there were no compatibility issues, so it was quick. A better solution, if you have many dependencies and know that upgrading won’t always be a quick fix, is to fork your dependencies into your own Git repository and update requirements.txt to pull from your repository rather than a public or third-party one.
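As a sketch of that approach, requirements.txt entries can be pinned to exact versions, or pointed at your own fork via pip’s Git URL syntax (the package names, versions, and repository below are purely illustrative):

```text
# Pin exact versions so a surprise upstream release can't break a deployment
Django==1.6.5
django-storages==1.1.8

# Or install from your own fork (hypothetical organization and tag)
git+https://github.com/yourorg/django-storages.git@v1.1.8#egg=django-storages
```

Pinning alone prevents silent upgrades; forking additionally protects you if the upstream package is deleted or the index is unavailable at deploy time.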

Problem 2

Elastic Beanstalk was not reporting failures. Why? This is a complication of the health check. I basically had two options: a TCP check on port 80 (no content, no database, etc.), or a basic GET request for the home page of one of your sites. My application is multi-domain and only responds to requests addressed to one of its domains. As a result, it would not respond to a GET request for the root of an individual EC2 instance; instead it responded with a 301 permanent redirect to the root domain (which is itself a successful response), even though following the redirect resulted in a server 500 error.

Resolution 2

You have options here. In my case I could not easily modify the application to respond to requests for the individual hostname, and a TCP check would not work. So I wrote the script below. It is not ideal, but it is certainly a good quick fix for intermittent deployment issues while we continue to investigate and devise a more permanent resolution:

# Configuration
APPLICATION_PATH="/opt/python/current"   # Elastic Beanstalk symbolic link
WEBTEAM_EMAIL="webteam@example.com"      # Set to your own team's address

# Local variables - get current path
DIR=$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )

# Check httpd status, start if stopped
if ps ax | grep -v grep | grep 'httpd' > /dev/null; then
    httpd_status="running"
    >&2 echo "Service httpd is running ..."
else
    httpd_status="stopped"
    >&2 echo "Service httpd is not running ..."

    if [ -d "$APPLICATION_PATH" ]; then
        actual_path=$(cd "$APPLICATION_PATH" && pwd -P)
        actual_script_path=$(cd "$DIR" && pwd -P)   # Path where this script is housed

        # Only start httpd if this script belongs to the current deployment
        if [ "$actual_script_path" == "$actual_path/app/scripts" ]; then
            service httpd start
        fi
    fi
fi

# Tests and EC2 instance details (metadata comes from the instance metadata service)
health_check_resp=$(cat "$DIR/health_check.txt" | nc localhost 80)
http_code=$(echo "$health_check_resp" | head -1 | cut -d' ' -f2)
public_hostname=$(wget -q -O - http://169.254.169.254/latest/meta-data/public-hostname)
instance_id=$(wget -q -O - http://169.254.169.254/latest/meta-data/instance-id)

# Uniform report handler
function report_status() {
    # Argument 1 - Status message (for stdout and stderr)
    if [ -z "$1" ]; then
        echo "No arguments passed."
        return 1
    fi
    status_message="$1"
    >&2 echo "$status_message"

    # Argument 2 - Console details (stdout only)
    if [ "$2" ]; then
        console_details="$2"
        echo "$console_details"
    fi

    # Argument 3 - Email content (first line becomes the subject)
    if [ "$3" ]; then
        rm -f /tmp/email_content.txt

        while read -r line; do
            if [ ! -f "/tmp/email_content.txt" ]; then
                echo "Subject: $line" > /tmp/email_content.txt
            fi
            echo "$line" >> /tmp/email_content.txt

            # Blank line after any section header ending in ":"
            if [[ $line = *":" ]]; then
                echo "" >> /tmp/email_content.txt
            fi
        done <<< "$3"

        sendmail "$WEBTEAM_EMAIL" < /tmp/email_content.txt
    fi
}

# Apache test
if [ "$http_code" = "200" ] || [ "$http_code" = "302" ]; then
    report_status "Passed health check ($http_code)" "$health_check_resp"
else
    report_email=$(printf "%s\n" "Failed Health Check on Instance, $public_hostname" \
        "Apache status: [Previous] $httpd_status, [Current] $(service httpd status)" \
        "$health_check_resp" "Recent logs:" "$(tail /var/log/httpd/*)")

    report_status "Failed health check, response: ($http_code) $health_check_resp" "$health_check_resp" "$report_email"

    if [ "$http_code" = "500" ]; then
        ec2kill "$instance_id"   # EC2 API tools alias for ec2-terminate-instances
    fi
fi

So what does this do? It runs as the last “container command” in the .ebextensions configuration file, and it first checks that Apache is running; if not, it starts the service. It then pipes the following input file (health_check.txt) to port 80 on localhost via netcat – substitute your own domain in the Host header:

GET / HTTP/1.1
Host: www.example.com
Connection: close

This executes the request on localhost while still specifying the target domain, bypassing any potential redirects via the Elastic Load Balancer. (The Connection: close header tells the server to close the socket so netcat returns. A valuable feature that should be added to ELB health checks would be the ability to send a raw GET request with a custom Host header for this very reason.)
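To illustrate how the script extracts the status code from the raw response, here is the same head/cut pipeline run against a canned 301 response (the response text is fabricated for the example):

```shell
# A canned raw HTTP response, as netcat would return it (fabricated for illustration)
health_check_resp='HTTP/1.1 301 Moved Permanently
Location: https://www.example.com/'

# Same parsing as the deployment script: first line, second space-delimited field
http_code=$(echo "$health_check_resp" | head -1 | cut -d' ' -f2)

echo "$http_code"   # prints 301
```

Because the status line of every HTTP response has the form `HTTP/1.1 <code> <reason>`, splitting on spaces and taking the second field is a reliable way to get the numeric code.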

In this case (and of course you can modify it to suit your own needs), the script generates useful debugging information, whether executed manually from the command line or, when run on deployment by Elastic Beanstalk, via email. If the server does generate a 500 error, then I know the deployment has failed and the instance self-terminates. Elastic Beanstalk will still recognize the need for an additional server based on your Auto Scaling configuration and any triggers (CPU usage, etc. – whatever you have defined), and will attempt another launch until one succeeds.
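For reference, wiring the script in as the last container command looks roughly like this (the config file name and script path are assumptions; container_commands run in alphabetical order, hence the high-numbered key):

```yaml
# .ebextensions/01_deploy.config (file name illustrative)
container_commands:
  99_health_check:
    command: "bash app/scripts/health_check.sh"
```

Container commands run after the application archive is extracted but before the new version is promoted, which is what lets the script catch a bad deployment early.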

You can take this one step further if you are having intermittent server 500 errors after some time (e.g. the instance runs smoothly for 6 days then starts throwing server 500 errors for an unknown reason). To do so, simply run the script on a cron job at a predetermined interval.
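A minimal way to do that on the instance is a cron.d entry (the interval, paths, and log file are illustrative; note that the cron.d format requires a user field):

```text
# /etc/cron.d/health-check - run the check every 10 minutes as root
*/10 * * * * root /opt/python/current/app/scripts/health_check.sh >> /var/log/health_check.log 2>&1
```

You could install this file from the same .ebextensions config so every new instance gets the recurring check automatically.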


This is not a final solution but rather an interim workaround. If your application is prone to such failures, you could end up in a loop of launching instances that self-terminate. Or, if you launch untested code, you could end up with an entire environment that self-terminates. Use it at your discretion – but while AWS Support and I continue to troubleshoot the root cause for our mutual client, this has proved to be a valuable interim fix.