Building a New Transcoder Service (Part 1)

The company I work for allows users to record videos of their tests.  These recordings are stored on Amazon S3, and users can download them whenever they like.

Unfortunately, the way this all works requires a lot of resources; VNC sessions are recorded and transcoded on the same boxes that serve those VNC sessions to users, boxes which also handle other services.  With usage only slightly above what we typically see in a day, those boxes can get overwhelmed and the whole system can come to a grinding halt.  Obviously, we don't want that to happen.

So I built a system that offloads the transcoding.  For performance reasons we can't move the recording itself, but recording is the lightest part of the whole process, so that's not really a problem.

Videos are still uploaded to S3, but there's a feature that makes this whole process easier: S3 can publish a message to SQS with information about each new file.  On top of that, we can have it publish these messages only for certain file types; in our case, we only care about new .flv files.
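Wiring that up is just a bucket notification configuration.  A minimal sketch with boto3 looks something like the following; the bucket name and queue ARN here are placeholders, not our real ones:

    import boto3

    s3 = boto3.client("s3")

    # Ask S3 to drop a message on our SQS queue whenever a new .flv object appears.
    # "recordings-bucket" and the queue ARN are placeholders for illustration.
    s3.put_bucket_notification_configuration(
        Bucket="recordings-bucket",
        NotificationConfiguration={
            "QueueConfigurations": [
                {
                    "QueueArn": "arn:aws:sqs:us-east-1:123456789012:new-flv-files",
                    "Events": ["s3:ObjectCreated:*"],
                    "Filter": {
                        "Key": {
                            "FilterRules": [{"Name": "suffix", "Value": ".flv"}]
                        }
                    },
                }
            ]
        },
    )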

So when an FLV file is uploaded, a message gets added to the queue telling us the file is there.  On its own, this is pretty useless; we know the file exists, but nothing is doing anything with it yet.

I ended up writing a tiny Python application that works off this queue to do the transcoding.  It pulls a message from the queue, which contains the file's key within the S3 bucket, retrieves that file from the bucket, and runs it through ffmpeg to convert it to a web-ready MP4.  The MP4 is then uploaded to S3 beside the source FLV, a message is sent to our API to say the conversion completed successfully, and the process loops.
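The real service has more error handling than this, but the core loop is roughly the following sketch.  The queue URL, scratch paths, ffmpeg flags, and the API endpoint are illustrative placeholders, not our production values:

    import json
    import subprocess

    import boto3
    import requests

    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/new-flv-files"  # placeholder
    API_URL = "https://api.example.com/conversions/complete"                      # placeholder

    sqs = boto3.client("sqs")
    s3 = boto3.client("s3")

    while True:
        # Long-poll the queue for the next "new .flv file" notification.
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
        for message in resp.get("Messages", []):
            record = json.loads(message["Body"])["Records"][0]
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]      # e.g. "some/test/recording.flv"

            # Pull the FLV down and transcode it to a web-ready MP4.
            src, dst = "/tmp/input.flv", "/tmp/output.mp4"
            s3.download_file(bucket, key, src)
            subprocess.run(
                ["ffmpeg", "-y", "-i", src, "-c:v", "libx264", "-movflags", "+faststart", dst],
                check=True,
            )

            # Upload the MP4 next to the source file and tell our API it's done.
            mp4_key = key.rsplit(".", 1)[0] + ".mp4"
            s3.upload_file(dst, bucket, mp4_key)
            requests.post(API_URL, json={"key": mp4_key})

            # Only delete the message once the whole conversion succeeded, so a
            # failed run leaves the message on the queue to be retried.
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])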

Compared to other services that do transcoding, this is *insanely* cheap.  Amazon's Elastic Transcoder, for example, charges 3 cents per minute of HD video (all of our videos are at least 720p), and other services run between 60 and 80% of that price.  That doesn't seem like much at first, but take into account that our users generate roughly 300,000 videos a month: even if every single video were only 30 seconds long (a laughably low estimate, to be honest), that works out to $4,500 a month just for the transcoding service.  A more reasonable estimate of 2 minutes per video, since many of our videos sit at the extremes of either 10 minutes or under a minute, puts it at $18,000/month.
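Spelled out, the math is simply:

    videos_per_month = 300_000
    rate_per_minute = 0.03                              # Elastic Transcoder's HD price

    print(videos_per_month * 0.5 * rate_per_minute)     # 30-second videos: $4,500/month
    print(videos_per_month * 2.0 * rate_per_minute)     # 2-minute videos: $18,000/month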

This transcoder service runs on a pair of t2.xlarge instances.  Total price per month for us to keep up with 300k videos?  About $300.  I assure you, there are no numbers missing from that; each instance is right around $140/month.  Even against the extremely low estimate of 30 seconds per video, that's a savings of about 94%.  A week of development and testing, and we were able to offload a CPU-intensive task from sensitive infrastructure to a place that, really, doesn't care how much we throw at it.  Those two boxes, despite being fairly small, churn through the videos and keep up quite handily.  Usually, anyway; in the next part, I'll show what happens when videos come in too fast for those two boxes to keep up.

Writing a library in Go

At work, we do a lot with HAProxy.  I mean, a lot: our HAProxy configuration is over 6,000 lines.  We've been looking into ways to pare this down, but that still leaves us with one issue: how can we even start to track what's going on in this system at any given moment, when there are so many moving parts?

We’ve started using the ELK stack (though it’s now called the Elastic Stack by Elastic, the company that really makes most of it), but only for logging API calls in our stack.

HAProxy allows you to create a "stats socket", a unix domain socket you can connect to and send commands over, either to control parts of HAProxy itself or, more usefully for this case, to get statistics for every server, listener, backend, and frontend.  The problem is that some setups run multiple HAProxy threads to handle high load, which leaves us with multiple stats sockets whose numbers have to be aggregated.
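Before getting to the library itself, it helps to see what the socket actually speaks.  This is just the raw protocol, sketched in Python rather than Go; the socket path is an assumption and has to match the "stats socket" line in your haproxy.cfg:

    import csv
    import io
    import socket

    SOCKET_PATH = "/var/run/haproxy.sock"  # assumption: whatever haproxy.cfg configures

    # Connect to the stats socket, ask for stats, and read until HAProxy closes the connection.
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect(SOCKET_PATH)
    sock.sendall(b"show stat\n")

    raw = b""
    while chunk := sock.recv(4096):
        raw += chunk
    sock.close()

    # The response is CSV whose header line starts with "# pxname,svname,...";
    # every row is one frontend, backend, listener, or server.
    rows = list(csv.DictReader(io.StringIO(raw.decode().lstrip("# "))))
    print(len(rows), rows[0]["pxname"], rows[0]["svname"])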

I've been experimenting with Go lately, so I started writing a tool to handle this data.  The result is a library, HAProxyGoStat, that makes the data easier to work with.  It parses the CSV format HAProxy outputs by default (the other formats, such as the JSON output and the split-out format, are more verbose and generally harder to parse, in addition to being unsupported on older versions of HAProxy).  You can find the library here: https://github.com/hmschreck/HAProxyGoStat.

As a demonstration of just how fast this library is: in our current test environment, HAProxy has 4 stats sockets, and each one reports about 1,550 stat lines (every server, listener, backend, and frontend gets one).  Each line is composed of 82 fields, which means a single set of socket 'snapshots' contains over half a million values.

The aggregation step combines those snapshots into one, either passing each field through or taking its average, max, or sum, and returns a single snapshot at the end.
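HAProxyGoStat does this in Go; purely to illustrate the idea, here is a rough Python sketch.  The field names come from HAProxy's CSV output, but the sum/max/passthrough split below is a made-up subset, not the library's actual rules:

    # Combine the same proxy/server row from several per-socket snapshots into one.
    # 'scur' (current sessions) and 'stot' (total sessions) are summed across sockets,
    # 'smax' (max sessions) takes the max, and everything else just passes through.
    SUM_FIELDS = {"scur", "stot"}
    MAX_FIELDS = {"smax"}

    def aggregate(rows):
        combined = dict(rows[0])
        for row in rows[1:]:
            for field, value in row.items():
                if field in SUM_FIELDS:
                    combined[field] = int(combined[field]) + int(value)
                elif field in MAX_FIELDS:
                    combined[field] = max(int(combined[field]), int(value))
        return combined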

In a test that creates a parser, reaches out to each of the four sockets simultaneously, builds a set of snapshots, and then filters it down, the entire program runs in 0.22-0.27 seconds, hovering around 0.25.  Keep in mind this includes several initialization steps, such as creating the parser, that a properly daemonized version wouldn't need to do.

That's half a million values processed down to a single snapshot in a quarter of a second.  I can literally run this every second and it will not back up.  That's *fast*.

Python Queues

Thanks to my job, I found myself needing a way to get tasks into and out of a Python script.  Especially when I'm building something we'll be sending tasks to over a longer period, I need something a bit more solid.

Python queues work pretty simply.  You create a queue, then have a worker pull tasks off the top of it.  The worker works the task, notifies the queue that the task is complete, and then pulls the next task.  At any point in this process, you can add more tasks.
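A stripped-down version of that flow, using the standard library's queue module with a thread as the worker, looks something like this:

    import queue
    import threading

    tasks = queue.Queue()

    def worker():
        while True:
            task = tasks.get()           # pull the next task off the top of the queue
            print(f"working on {task}")  # do the actual work here
            tasks.task_done()            # tell the queue this task is finished

    # Start a single worker; more threads pointed at the same queue means more workers.
    threading.Thread(target=worker, daemon=True).start()

    # Tasks can be added at any point, before or after the worker starts.
    for i in range(5):
        tasks.put(i)

    tasks.join()  # block until every queued task has been marked done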

Speaking @ Memphis Ruby Users Group

I recently spoke at the Memphis RUG about my most recently completed project.  Unfortunately, due to an ongoing concern at work regarding intellectual property deals with a competitor, I am unable to publish the slides in their entirety at this time.

I spoke on building an imaging system using Ruby, Rails, and Linux.  If anyone else would like to see the slides or have me present, feel free to contact me.

ROI on Tech Projects (Part 1)

Recently, I was asked to calculate the ROI of an internal project at work, and after I ran the numbers, I discovered not only that the project was incredibly financially sound, but also how much value there is in being a technical person who can do this sort of analysis.

In this part, I'll start off with some basic terminology and give you a rundown of how ROI calculations work and what they're useful for.  This is something anyone with a technical inclination can easily pick up, and it's the sort of knowledge that can improve a career, make work less stressful, and help you make the case for why a project is a good (or bad) idea.