Handling Media Libraries
At the core of Radiojar is the music library, the place where music tracks are stored. It has to power anything from fully private, user-managed libraries to fixed-content shared libraries for thematic radio stations, and everything in between, so it needs to be both flexible and efficient. A music library is made up of three things: a database that stores the tracks’ metadata and allows searching the library, the actual music tracks, and a set of additional related files (e.g. transcoded versions, waveform preview images). It needs to be private (some DJs don’t want anyone else to see their hard-to-find exclusive tracks!) and secure (it must shield audio files from being downloaded by end users; all media must only be played through a streaming server).
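To make the three parts concrete, here is a minimal sketch of how a single library entry could be modeled. The field names are illustrative assumptions, not our actual schema:

```python
from dataclasses import dataclass, field

# Illustrative model of the three parts described above: searchable
# metadata, the original audio file, and derived assets such as
# transcoded versions and waveform previews. Names are assumed.
@dataclass
class LibraryTrack:
    title: str
    artist: str
    audio_path: str                               # the original music track
    derived: dict = field(default_factory=dict)   # e.g. {"waveform": "bt.png"}
```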
Right now, there are two different library architectures in use on Radiojar. One uses an SQLite farm and the other uses NoSQL technology. Our requirements were, as mentioned, security, search speed, and scalability, but we also needed a solution that would stay cost-effective as Radiojar grows: we want it to handle terabytes of data and hundreds of thousands of DJs.
We started with a simple farm of servers running SQLite and offering basic file storage. We use Python for the management processes and the Django framework to implement RESTful access. We create one folder for every DJ or radio station, in which both the database and the binary files are stored. A “farm manager” is responsible for distributing these files across server instances, in a process that is transparent to the end user.
SQLite is extremely fast for small record counts, so having a single database per user gives quite impressive search speed. A nice side effect of this architecture is that any data loss would end up affecting only a few users. This solution also makes user libraries distinct and portable. Scaling up is easy: to increase storage and user capacity, we just add more servers. Another advantage is that servers don’t need to be in the same cluster. They can be anywhere, which makes this solution flexible and cost-effective.
On the downside, the upload process is slow, because we need to process each track individually to create waveforms, normalize, analyze, verify, and so on. Taking backups is also complicated (the binary files are distributed across a number of servers), but the process has been automated.
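The per-track processing is essentially a pipeline of steps run in sequence on every upload, which is why it is slow. Here is a toy sketch of that idea; the step names and stub functions are purely illustrative:

```python
# Minimal sketch of per-upload processing: each step (waveform,
# normalization, analysis, verification, ...) runs on the file in turn.
def process_upload(path, steps):
    """Run every processing step on one uploaded track, collecting results."""
    results = {}
    for name, step in steps:
        results[name] = step(path)   # real steps would write derived files
    return results
```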
Our biggest issue with this architecture was that it could never support large shared libraries. This led us to our next solution, which combines cloud-based file storage with NoSQL metadata storage.
Managing Rackspace Cloud Files with Google App Engine
First, we needed to find a cloud storage provider. After much research we came to the conclusion that Rackspace Cloud Files is actually cheaper than other providers for our type of usage (big files, infrequent access). Since we had already started with Rackspace Cloud Servers for our live streaming radio servers, it was easy to integrate.
For the database we chose Google’s App Engine and its NoSQL datastore, mostly for its cost-effectiveness: it can be as scalable as we can afford. App Engine has its quirks but also many advantages, and since we had already started working on Python wrappers, the transition was quite easy.
The major challenges were in finding a way to process audio files (we can’t do that on App Engine or Cloud Files) and in limiting the bandwidth used on Cloud Files and App Engine as much as possible to minimize cost. As previously mentioned, music tracks are analyzed, transcoded, and so on upon uploading, so a middleware layer was introduced to handle this processing. To keep the system scalable, this middleware was created as a standalone system that also lives in the cloud. We are using Rackspace Cloud Server instances for that; they not only handle the file processing but also act as proxies for the binaries stored in Cloud Files. This way we can simultaneously process a great number of files and also restrict all access to the binaries except from our own radio servers. These Rackspace Cloud instances are managed by a custom server manager built in Java that also runs on Google App Engine (but that’s a topic for another blog post).
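The proxy role of the middleware boils down to a simple rule: only our own radio servers may fetch binaries from storage. A minimal sketch, assuming a whitelist of server addresses and a pluggable storage-fetch function (both hypothetical):

```python
# Sketch of the middleware's proxy check: binaries stored in the cloud
# are served only to whitelisted radio servers; end users are refused.
ALLOWED_RADIO_SERVERS = {"10.0.0.5", "10.0.0.6"}   # illustrative addresses

def proxy_request(client_ip, object_name, fetch_from_storage):
    """Return (status, body); serve a stored binary only to radio servers."""
    if client_ip not in ALLOWED_RADIO_SERVERS:
        return (403, b"")                          # end users never get the file
    return (200, fetch_from_storage(object_name))  # e.g. a Cloud Files GET
```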
This solution offers us unlimited storage that comes with a “pay as you go” price tag, so it’s great for bootstrapping. It can grow with us while providing a fail-proof, enterprise-level infrastructure right from the start. The library search speed we were getting from App Engine was an issue here, but we are talking about unlimited memcached 1GB pages, so it’s just a matter of good implementation and fine-tuning. Our middleware allows us to add library features in a plug-in-like way, for example Echonest data integration or audio normalization.
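The plug-in idea can be sketched as a simple registry of processing steps. This is a toy illustration, not our actual plugin API; the step names and the decorator are assumptions:

```python
# Illustrative plug-in registry: features register themselves under a
# name and are applied to each track in turn by the middleware.
PLUGINS = {}

def plugin(name):
    """Decorator that registers a library feature under a name."""
    def register(fn):
        PLUGINS[name] = fn
        return fn
    return register

@plugin("normalize")
def normalize(track):
    track["gain_db"] = 0.0   # placeholder result of audio normalization
    return track

def run_plugins(track):
    """Apply every registered feature to a track's metadata dict."""
    for fn in PLUGINS.values():
        track = fn(track)
    return track
```

Adding a new feature (say, Echonest data) then means registering one more function, without touching the upload pipeline itself.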
We’ve been testing this architecture for two months, and it was also used in our first commercial case study. Now it’s being used in all new Radiojar projects. We’re still offering the SQLite farm architecture for customers who want more closed solutions and don’t feel at ease with their data in the cloud.