This is the first technical post about Cull TV. Last year we started with a simple question that had no apparent answer: “Why do I have to choose between the dead-simple experience of turning on the television (and finding nothing interesting) and hunting and pecking through mountains of online content to find something to watch?”
Combining the ease of use of traditional television with modern approaches to content selection and discovery requires many moving parts interacting in real time. We run targeted, continuous web crawls to find out what’s being talked about, blogged about, and watched. The crawl results feed both a manual curation path, where we keep the flow of content high quality, and an automatic recommendation path (our Auto DJ) that delivers endless streams of video. Finally, the Cull app itself delivers the content and feeds knowledge about what you love back into the cycle. I’m a strong advocate of using the right tool for the job, but node.js keeps proving itself as a solid multi-purpose tool.
Building a classic MVC web app is the most straightforward of these components to do in node.js. Start with a framework like express, add logging, SMTP, OAuth, and cluster management, and you’re well on your way. The initial prototype of Cull was built with PHP and CodeIgniter. By the time we launched, both the frontend and the backend were 100% node.js, with dozens of new consumer-facing features and a 20% smaller codebase. We needed fewer unit tests, new engineers faced a shorter learning curve, and we had much more fine-grained control over what ran synchronously or asynchronously to get the performance we needed.
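To make the point about controlling what runs synchronously or asynchronously concrete, here is a minimal sketch rather than anything from the Cull codebase. It assumes a plain node-style `(req, res)` handler so that it stays self-contained (an express route would look similar), and `createHandler`, `loadChannel`, and `recordView` are hypothetical names invented for the illustration.

```javascript
// Hypothetical sketch: respond as soon as the required data is ready,
// and push non-critical work off the request path.
// `loadChannel` and `recordView` are stand-ins supplied by the caller.
function createHandler({ loadChannel, recordView }) {
  return async function handler(req, res) {
    try {
      // Required before we can respond, so we wait for it.
      const channel = await loadChannel(req.url);
      res.writeHead(200, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify(channel));

      // Not required for the response: defer it so it never delays the user.
      setImmediate(() => {
        try {
          recordView(req.url);
        } catch (err) {
          // Analytics failures are deliberately swallowed in this sketch.
        }
      });
    } catch (err) {
      res.writeHead(500, { 'Content-Type': 'text/plain' });
      res.end('Internal Server Error');
    }
  };
}

// Possible wiring with node's built-in http module (not started here):
// require('http').createServer(createHandler({ loadChannel, recordView })).listen(3000);
```

The design choice being illustrated is simply that the handler decides, line by line, which work blocks the response and which does not.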
Using node.js for web crawling was an obvious fit. Because we do targeted crawls of specific types of sites and extract much more semantic data than a general-purpose crawler would, we need many custom jobs, and the node.io library makes it easy to build them on top of a common framework. Keeping the crawled data and the work queue in a central storage cluster makes it easy to scale out across multiple machines as we examine more and more of the music-focused web.
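The general shape of a queue-driven crawl can be sketched in plain JavaScript. This is an illustration under stated assumptions, not the node.io API and not our actual jobs: `runCrawl`, `fetchPage`, and `extract` are hypothetical names, and the in-memory array stands in for the shared work queue that would live in the central storage cluster.

```javascript
// Hypothetical sketch: a crawl loop with bounded concurrency.
// `fetchPage(url)` returns a promise for the page body; `extract(url, html)`
// returns { record, links }. Both are supplied by the caller.
function runCrawl(seedUrls, { fetchPage, extract, concurrency = 4 }) {
  return new Promise((resolve) => {
    const queue = [...seedUrls];     // pending work (a shared store in production)
    const seen = new Set(seedUrls);  // never enqueue the same URL twice
    const results = [];
    let active = 0;                  // number of fetches currently in flight

    function pump() {
      // Start new fetches until we hit the concurrency limit or run out of work.
      while (active < concurrency && queue.length > 0) {
        const url = queue.shift();
        active++;
        Promise.resolve()
          .then(() => fetchPage(url))
          .then((html) => {
            const { record, links } = extract(url, html);
            results.push(record);
            for (const link of links) {
              if (!seen.has(link)) {
                seen.add(link);
                queue.push(link);
              }
            }
          })
          .catch((err) => {
            // Record the failure and keep crawling.
            results.push({ url, error: String((err && err.message) || err) });
          })
          .then(() => {
            active--;
            pump();
          });
      }
      // Done only when nothing is queued and nothing is in flight.
      if (active === 0 && queue.length === 0) {
        resolve(results);
      }
    }

    pump();
  });
}
```

Because each worker spends almost all of its time waiting on the network, a single node process can keep many fetches in flight; scaling out is then a matter of pointing more processes at the same queue.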
So far, all of this is pretty standard fare for node.js development. One area where we are treading new ground with node.js and JavaScript is our real-time recommender. The common wisdom is to use node for compute-light, I/O-heavy workloads, while machine learning and recommendations are often thought of as compute-heavy, CPU-bound tasks. In reality, the picture changes once your working set grows large enough, you go beyond simple collaborative filtering to content-based decisions (which require access to more and more data), and you apply several layers of rulesets, as most production-ready recommenders do. At that point you are I/O-bound again, and you worry more about building a massive data pipeline than about shaving off the last 10% of clock cycles.
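A toy version of that shape might look like the following. It is only a sketch under assumptions, not the Auto DJ itself: `recommend`, `loadFeatures`, the `tagWeights` user profile, and the tag-overlap score are all hypothetical and chosen to keep the example small. What it tries to show is where the time goes: the data lookups dominate, while the scoring and rule layers are cheap.

```javascript
// Hypothetical sketch of an I/O-bound recommendation pass.
// `loadFeatures(id)` stands in for a datastore lookup and returns a promise;
// each rule is a plain function (candidates, user) => candidates.
async function recommend(user, candidateIds, { loadFeatures, rules = [], limit = 10 }) {
  // I/O-heavy step: issue every feature lookup at once and wait for all of them.
  const features = await Promise.all(candidateIds.map((id) => loadFeatures(id)));

  // Compute-light step: a content-based score from the user's tag weights.
  let scored = candidateIds.map((id, i) => {
    const tags = features[i].tags || [];
    const score = tags.reduce((sum, tag) => sum + (user.tagWeights[tag] || 0), 0);
    return { id, score, features: features[i] };
  });

  // Rule layers run in order; each one may filter or re-score the candidates.
  for (const rule of rules) {
    scored = rule(scored, user);
  }

  // Highest score first, then keep only the requested number of ids.
  return scored
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map((candidate) => candidate.id);
}
```

A rule such as "drop videos the user has already watched" is then a one-line filter, for example `(list, user) => list.filter((c) => !user.watched.has(c.id))`, and more rules can be stacked without touching the scoring code.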
I hope this post gives some insight into the technology we use to power Cull and inspires more developers to give node.js a try. We have made a number of improvements to existing libraries on GitHub, and have released a few of our own that I will cover in an upcoming post. Happy hacking!
