Big sites and services like Yahoo, Facebook, Twitter and many others rely heavily on open source software to run their operations. Happily, this isn’t a one-way street. They are also giving back to the open source community, not just by contributing to existing projects, but sometimes by open sourcing their own internal projects, giving back something completely new.
And what these popular sites can contribute is often quite valuable. Since they tend to be very large, they run big operations and have been forced to create solutions for scalability and performance problems that most other sites simply don’t have to deal with.
This article lists a few of those projects, all made free and open source by companies like Facebook, Yahoo, LinkedIn, Twitter, and other big players.
Please note that this is not in any way a complete list of what these companies are contributing to open source.
Cassandra
Came out of: Facebook
What is it? Cassandra is a “NoSQL” distributed database management system designed to be able to handle data spread out over a very large number of servers. It’s now an Apache project with contributors (and users) like Facebook, Twitter, Rackspace and Digg. It looks like this might be one of the Next Big Things for scaling websites and is becoming a bit of a poster child for the NoSQL fans.
Project homepage: http://incubator.apache.org/cassandra/
HipHop for PHP
Came out of: Facebook
What is it? Released as open source as recently as last month, HipHop transforms PHP code into C++ and compiles it so that it runs faster. Facebook developed it because they use PHP a lot, and as a scripting language PHP is not ideal when it comes to performance. Improving PHP performance quickly adds up to significant savings for bigger sites, because fewer servers are needed to handle the same workload. For a site like Facebook, which uses tens of thousands of servers, the savings are huge. For example, it lets Facebook’s API handle twice as many requests while using 30% less CPU than before. The average CPU load on Facebook’s web servers has been cut in half.
Project homepage: http://wiki.github.com/facebook/hiphop-php/
Memcached
Came out of: LiveJournal
What is it? Memcached is a distributed memory caching system, often used to speed up database-driven websites. It’s used by a TON of sites, for example YouTube, LiveJournal, Wikipedia, Amazon, Facebook, Digg, Twitter, Reddit, and many more. We here at Pingdom use it for our uptime monitoring service.
Project homepage: http://www.memcached.org/
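To make the "speed up database-driven websites" part concrete, here is a minimal sketch of the usual cache-first lookup pattern. It does not use a real memcached client: `TinyCache` and `FakeDatabase` are hypothetical in-process stand-ins invented for this illustration, and the 60-second expiry is an arbitrary assumption.

```python
import time


class TinyCache:
    """Hypothetical in-process stand-in for a memcached client."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        # Drop the entry once its time-to-live has passed.
        if expires_at is not None and time.monotonic() >= expires_at:
            del self._store[key]
            return None
        return value

    def set(self, key, value, ttl=None):
        expires_at = time.monotonic() + ttl if ttl else None
        self._store[key] = (value, expires_at)


class FakeDatabase:
    """Hypothetical slow data source; counts how often it is queried."""

    def __init__(self):
        self.calls = 0

    def load_user(self, user_id):
        self.calls += 1
        return {"id": user_id, "name": f"user{user_id}"}


def get_user(cache, db, user_id):
    """Try the cache first; fall back to the database and remember the result."""
    key = f"user:{user_id}"
    user = cache.get(key)
    if user is None:
        user = db.load_user(user_id)
        cache.set(key, user, ttl=60)  # assumed expiry of 60 seconds
    return user
```

Calling `get_user` twice with the same ID only touches the database once; the second call is served from memory, which is where the speed-up comes from.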
Qizmt
Came out of: MySpace
What is it? Qizmt is a C# implementation of MapReduce running on Windows. Like all MapReduce implementations, it’s been designed to support distributed computing of big data sets on a large number of computers (clusters). It’s used internally by MySpace and has been made open source.
Project homepage: http://code.google.com/p/qizmt/
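For readers new to MapReduce, the idea can be sketched with a word count on a single machine. This is only an illustration of the programming model, not Qizmt’s API (or that of any other framework), and the function names are made up for this example. A real implementation runs the map and reduce steps in parallel across a cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Because each document is mapped independently and each key is reduced independently, both steps can be handed out to many machines, which is what makes the model suitable for big data sets.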
Kestrel
Came out of: Twitter
What is it? Kestrel is the distributed message queue used by Twitter. It’s based on Starling, Twitter’s previous message queue system, and is very similar to it. Kestrel was actually initially called “Scarling” (Starling ported to Scala).
Project homepage: http://github.com/robey/kestrel
Ruby on Rails
Came out of: 37signals
What is it? Ruby on Rails is a web application framework for the Ruby programming language, designed for rapid development. 37signals used it when developing their own apps (Basecamp, etc.) but later released it publicly as open source. It’s no exaggeration to say that it’s been a resounding success. Unlike the other projects listed here, though, its appeal has less to do with scalability than with ease of development.
Ruby on Rails is used for all of 37signals’ own web apps, such as Basecamp, Backpack and Campfire. Hulu, Scribd, GitHub and many others also use it. Another famous example is Twitter, which was originally a Ruby on Rails app (some of it still is).
Project homepage: http://rubyonrails.org/
Voldemort
Came out of: LinkedIn
What is it? Voldemort is a distributed key-value storage system (kind of a very simple database) that LinkedIn has developed internally to handle demanding, highly scalable storage needs for some of its functionality. It’s a relatively new project.
Project homepage: http://project-voldemort.com/
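One common way for a distributed key-value store to decide which server holds which key is consistent hashing. The sketch below illustrates that general technique only. It is not taken from Voldemort’s code, and the class name, node names and the choice of 100 virtual points per node are all assumptions made for this example.

```python
import bisect
import hashlib


def _hash(value):
    """Map a string to a large integer position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hash ring mapping keys to node names."""

    def __init__(self, nodes, replicas=100):
        # Each node gets several virtual points so keys spread more evenly.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self._hashes = [h for h, _ in self._ring]

    def node_for(self, key):
        """Return the node owning the first ring point after the key's hash."""
        idx = bisect.bisect(self._hashes, _hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

The useful property is that when a server is added or removed, only the keys that belonged to that server’s part of the ring move; everything else stays where it was.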
MediaWiki
Came out of: Wikipedia (Wikimedia)
What is it? MediaWiki is wiki software developed specifically with Wikipedia in mind, i.e. a very large wiki with a ton of content and users. In keeping with Wikipedia’s general approach, it’s been made free and open source, and the Wikimedia Foundation also uses it for its other projects.
Project homepage: http://www.mediawiki.org/
Hadoop
Came out of: Yahoo (kind of, see below)
What is it? Hadoop is a Java implementation of MapReduce and is widely used for scalable, distributed computing. The Hadoop project was actually started outside of Yahoo as part of a search engine project called Nutch and programmed by Doug Cutting. Yahoo hired Doug and became the driving force for the continued development of Hadoop, which has nevertheless remained an open source project at Apache. Hadoop was named after Doug’s son’s stuffed elephant.
Hadoop is used extensively inside Yahoo and by many other companies as well, for example Facebook, Twitter and Meebo.
Project homepage: http://hadoop.apache.org/
Nginx
Came out of: Rambler (one of Russia’s biggest web portals)
What is it? Nginx is a lightweight, high-performance web server that can also be used as a load balancer and caching server. It was developed by Igor Sysoev for use with Rambler’s services and was designed to be able to handle a huge number of simultaneous connections effectively. Nginx has been gaining popularity rapidly and is used by millions of websites in one capacity or another, including WordPress.com and Hulu. We actually wrote about nginx last week if you want to learn more.
Project homepage: http://nginx.org/
This article looked at open source projects that stemmed from internal projects at big websites and services. It should be noted (once again) that many of these companies contribute to more projects than are mentioned above, often in quite significant ways. For example, here is a page that lists Twitter’s open source contributions (don’t miss the wonderfully named Murder, Twitter’s distributed BitTorrent code deployment software), and here’s another one for Facebook.