What does Facebook consider an average day’s worth of data?

Well, according to this article from gigaom.com, an average day looks something like this:

  • 2.5 billion content items shared per day (status updates + wall posts + photos + videos + comments)
  • 2.7 billion Likes per day
  • 300 million photos uploaded per day
  • 100+ petabytes of disk space in one of FB’s largest Hadoop (HDFS) clusters
  • 105 terabytes of data scanned via Hive, Facebook’s Hadoop query language, every 30 minutes
  • 70,000 queries executed on these databases per day
  • 500+ terabytes of new data ingested into the databases every day
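To put the Hive number in perspective, here is a quick back-of-envelope calculation (my own arithmetic, not from the article): 105 terabytes every 30 minutes works out to roughly 5 petabytes scanned per day.

```python
# Back-of-envelope: how much does Hive scan per day at 105 TB per
# 30-minute window? (My arithmetic, not a figure from the article.)
tb_per_window = 105
windows_per_day = 24 * 2          # 48 half-hour windows in a day
tb_per_day = tb_per_window * windows_per_day
print(tb_per_day)                 # 5040 TB, i.e. roughly 5 petabytes a day
```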

I also love this quote from the VP of Infrastructure.

“If you aren’t taking advantage of big data, then you don’t have big data, you have just a pile of data,” said Jay Parikh, VP of infrastructure at Facebook on Wednesday. “Everything is interesting to us.”

Big Data for Small Business

I have said it before and will say it again: you don’t have to be a Fortune 500 company to use Big Data. Big Data is less about how big your data is and more about understanding it — knowing all your different data sources and gathering them into one place so that you can analyze and understand them better.


CentOS 6.4 service virt-who won’t start – workaround

Here is the problem.

[root@bob ~]# service virt-who start
Starting virt-who: Traceback (most recent call last):
  File "/usr/share/virt-who/virt-who.py", line 33, in <module>
    from subscriptionmanager import SubscriptionManager, SubscriptionManagerError
  File "/usr/share/virt-who/subscriptionmanager.py", line 24, in <module>
    import rhsm.connection as rhsm_connection
ImportError: No module named rhsm.connection
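The traceback just means Python can’t find the rhsm package. If you want to confirm the module is (or isn’t) importable without going through the virt-who init script, a generic check like this works — a diagnostic sketch of my own, not part of virt-who:

```python
# Check whether a Python module can be imported, without otherwise using it.
# Uses __import__ so it also runs on the Python 2.6 that CentOS 6 ships.
def module_available(name):
    """Return True if `name` can be imported in this interpreter."""
    try:
        __import__(name)
        return True
    except ImportError:
        return False

# Until python-rhsm is installed, this prints False:
print(module_available("rhsm.connection"))
```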



There is a simple workaround: install the Scientific Linux 6 python-rhsm package.


Name        : python-rhsm
Version     : 1.1.8                   Vendor     : Scientific Linux
Release     : 1.el6                   Date       : 2013-02-22 01:54:26
Group       : Development/Libraries   Source RPM : python-rhsm-1.1.8-1.el6.src.rpm
Size        : 0.27 MB
Packager    : Scientific Linux
Summary     : A Python library to communicate with a Red Hat Unified Entitlement Platform
Description :
A small library for communicating with the REST interface of a Red Hat Unified
Entitlement Platform. This interface is used for the management of system
entitlements, certificates, and access to content.


First, install python-simplejson:


[root@bob ~]# yum install python-simplejson


Then pick a mirror from http://rpm.pbone.net/index.php3/stat/4/idpl/20813982/dir/scientific_linux_6/com/python-rhsm-1.1.8-1.el6.x86_64.rpm.html, download python-rhsm-1.1.8-1.el6.x86_64.rpm, and install it:


[root@bob ~]# rpm --install python-rhsm-1.1.8-1.el6.x86_64.rpm


Then start virt-who:


[root@bob ~]# service virt-who start

Wrangling Customer Usage Data with Hadoop

Here is our session from the Hadoop Summit 2013.


Title: Wrangling Customer Usage Data with Hadoop

Slides: http://www.slideshare.net/Hadoop_Summit/hall-johnson-june271100amroom211v2


At Clearwire we have a big data challenge: processing millions of unique usage records comprising terabytes of data for millions of customers every week. Historically, massive purpose-built database solutions were used to process the data, but they weren’t particularly fast, nor did they lend themselves to analysis. As mobile data volumes increase exponentially, we needed a scalable solution that could process usage data for billing, provide a data analysis platform, and inexpensively store the data indefinitely. The solution? A Hadoop-based platform allowed us to architect and deploy an end-to-end solution, based on a combination of physical data nodes and virtual edge nodes, in less than six months. This solution allowed us to turn off our legacy usage processing system and reduce processing times from hours to as little as 15 minutes. This improvement has enabled Clearwire to deliver actionable usage data to partners faster and more predictably than ever before. Usage processing was just the beginning; we’re now turning to the raw data stored in Hadoop, adding new data sources, and starting to analyze the data. Clearwire is now able to put multiple data sources in the hands of our analysts for further discovery and actionable intelligence.