A SMALL cross-section of BIG Data

Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target currently ranging from a few dozen terabytes to many petabytes of data in a single data set.

IDC estimated the digital universe to be around 1.8 zettabytes by 2011.

How big is a zettabyte? It's one billion terabytes. The current world population is 7 billion - that is, if you give a hard disk of 250 billion GB for each person on the earth - still that storage won't be sufficient.

Many sources contribute to this flood of data...

1. The New York Stock Exchange generates about one terabyte of new trade data per day.
2. Facebook hosts approximately 10 billion photos taking up one petabytes of storage.
3. Ancestry.com, the genealogy site, store around 2.5 petabytes of data.
4. The Internet Archive stores around 2 petabytes of data, and is growing at a rate of 20 terabytes per month.
5. The Large Harden Collider near Geneva will produce about 15 petabytes of data per year.
6. Everyday people create the equivalent of 2.5 trillion bytes of data from sensors, mobile devices, online transactions & social networks.

Facebook, Yahoo! and Google found themselves collecting data on an unprecedented scale. They were the first massive companies collecting tons of data from millions of users.

They quickly overwhelmed traditional data systems and techniques like Oracle and MySql. Even the best, most expensive vendors using the biggest hardware could barely keep up and certainly couldn’t give them tools to powerfully analyze their influx of data.

In the early 2000’s they developed new techniques like MapReduce, BigTable and Google File System to handle their big data. Initially these techniques were held proprietary. But they realized making the concepts public, while keeping the implementations hidden, will benefit them - since more people will contribute to those and the graduates they hire will have a good understanding prior to joining.

Around 2004/2005 Facebook, Yahoo! and Google started sharing research papers describing their big data technologies.

In 2004 Google published the research paper "MapReduce: Simplified Data Processing on Large Clusters".

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in this paper.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Google's implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable.

A typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers and the system easy to use. Hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.

Doug Cutting who worked for Nutch, an open-source search technology project which are now managed through the Apache Software Foundation, read this paper published by Google and also another paper published by Google on Google's distributed file system [GFS]. He figured out GFS will solve their storage needs and MapReduce will solve the scaling issues they encountered with Nutch and implemented MapReduce and GFS. They named the GFS implementation for Nutch as the Nutch Distributed Filesystem [NDFS].

NDFS and the MapReduce implementation in Nutch were applicable beyond the realm of search, and in February 2006 they moved out of Nutch to form an independent sub project of Lucene called Hadoop and NDFS, became HDFS [Hadoop Distributed File System] - which is an implementation of GFS. During the same time Yahoo! extended their support for Hadoop and hired Doug Cutting.

At a very high-level, this is how HDFS works. Say we have a 300 MB file. [Hadoop also does really well with files of petabytes and terabytes.] The first thing HDFS is going to do is to split this up in to blocks. The default block size on HDFS right now is 128 MB. Once split-ed in to blocks we will have two blocks of 128 MB and another of 44 MB. Now HDFS will make 'n' number of ['n' is configurable - say 'n' is three] copies/replicas of each of these blocks. HDFS will now store these replicas in different DataNodes of the HDFS cluster. We also have a single NameNode, which keeps track of replicas and the DataNodes. NameNode knows where a given replica resides - whenever it detects a given replica is corrupted [DataNode keeps on running checksums on replicas] or the corresponding HDFS node is dowm, it will find out where else that replica is in the cluster and tells other nodes do 'n'X replication of that replica. The NameNode is a single point of failure - and two avoid that we can have secondary NameNode which in sync with the primary -and when primary is down - the secondary can take control. Hadoop project is currently working on implementing distributed NameNodes.

Again in 2006 Google published another paper on "Bigtable: A Distributed Storage System for Structured Data"

Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size, petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. This paper describes the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and describes the design and implementation of Bigtable.

BigTable maps two arbitrary string values (row key and column key) and timestamp (hence three dimensional mapping) into an associated arbitrary byte array. It is not a relational database and can be better defined as a sparse, distributed multi-dimensional sorted map.

Basically BigTable discussed how to build a distributed data store on top of GFS.

HBase by Hadoop is an implementation of BigTable. HBase is a distributed, column oriented database which is using HDFS for it's underlying storage and supports both batch-style computation using MapReduce and point queries.

Amazon, published a research paper in 2007 on "Dynamo: Amazon’s Highly Available Key-value Store".

Dynamo, is a highly available key-value storage system that some of Amazon’s core services use to provide an “always-on” experience. Apache Cassandra — brings together Dynamo's fully distributed design and BigTable's data model and written in Java - open sourced by Facebook in 2008. It is a NoSQL solution that was initially developed by Facebook and powered their Inbox Search feature until late 2010. In fact much of the initial development work on Cassandra was performed by two Dynamo engineers recruited to Facebook from Amazon. However, Facebook abandoned Cassandra in late 2010 when they built Facebook Messaging platform on HBase.

Also, besides using the way of modeling of BigTable, it has properties like eventual consistency, the Gossip protocol, a master-master way of serving the read and write requests that are inspired by Amazon's Dynamo. One of the important properties, the Eventual consistency - means that given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system and all the replicas will be consistent.

I used the term 'NoSQL' when talking about Cassandra. NoSQL (sometimes expanded to "not only SQL") is a broad class of database management systems that differ from the classic model of the relational database management system (RDBMS) in some significant ways. These data stores may not require fixed table schemas, usually avoid join operations, and typically scale horizontally.

The name "NoSQL" was in fact first used by Carlo Strozzi in 1998 as the name of file-based database he was developing. Ironically it's relational database just one without a SQL interface. The term re-surfaced in 2009 when Eric Evans used it to name the current surge in non-relational databases.

There are four categories of NoSQL databases.

1. Key-value stores : This is based on Amazon's Dynamo paper.
2. ColumnFamily / BigTable clones : Examples are HBase, Cassandra
3. Document Databases : Examples are CouchDB, MongoDB
4. Graph Database : Examples are AllegroGrapgh, Neo4j

As per Marin Dimitrov, following are the use cases for NoSQL databases - in other words following are the cases where relational databases do not perform well.

1. Massive Data Volumes
2. Extreme Query Volume
3. Schema Evolution

With NoSQL, we get the advantages like, Massive Scalability, High Availability, Lower Cost (than competitive solutions at that scale), Predictable elasticity and Schema flexibility.

For application programmers the major difference between relational databases and the Cassandra is it's data model - which is based on BigTable. The Cassandra data model is designed for distributed data on a very large scale. It trades ACID-compliant data practices for important advantages in performance, availability, and operational manageability.

If you want to compare Cassandra with HBase, then this is a good one. Another HBase vs Cassandra debate is here.

References :

[1]: MapReduce: Simplified Data Processing on Large Clusters
[2]: Bigtable: A Distributed Storage System for Structured Data
[3]: Dynamo: Amazon’s Highly Available Key-value Store
[4]: The Hadoop Distributed File System
[5]: ZooKeeper: Wait-free coordination for Internet-scale systems
[6]: An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics
[7]: Cassandra - A Decentralized Structured Storage System
[8]: NOSQL Patterns
[9]: BigTable Model with Cassandra and HBase
[10]: LinkedIn Tech Talks : Apache Hadoop - Petabytes and Terawatts
[11]: O'Reilly Webcast: An Introduction to Hadoop
[12]: Google Developer Day : MapReduce
[13]: WSO2Con 2011 - Panel: Data, data everywhere: big, small, private, shared, public and more
[14]: Scaling with Apache Cassandra
[15]: HBase vs Cassandra: why we moved
[16]: A Brief History of NoSQL

A SMALL cross-section of BIG Data

This Blog Is Not Updated Any More.

Check out my new blog on Medium!

Blog Archive