MapReduce with MongoDB

MapReduce is a software framework introduced by Google in 2004 to support distributed computing on large data sets on clusters of computers. You can read about MapReduce from here.

MongoDB is an open source document-oriented NoSQL database system written in C++. You can read more about MongoDB from here.

1. Installing MangoDB.

Follow the instructions from the MongoDB official documentation available here. In my case, I followed the instructions for OS X and it worked fine with no issues.

I used sudo port install mongodb to install MongoDB and one issue I faced was regarding to the xcode version I had. Basically I installed xcode while I was in OS X Leopard and didn't update the xcode to the latest after moving to Lion. Once I updated the xcode, I could install mongodb with MacPort with no issue. Another hint - sometime your xcode installation doesn't work fine when you directly install it from the App Store - what you could do is, get xcode from the App Store and then go to the Launch Pad, find Install Xcode and install it from there.

2. Running MongoDB

Starting MongoDB is simple..

Just type mogod in the terminal or in your command console.

By default this will start the MongoDB server on 27017 and will use the /data/db/ directory to store data - yes, that is directory that you created in step - 1.

In case you want to change those default settings - you can do it while starting the server.

mongod --port [your_port] --dbpath [your_db_file_path]

You need to make sure that your_db_file_path exists and its empty when you start the server for the first time...

3. Starting MongoDB shell

We can start MongoDB shell - to connect it to our MongoDB server and run commands from there.

To start the MongoDB shell to connect to the MongoDB server running on the same machine with the default ports you only need to type mongo in the command line. If you are running MongoDB server on a different machine with a different port use the following.

mongo [ip_address]:[port]

e.g : mongo localhost:4000

4. Let's create a Database first.

In the MangoDB shell type the following...
> use library

The above is supposed to create a database called 'library'.

Now to see whether your database been created, just type the following - which is supposed to list all the databases.
> show dbs;

You will notice that the database that you just created is not listed there. The reason is, MongoDB creates databases on-demand. It will get created only when we add something to it.

5. Inserting data to MongoDB.

Let's first create two books with the following commands.
> book1 = {name : "Understanding JAVA", pages : 100}
> book2 = {name : "Understanding JSON", pages : 200}

Now, let's insert these two books in to a collection called books.
> db.books.save(book1)
> db.books.save(book2)

The above two statements will create a collection called books under the database library. Following statement will list out the two books which we just saved.
> db.books.find();

{ "_id" : ObjectId("4f365b1ed6d9d6de7c7ae4b1"), "name" : "Understanding JAVA", "pages" : 100 }
{ "_id" : ObjectId("4f365b28d6d9d6de7c7ae4b2"), "name" : "Understanding JSON", "pages" : 200 }

Let's add few more records.
> book = {name : "Understanding XML", pages : 300}
> db.books.save(book)
> book = {name : "Understanding Web Services", pages : 400}
> db.books.save(book)
> book = {name : "Understanding Axis2", pages : 150}
> db.books.save(book)

6. Writing the Map function

Let's process this library collection in a way that, we need to find the number of books having pages less 250 pages and greater than that.
> var map = function() {
var category;
if ( this.pages >= 250 ) 
category = 'Big Books';
else 
category = "Small Books";
emit(category, {name: this.name});
};

Here, the collection produced by the Map function will have a collection of following members.
{"Big Books",[{name: "Understanding XML"}, {name : "Understanding Web Services"}]);
{"Small Books",[{name: "Understanding JAVA"}, {name : "Understanding JSON"},{name: "Understanding Axis2"}]);

7. Writing the Reduce function.
> var reduce = function(key, values) {
var sum = 0;
values.forEach(function(doc) {
sum += 1;
});
return {books: sum};
};

8. Running MapReduce against the books collection.
> var count  = db.books.mapReduce(map, reduce, {out: "book_results"});
> db[count.result].find()

{ "_id" : "Big Books", "value" : { "books" : 2 } }
{ "_id" : "Small Books", "value" : { "books" : 3 } } 

The above says, we have 2 Big Books and 3 Small Books.

Everything done above using the MongoDB shell, can be done with Java too. Following is the Java client for it. You can download the required dependent jar from here.
import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.MapReduceCommand;
import com.mongodb.MapReduceOutput;
import com.mongodb.Mongo;

public class MongoClient {

 /**
  * @param args
  */
 public static void main(String[] args) {

  Mongo mongo;
  
  try {
   mongo = new Mongo("localhost", 27017);
   DB db = mongo.getDB("library");

   DBCollection books = db.getCollection("books");

   BasicDBObject book = new BasicDBObject();
   book.put("name", "Understanding JAVA");
   book.put("pages", 100);
   books.insert(book);
   
   book = new BasicDBObject();  
   book.put("name", "Understanding JSON");
   book.put("pages", 200);
   books.insert(book);
   
   book = new BasicDBObject();
   book.put("name", "Understanding XML");
   book.put("pages", 300);
   books.insert(book);
   
   book = new BasicDBObject();
   book.put("name", "Understanding Web Services");
   book.put("pages", 400);
   books.insert(book);
 
   book = new BasicDBObject();
   book.put("name", "Understanding Axis2");
   book.put("pages", 150);
   books.insert(book);
   
   String map = "function() { "+ 
             "var category; " +  
             "if ( this.pages >= 250 ) "+  
             "category = 'Big Books'; " +
             "else " +
             "category = 'Small Books'; "+  
             "emit(category, {name: this.name});}";
   
   String reduce = "function(key, values) { " +
                            "var sum = 0; " +
                            "values.forEach(function(doc) { " +
                            "sum += 1; "+
                            "}); " +
                            "return {books: sum};} ";
   
   MapReduceCommand cmd = new MapReduceCommand(books, map, reduce,
     null, MapReduceCommand.OutputType.INLINE, null);

   MapReduceOutput out = books.mapReduce(cmd);

   for (DBObject o : out.results()) {
    System.out.println(o.toString());
   }
  } catch (Exception e) {
   // TODO Auto-generated catch block
   e.printStackTrace();
  }
 }
}