MapReduce with MongoDB

MapReduce is a software framework introduced by Google in 2004 to support distributed computing on large data sets on clusters of computers. You can read about MapReduce from here.

MongoDB is an open source document-oriented NoSQL database system written in C++. You can read more about MongoDB from here.

1. Installing MongoDB.

Follow the instructions from the MongoDB official documentation available here. In my case, I followed the instructions for OS X and it worked fine with no issues.

I used sudo port install mongodb to install MongoDB. One issue I faced was related to the Xcode version I had: I had installed Xcode while on OS X Leopard and hadn't updated it after moving to Lion. Once I updated Xcode, I could install MongoDB with MacPorts with no issues. Another hint - sometimes your Xcode installation doesn't work properly when you install it directly from the App Store. What you can do is get Xcode from the App Store, then go to Launchpad, find Install Xcode and install it from there.

2. Running MongoDB

Starting MongoDB is simple.

Just type mongod in the terminal or in your command console.

By default this will start the MongoDB server on port 27017 and will use the /data/db/ directory to store data - yes, that is the directory that you created in step 1.

In case you want to change those default settings - you can do it while starting the server.

mongod --port [your_port] --dbpath [your_db_file_path]

You need to make sure that your_db_file_path exists and is empty when you start the server for the first time.

3. Starting MongoDB shell

We can start the MongoDB shell to connect to our MongoDB server and run commands from there.

To start the MongoDB shell and connect to the MongoDB server running on the same machine with the default port, you only need to type mongo in the command line. If you are running the MongoDB server on a different machine or with a different port, use the following.

mongo [ip_address]:[port]

e.g : mongo localhost:4000

4. Let's create a Database first.

In the MongoDB shell type the following...
> use library

The above is supposed to create a database called 'library'.

Now, to see whether your database has been created, just type the following, which lists all the databases.
> show dbs;

You will notice that the database that you just created is not listed there. The reason is, MongoDB creates databases on-demand. It will get created only when we add something to it.

5. Inserting data into MongoDB.

Let's first create two books with the following commands.
> book1 = {name : "Understanding JAVA", pages : 100}
> book2 = {name : "Understanding JSON", pages : 200}

Now, let's insert these two books into a collection called books.
> db.books.save(book1);
> db.books.save(book2);

The above two statements will create a collection called books under the database library. The following statement will list the two books we just saved.
> db.books.find();

{ "_id" : ObjectId("4f365b1ed6d9d6de7c7ae4b1"), "name" : "Understanding JAVA", "pages" : 100 }
{ "_id" : ObjectId("4f365b28d6d9d6de7c7ae4b2"), "name" : "Understanding JSON", "pages" : 200 }

Let's add a few more records.
> book = {name : "Understanding XML", pages : 300}
> db.books.save(book);
> book = {name : "Understanding Web Services", pages : 400}
> db.books.save(book);
> book = {name : "Understanding Axis2", pages : 150}
> db.books.save(book);

6. Writing the Map function

Let's process this books collection in a way that we find the number of books having fewer than 250 pages and the number having 250 or more.
> var map = function() {
    var category;
    if (this.pages >= 250)
        category = 'Big Books';
    else
        category = 'Small Books';
    emit(category, {name: this.name});
};

Here, the output produced by the map function will have the following members.
{"Big Books" : [{name : "Understanding XML"}, {name : "Understanding Web Services"}]}
{"Small Books" : [{name : "Understanding JAVA"}, {name : "Understanding JSON"}, {name : "Understanding Axis2"}]}

7. Writing the Reduce function.
> var reduce = function(key, values) {
    var sum = 0;
    values.forEach(function(doc) {
        sum += 1;
    });
    return {books: sum};
};

8. Running MapReduce against the books collection.
> var count  = db.books.mapReduce(map, reduce, {out: "book_results"});
> db[count.result].find()

{ "_id" : "Big Books", "value" : { "books" : 2 } }
{ "_id" : "Small Books", "value" : { "books" : 3 } } 

The above says, we have 2 Big Books and 3 Small Books.
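
The reduce step simply counts one per emitted value. The same arithmetic as a standalone Java sketch (page counts hard-coded from the books above), which reproduces the 2/3 split:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ReduceStepSketch {

    public static void main(String[] args) {
        // Page counts of the five books in the collection.
        int[] pages = {100, 200, 300, 400, 150};

        // reduce: for each key, add 1 per emitted value -> a count per category.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int p : pages) {
            String category = p >= 250 ? "Big Books" : "Small Books";
            counts.merge(category, 1, Integer::sum);
        }

        System.out.println(counts); // prints {Small Books=3, Big Books=2}
    }
}
```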

Everything done above using the MongoDB shell can be done with Java too. The following is the Java client for it. You can download the required dependency jar from here.
import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.MapReduceCommand;
import com.mongodb.MapReduceOutput;
import com.mongodb.Mongo;

public class MongoClient {

 public static void main(String[] args) {

  Mongo mongo;
  try {
   mongo = new Mongo("localhost", 27017);
   DB db = mongo.getDB("library");

   DBCollection books = db.getCollection("books");

   BasicDBObject book = new BasicDBObject();
   book.put("name", "Understanding JAVA");
   book.put("pages", 100);
   books.insert(book);
   book = new BasicDBObject();
   book.put("name", "Understanding JSON");
   book.put("pages", 200);
   books.insert(book);
   book = new BasicDBObject();
   book.put("name", "Understanding XML");
   book.put("pages", 300);
   books.insert(book);
   book = new BasicDBObject();
   book.put("name", "Understanding Web Services");
   book.put("pages", 400);
   books.insert(book);
   book = new BasicDBObject();
   book.put("name", "Understanding Axis2");
   book.put("pages", 150);
   books.insert(book);

   String map = "function() { "+
             "var category; " +
             "if ( this.pages >= 250 ) "+
             "category = 'Big Books'; " +
             "else " +
             "category = 'Small Books'; "+
             "emit(category, {name: this.name});}";
   String reduce = "function(key, values) { " +
                            "var sum = 0; " +
                            "values.forEach(function(doc) { " +
                            "sum += 1; "+
                            "}); " +
                            "return {books: sum};} ";
   MapReduceCommand cmd = new MapReduceCommand(books, map, reduce,
     null, MapReduceCommand.OutputType.INLINE, null);

   MapReduceOutput out = books.mapReduce(cmd);

   for (DBObject o : out.results()) {
    System.out.println(o.toString());
   }
  } catch (Exception e) {
   e.printStackTrace();
  }
 }
}
Enabling SAML2 SSO for web apps deployed on Tomcat with WSO2 Identity Server IdP

1. Download the sample web app from here and copy it to [CATALINA_HOME]\webapps

2. Extract the sso-webapp.war, search for [IS_HOME] in sso-webapp\WEB-INF\web.xml and change it appropriately, pointing to the WSO2 Identity Server extracted location.

e.g : /Users/prabath/releases/wso2is-3.2.3/repository/resources/security/wso2carbon.jks

3. Start Apache Tomcat [This post assumes Tomcat runs on port 8080]

4. Download the latest WSO2 Identity Server from here.

5. Start WSO2 Identity Server [IS] - I assume here that the Identity Server is running on the default port, 9443. If not, you need to change the corresponding entry in [CATALINA_HOME]\webapps\sso-webapp\WEB-INF\web.xml.

sh [IS_HOME]\bin\

6. Login to the IS management console with admin/admin

7. Go to Main/Manage/SAML SSO

8. Fill the form with the following values and press Add.

Issuer : ssowebapp
Assertion Consumer URL : http://localhost:8080/sso-webapp/acs [This is where your sample web app is running]
Enable Single Logout : Checked

Keep the rest as default.

9. That's it... To try it out, visit http://localhost:8080/sso-webapp/index.jsp

You can find further details on this use case from here.

The Twitter API Management Model

The objective of this blog post is to explore in detail the patterns and practices Twitter has used in its API management.

Twitter comes with a comprehensive set of REST APIs to let client apps talk to Twitter.

Let's take a few examples...

If you use the following with cURL, it returns the 20 most recent statuses, including retweets if they exist, from non-protected users. The public timeline is cached for 60 seconds. Requesting it more frequently than that will not return any more data, and will count against your rate limit usage.

curl http://api.twitter.com/1/statuses/public_timeline.json

The example above is an open API - it requires no authentication from the client who accesses it. But keep in mind, it still has a throttling policy associated with it: the rate limit. For example, the throttling policy associated with the statuses/public_timeline.json API could say, only allow a maximum of 20 API calls from the same IP address within a given time window. This policy is a global policy for the API.
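
Such a per-IP rate limit can be sketched as a fixed-window counter. This is an illustration only - the 20-calls-per-60-seconds numbers below are a made-up policy, not Twitter's actual limits:

```java
import java.util.HashMap;
import java.util.Map;

public class FixedWindowRateLimiter {

    private final int limit;          // max calls allowed per window
    private final long windowMillis;  // window length in milliseconds
    // client IP -> {window start time, calls made in this window}
    private final Map<String, long[]> state = new HashMap<>();

    public FixedWindowRateLimiter(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    public synchronized boolean allow(String ip, long nowMillis) {
        long[] s = state.computeIfAbsent(ip, k -> new long[]{nowMillis, 0});
        if (nowMillis - s[0] >= windowMillis) { // window expired: reset it
            s[0] = nowMillis;
            s[1] = 0;
        }
        if (s[1] >= limit) {
            return false; // over the limit for this window
        }
        s[1]++;
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical policy: 20 calls per IP per 60-second window.
        FixedWindowRateLimiter limiter = new FixedWindowRateLimiter(20, 60_000);
        boolean allowed = true;
        for (int i = 0; i < 20; i++) {
            allowed &= limiter.allow("10.0.0.1", 0);
        }
        System.out.println(allowed);                           // true: first 20 pass
        System.out.println(limiter.allow("10.0.0.1", 1));      // false: 21st rejected
        System.out.println(limiter.allow("10.0.0.1", 60_001)); // true: new window
    }
}
```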

1. Twitter has open APIs - where anonymous users can access.
2. Twitter has globally defined policies per API.

Let's take another sample API - statuses/retweeted_by_user - it returns the 20 most recent retweets posted by the specified user, given that the user's timeline is not protected. This is another open API.

But, what if I want to post to my twitter account? I could use the API statuses/update. This updates the authenticating user's status and this is not an open API. Only the authenticated users can access this.

How do we authenticate ourselves to access the Twitter API?

Twitter has supported two methods: one is BasicAuth over HTTPS and the other is OAuth 1.0a.

BasicAuth support was removed recently and now the only remaining way is with OAuth 1.0a. As of this writing Twitter doesn't support OAuth 2.0.

Why would I need the APIs exposed by Twitter? I have some external applications that want to talk to Twitter, and these applications use the Twitter APIs for communication.

If I am the application developer, the following are the steps I need to follow to build my application to access protected APIs from Twitter.

First, the application developer needs to log in to Twitter and create an Application.

Here, the Application is an abstraction for a set of protected APIs Twitter exposes outside.

Each Application you create needs to define the level of access it needs to those underlying APIs. There are three values to pick from.

- Read only
- Read and Write
- Read, Write and Access direct messages

Let's see what these values mean...

If you pick 'Read only', that means a user who is going to use your Application needs to give it permission to read. In other words, the user will be giving it access to invoke the APIs defined here which start with GET, against his Twitter account. The only exception is the Direct Messages APIs - with Read only, your Application won't have access to a given user's Direct Messages, not even the GETs.

The above is valid for you, the application developer, as well. If you want your application to access your own Twitter account, you too have to give the application the required rights.

If you pick Read and Write, that means a user who is going to use your application needs to give it permission to read and write. In other words, the user will be giving it access to invoke the APIs defined here which start with GET or POST, against his Twitter account. The only exception is still the Direct Messages APIs - even with Read and Write, your application won't have access to a given user's Direct Messages, neither GETs nor POSTs.

3. Twitter has an Application concept that groups APIs together.
4. Each API declares the actions it supports: GET or POST.
5. Each Application has a required access level for it to function [Read only; Read and Write; or Read, Write and Access direct messages].

Now let's dig into the runtime aspects of this. I am going to skip the OAuth-related details here on purpose, for clarity.

For our Application to access the Twitter APIs, it needs a key. Let's name it API_KEY [if you know OAuth, this is equivalent to the access_token]. Say I want to use this Application. First I need to go to Twitter and generate an API_KEY to access it. Although there are multiple APIs wrapped in the Application, I only need a single API_KEY.

6. An API_KEY is per user, per Application [a collection of APIs].

When I generate the API_KEY, I can specify what level of access I am going to give that API_KEY: Read Only; Read & Write; or Read, Write & Direct Messages. Based on the access level, I can generate my API_KEY.

7. API_KEY carries permissions to the underlying APIs.

Now I give my API_KEY to the Application. Say the Application tries to POST to my Twitter timeline - that request should also include the API_KEY. Once Twitter gets the request, by looking at the API_KEY it will identify that the Application is trying to POST to 'my' timeline. It will also check whether the API_KEY has Read & Write permissions - if so, it will let the Application post to my Twitter timeline.

If the Application tries to read my Direct Messages using the same API_KEY I gave it, then Twitter will detect that the API_KEY doesn't have the Read, Write & Direct Messages permission, and the request will fail.
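
That permission check can be sketched with an ordered access level attached to the key - the enum names here are illustrative, not Twitter's actual identifiers:

```java
public class AccessLevelCheck {

    // Ordered: each level includes everything the previous one allows.
    enum AccessLevel { READ, READ_WRITE, READ_WRITE_DIRECT_MESSAGES }

    // GETs need READ, POSTs need READ_WRITE, and anything touching
    // Direct Messages needs the full level.
    static boolean isAllowed(AccessLevel keyLevel, AccessLevel required) {
        return keyLevel.ordinal() >= required.ordinal();
    }

    public static void main(String[] args) {
        // A key generated with Read & Write access.
        AccessLevel apiKey = AccessLevel.READ_WRITE;

        System.out.println(isAllowed(apiKey, AccessLevel.READ));       // true: GET the timeline
        System.out.println(isAllowed(apiKey, AccessLevel.READ_WRITE)); // true: POST a status
        // Reading Direct Messages needs the top level, so this fails:
        System.out.println(isAllowed(apiKey, AccessLevel.READ_WRITE_DIRECT_MESSAGES)); // false
    }
}
```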

Even in the above case, if the Application tries to post to the Application developer's Twitter account, it also needs an API_KEY from the Application developer, which he can get from Twitter.

Once a user grants access to an Application via its API_KEY, the application can access his account during the entire lifetime of the key. But if the user wants to revoke the key, Twitter provides a way to do that as well. Basically, when you go here, it displays all the Applications you have given permission to access your Twitter account - if you want, you can revoke access from there.

8. Twitter lets users revoke API_KEYs

Another interesting thing is how Twitter does API versioning. If you look carefully at the URLs, you will notice that the version number is included in the URL itself - but Twitter does not let Application developers pick which versions of the APIs they want to use.

9. Twitter tracks API versions at runtime.
10. Twitter does not let Application developers pick API versions at design time.
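
As a small illustration of why the server can do this at runtime, pulling the version out of such a URL is trivial when it is the first path segment (the path below follows the documented pattern, but the exact format is an assumption):

```java
public class ApiVersionFromUrl {

    // Assumes the version is the first path segment, e.g. /1/statuses/....
    static String versionOf(String path) {
        String[] segments = path.split("/");
        return segments.length > 1 ? segments[1] : "";
    }

    public static void main(String[] args) {
        System.out.println(versionOf("/1/statuses/public_timeline.json")); // prints 1
    }
}
```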

Twitter also has a way of monitoring the status of the API. The following shows a screenshot of it.

11. Twitter does API monitoring at runtime.