Wednesday, 11 March 2015

Components of Hadoop

A running Hadoop cluster consists of a set of daemons. Some of these run on a single server, whereas others run across multiple servers. These daemons include:
  1. Namenode
  2. Secondary Namenode
  3. Datanode
  4. Jobtracker
  5. Tasktracker
Namenode:
The Namenode is responsible for managing the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the HDFS cluster. This information is stored persistently on the local disk in two files: the namespace image and the edit log. The Namenode also knows the Datanodes on which all the blocks for a given file are located. However, it does not store block locations persistently, since this information is reconstructed from the Datanodes when the system starts. A client accesses the filesystem on behalf of the user by communicating with the Namenode and the Datanodes.

The Namenode is a single point of failure for the Hadoop cluster. It is therefore necessary to make the Namenode resilient to failure. There are two ways of doing this. The first is to configure Hadoop so that it writes backups of the persistent state of the filesystem metadata to multiple filesystems. The second is to run a Secondary Namenode.

Secondary Namenode:
The Secondary Namenode periodically merges the namespace image with the edit log and keeps a copy of the merged namespace image. It usually runs on a separate machine. However, the state of the Secondary Namenode lags behind that of the primary Namenode, so if the primary Namenode fails, some data loss is almost certain.
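The merge step can be pictured with a toy model. The sketch below is not Hadoop code: the class CheckpointSketch, the merge method and the text-based "ADD"/"DELETE" edit format are all made up for illustration. It only shows how replaying an edit log on top of a namespace image produces a new image.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Toy model only: the real namespace image and edit log are files with far more detail.
class CheckpointSketch {

    // Replays each edit ("ADD <path>" or "DELETE <path>") onto a copy of the image.
    static TreeSet<String> merge(TreeSet<String> image, List<String> editLog) {
        TreeSet<String> merged = new TreeSet<String>(image);
        for (String edit : editLog) {
            String[] parts = edit.split(" ", 2);
            if (parts[0].equals("ADD")) {
                merged.add(parts[1]);
            } else if (parts[0].equals("DELETE")) {
                merged.remove(parts[1]);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        TreeSet<String> image = new TreeSet<String>();
        image.add("/data/a.txt");
        image.add("/data/b.txt");

        List<String> editLog = new ArrayList<String>();
        editLog.add("ADD /data/c.txt");
        editLog.add("DELETE /data/a.txt");

        // The merged result plays the role of the new namespace image.
        System.out.println(merge(image, editLog)); // [/data/b.txt, /data/c.txt]
    }
}
```

Any edit made after the last merge exists only in the primary Namenode's edit log, which is why the copy held by the Secondary Namenode lags behind.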

Datanode:
The Datanodes are the workhorses of the filesystem. They store and retrieve blocks when requested by clients or the Namenode. Each Datanode periodically reports to the Namenode with the list of blocks that it stores.
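To see how block locations can be rebuilt from these reports instead of being stored on disk, here is a small sketch. It is not the Hadoop implementation: the class BlockReportSketch, the method buildBlockMap and the Datanode names are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model only: block ids are plain numbers and Datanodes are plain names.
class BlockReportSketch {

    // Turns "datanode -> blocks it stores" into "block -> datanodes that store it".
    static Map<Long, Set<String>> buildBlockMap(Map<String, List<Long>> reports) {
        Map<Long, Set<String>> locations = new TreeMap<Long, Set<String>>();
        for (Map.Entry<String, List<Long>> report : reports.entrySet()) {
            for (Long blockId : report.getValue()) {
                Set<String> holders = locations.get(blockId);
                if (holders == null) {
                    holders = new TreeSet<String>();
                    locations.put(blockId, holders);
                }
                holders.add(report.getKey());
            }
        }
        return locations;
    }

    public static void main(String[] args) {
        Map<String, List<Long>> reports = new HashMap<String, List<Long>>();
        reports.put("datanode1", Arrays.asList(1L, 2L));
        reports.put("datanode2", Arrays.asList(2L, 3L));

        System.out.println(buildBlockMap(reports));
        // {1=[datanode1], 2=[datanode1, datanode2], 3=[datanode2]}
    }
}
```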

All of the above daemons are called storage daemons, since they handle operations related to storing files on HDFS. The storage daemons follow a master-slave architecture, with the Namenode acting as master and the Datanodes acting as slaves. Next come the compute daemons. They also follow a master-slave architecture, with the Jobtracker acting as master and the Tasktrackers acting as slaves.

Jobtracker:
The Jobtracker coordinates all the jobs that run on the system by scheduling their tasks to run on Tasktrackers. If a task fails, the Jobtracker is responsible for rescheduling it on a different Tasktracker.

Tasktracker:
Tasktrackers run the tasks allocated to them and send progress reports to the Jobtracker, which keeps a record of the overall progress of each job.

The diagram below shows the topology of a Hadoop cluster:




Monday, 2 March 2015

Usage of hashCode() and equals() methods in Java

In this post, we will try to understand the hashCode() and equals() methods in Java.
These methods are defined in the Object class and are therefore available to all Java classes. Hash-based collections such as Hashtable, HashMap and HashSet use these two methods to store and retrieve objects.
  • hashCode()
  • equals()

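Usage of hashCode() and equals()

hashCode():
This method returns an integer hash code for the given object. Hash-based collections such as HashMap and HashSet use this integer to find the bucket in which to store the object. The hash code is not guaranteed to be unique: two different objects may return the same value. By default, the returned integer is typically derived from the object's identity. It is often described as a representation of the memory address where the object is stored, although that is an implementation detail.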

equals():
This method checks whether two objects are equal. By default, it checks whether two references refer to the same object (==).
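A quick way to see the default behaviour is to try it on a class that overrides neither method. The class names DefaultBehaviourDemo and Point below are made up for this sketch.

```java
class DefaultBehaviourDemo {

    // Overrides neither equals() nor hashCode(), so it inherits both from Object.
    static class Point {
        final int x;

        Point(int x) {
            this.x = x;
        }
    }

    public static void main(String[] args) {
        Point p1 = new Point(1);
        Point p2 = new Point(1);
        Point alias = p1;

        System.out.println(p1.equals(p2));    // false: different objects, same contents
        System.out.println(p1.equals(alias)); // true: both references point to one object
        System.out.println(p1.hashCode() == alias.hashCode()); // true: same object
    }
}
```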

Let's override the default implementations of hashCode() and equals():

If you don't override these methods, the default identity-based behaviour applies, which is fine for many classes. Sometimes, however, you need to override them, e.g. when you want two employee objects to be considered equal if both have the same email id.
Let's see this with the help of an example. We have a class called Emp.
1. Emp.java

package com.csamples;

public class Emp {
    int eid;
    String ename, email;
    public int getEid() {
        return eid;
    }
    public void setEid(int eid) {
        this.eid = eid;
    }
    public String getEname() {
        return ename;
    }
    public void setEname(String ename) {
        this.ename = ename;
    }
    public String getEmail() {
        return email;
    }
    public void setEmail(String email) {
        this.email = email;
    }
  
}

This Emp class has three basic attributes: eid, ename and email.
Now create a class called EqualityCheckMain (EqualityCheckMain.java):
package com.csamples;
/**
 * @author skummitha
 *
 */
public class EqualityCheckMain {

    public static void main(String[] args) {
        
        Emp emp1=new Emp();
        emp1.setEid(101);
        emp1.setEname("Sreenu");
        emp1.setEmail("sreenu.vas2004@gmail.com");
        Emp emp2=new Emp();
        emp2.setEid(102);
        emp2.setEname("SRK");
        emp2.setEmail("sreenu.vas2004@gmail.com");
        System.out.println("Is emp1 is equal to emp2:" +emp1.equals(emp2));
    }
}
When you run the above program, you will get the following output:
                  Is emp1 is equal to emp2:false
 
In the above program, we have created two different objects and set the email attribute of both to "sreenu.vas2004@gmail.com".
Because the references emp1 and emp2 point to different objects, and the default implementation of equals() checks for ==, the equals() method returns false. In real life it should return true, because no two different employees can have the same email.
Now let's override equals() so that it returns true if the email addresses of two employees are the same.
Add this method to the above Emp class:
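    @Override
    public boolean equals(Object obj) {
        // The same reference is always equal to itself.
        if (this == obj) {
            return true;
        }
        // Also covers obj == null, because instanceof is false for null.
        if (!(obj instanceof Emp)) {
            return false;
        }
        Emp other = (Emp) obj;
        // Two employees are equal when their emails match; a null email never matches.
        return email != null && email.equals(other.email);
    }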
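Now run EqualityCheckMain.java again.
You will get the following output:
                 Is emp1 is equal to emp2:true
This is because the overridden equals() method returns true if two employees have the same email.
One thing to remember here: the signature of the equals() method must be the same as above, i.e. it must take a parameter of type Object. A method declared as equals(Emp) overloads equals() instead of overriding it.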
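Why the parameter type matters is easiest to see with a small experiment. The sketch below uses a made-up class Badge (not the Emp class from this post) that declares equals(Badge) instead of equals(Object).

```java
class OverloadPitfallDemo {

    // Hypothetical class: equals(Badge) overloads Object.equals(Object), it does not override it.
    static class Badge {
        final String code;

        Badge(String code) {
            this.code = code;
        }

        // Adding @Override here would be a compile error, which is a useful safety net.
        public boolean equals(Badge other) {
            return other != null && code.equals(other.code);
        }
    }

    public static void main(String[] args) {
        Badge a = new Badge("X1");
        Badge b = new Badge("X1");
        Object bAsObject = b;

        System.out.println(a.equals(b));         // true: calls equals(Badge)
        System.out.println(a.equals(bAsObject)); // false: calls the inherited equals(Object)
    }
}
```

Collections such as HashMap and HashSet always call equals(Object), so the overloaded version would never be used by them.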

Let's put these Emp objects in a HashMap:

Here we are going to use Emp objects as keys and the employees' full names (strings) as values in a HashMap.
package com.csamples;

import java.util.HashMap; 
import java.util.Iterator; 
 
/**
 * @author skummitha
 *
 */
public class HashMapEqualityCheckMain { 
 

    public static void main(String[] args) { 
        HashMap<Emp,String> empNamelMap=new HashMap<Emp,String>();  
        Emp emp1=new Emp(); 
        emp1.setEid(101);
        emp1.setEname("Sreenu");
        emp1.setEmail("sreenu.vas2004@gmail.com");
        Emp emp2=new Emp(); 
        emp2.setEid(102);
        emp2.setEname("SRK");
        emp2.setEmail("sreenu.vas2004@gmail.com");
 
        empNamelMap.put(emp1, "Sree Reddy"); 
        empNamelMap.put(emp2, "SRK Reddy"); 
 
        Iterator<Emp> empFullNameIter=empNamelMap.keySet().iterator(); 
        while(empFullNameIter.hasNext()) 
        { 
            Emp empObj=empFullNameIter.next(); 
            String empFullName=empNamelMap.get(empObj); 
            System.out.println("Full Name of "+ empObj.getEname()+"----"+empFullName); 
 
        } 
    }  
}
When you run the above program, you will see the following output (the order of the lines may vary, since HashMap does not guarantee iteration order):

Full Name of SRK----SRK Reddy
Full Name of Sreenu----Sree Reddy
Now you must be wondering why the HashMap contains two key-value pairs instead of one, even though the two objects are equal. This is because HashMap first uses hashCode() to find the bucket for a key object, and only compares keys with equals() when their hash codes are the same. The two employee objects above still use the default hashCode() method, so they will almost certainly have different hash codes and are treated as different keys.
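The two-step lookup can be sketched with a very small hand-written map. This is only an illustration under simplifying assumptions (a fixed number of buckets, no resizing, no null keys); the class TinyHashMap is made up for this post, and java.util.HashMap is considerably more elaborate.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration only, not how java.util.HashMap is actually written.
class TinyHashMap<K, V> {
    private static final int BUCKETS = 16;

    private static final class Entry<K, V> {
        final K key;
        V value;

        Entry(K key, V value) {
            this.key = key;
            this.value = value;
        }
    }

    private final List<List<Entry<K, V>>> buckets = new ArrayList<List<Entry<K, V>>>();
    private int size;

    TinyHashMap() {
        for (int i = 0; i < BUCKETS; i++) {
            buckets.add(new ArrayList<Entry<K, V>>());
        }
    }

    void put(K key, V value) {
        // Step 1: hashCode() selects the bucket.
        List<Entry<K, V>> bucket = buckets.get(Math.floorMod(key.hashCode(), BUCKETS));
        // Step 2: equals() is consulted only for entries already in that bucket.
        for (Entry<K, V> entry : bucket) {
            if (entry.key.equals(key)) {
                entry.value = value; // equal key found: replace the value
                return;
            }
        }
        bucket.add(new Entry<K, V>(key, value));
        size++;
    }

    int size() {
        return size;
    }

    public static void main(String[] args) {
        TinyHashMap<String, String> map = new TinyHashMap<String, String>();
        map.put("sreenu", "Sree Reddy");
        map.put("sreenu", "SRK Reddy"); // equal key with an equal hash code: value replaced
        map.put("srk", "SRK Reddy");
        System.out.println(map.size()); // 2
    }
}
```

If two equal keys returned different hash codes, step 1 could send them to different buckets and step 2 would never get the chance to compare them.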
Now let's override the hashCode() method. Add the following method to the Emp class:
    @Override
    public int hashCode() {
        // Uses the same attribute as equals(); returns 0 when no email is set.
        return email == null ? 0 : email.hashCode();
    }
Now run HashMapEqualityCheckMain.java again.
You will see the following output:
 
           Full Name of Sreenu----SRK Reddy
Now the hash codes of emp1 and emp2 are the same, so both map to the same bucket. The equals() method is then used to compare them, and it returns true, so the second put() replaces the value stored for the existing key instead of adding a new entry.
This is why the Javadoc for equals() notes that it is generally necessary to override hashCode() whenever equals() is overridden.
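The same effect shows up in a HashSet. The sketch below compares two made-up classes, EqualsOnly and EqualsAndHash, that differ only in whether hashCode() is overridden. They are stand-ins for the example, not the Emp class itself.

```java
import java.util.HashSet;
import java.util.Set;

class HashSetDemo {

    // Hypothetical key class that overrides equals() but not hashCode().
    static class EqualsOnly {
        final String email;

        EqualsOnly(String email) {
            this.email = email;
        }

        @Override
        public boolean equals(Object obj) {
            return obj instanceof EqualsOnly && email.equals(((EqualsOnly) obj).email);
        }
    }

    // Same idea, but hashCode() is derived from the same field as equals().
    static class EqualsAndHash {
        final String email;

        EqualsAndHash(String email) {
            this.email = email;
        }

        @Override
        public boolean equals(Object obj) {
            return obj instanceof EqualsAndHash && email.equals(((EqualsAndHash) obj).email);
        }

        @Override
        public int hashCode() {
            return email.hashCode();
        }
    }

    public static void main(String[] args) {
        Set<EqualsOnly> broken = new HashSet<EqualsOnly>();
        broken.add(new EqualsOnly("a@example.com"));
        broken.add(new EqualsOnly("a@example.com"));
        // Typically prints 2: the equal objects usually get different default hash codes.
        System.out.println(broken.size());

        Set<EqualsAndHash> fixed = new HashSet<EqualsAndHash>();
        fixed.add(new EqualsAndHash("a@example.com"));
        fixed.add(new EqualsAndHash("a@example.com"));
        System.out.println(fixed.size()); // 1
    }
}
```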

hashCode() and equals() contracts:

equals():

The equals method implements an equivalence relation on non-null object references:
  • It is reflexive: for any non-null reference value x, x.equals(x) should return true.
  • It is symmetric: for any non-null reference values x and y, x.equals(y) should return true if and only if y.equals(x) returns true.
  • It is transitive: for any non-null reference values x, y, and z, if x.equals(y) returns true and y.equals(z) returns true, then x.equals(z) should return true.
  • It is consistent: for any non-null reference values x and y, multiple invocations of x.equals(y) consistently return true or consistently return false, provided no information used in equals comparisons on the objects is modified.
  • For any non-null reference value x, x.equals(null) should return false.
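Symmetry is the rule that is easiest to break by accident. In the sketch below, a made-up class Tag tries to be "friendly" by also comparing itself with plain Strings. The class and its field are assumptions for this example only.

```java
class SymmetryDemo {

    // Hypothetical class whose equals() also accepts plain Strings.
    static class Tag {
        final String name;

        Tag(String name) {
            this.name = name;
        }

        @Override
        public boolean equals(Object obj) {
            if (obj instanceof Tag) {
                return name.equals(((Tag) obj).name);
            }
            if (obj instanceof String) {
                return name.equals(obj); // this branch breaks symmetry
            }
            return false;
        }

        @Override
        public int hashCode() {
            return name.hashCode();
        }
    }

    public static void main(String[] args) {
        Tag x = new Tag("hadoop");
        String y = "hadoop";

        System.out.println(x.equals(y));    // true
        System.out.println(y.equals(x));    // false: String.equals() only accepts Strings
        System.out.println(x.equals(null)); // false, as the contract requires
    }
}
```

Removing the String branch restores symmetry, because a Tag is then only ever equal to another Tag.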

hashCode():

  • Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.
  • If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.
  • It is not required that if two objects are unequal according to the equals(java.lang.Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hash tables.
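The last rule can be checked with two well-known strings, "Aa" and "BB", which are unequal but have the same hash code. The class CollisionDemo and its helper method are made up for this sketch.

```java
import java.util.HashMap;
import java.util.Map;

class CollisionDemo {

    // Stores both keys in a HashMap and reports how many entries it ends up with.
    static int entriesAfterPut(String first, String second) {
        Map<String, Integer> map = new HashMap<String, Integer>();
        map.put(first, 1);
        map.put(second, 2);
        return map.size();
    }

    public static void main(String[] args) {
        String a = "Aa";
        String b = "BB";

        System.out.println(a.hashCode() == b.hashCode()); // true: both are 2112
        System.out.println(a.equals(b));                  // false
        System.out.println(entriesAfterPut(a, b));        // 2: equals() keeps them apart
    }
}
```

The map still works correctly here. Collisions only cost performance, because more keys have to be compared with equals() inside one bucket.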

Key points to remember:

  1. If you override the equals() method, you should override hashCode() as well.
  2. If two objects are equal, they must have the same hash code.
  3. If two objects have the same hash code, they may or may not be equal.
  4. Always use the same attributes in equals() and hashCode(); in our case we have used email.
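Putting the key points together, one possible shape for such a class is sketched below. It uses the helper methods of java.util.Objects for null-safe comparison and hashing. The class Contact and its fields are made up for this example, and making the email field final is a design choice of the sketch: a key whose hash code changes after it has been put in a HashMap can no longer be found.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical class, not the Emp class from this post.
final class Contact {
    private final String email; // immutable, so the hash code cannot change while used as a key
    private final String name;  // deliberately ignored by equals() and hashCode()

    Contact(String email, String name) {
        this.email = email;
        this.name = name;
    }

    String getName() {
        return name;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) {
            return true;
        }
        if (!(obj instanceof Contact)) {
            return false;
        }
        return Objects.equals(email, ((Contact) obj).email); // null-safe comparison
    }

    @Override
    public int hashCode() {
        return Objects.hashCode(email); // same attribute as equals(); 0 for null
    }

    public static void main(String[] args) {
        Map<Contact, String> notes = new HashMap<Contact, String>();
        notes.put(new Contact("a@example.com", "Sreenu"), "first");
        notes.put(new Contact("a@example.com", "SRK"), "second");

        System.out.println(notes.size()); // 1
        System.out.println(notes.get(new Contact("a@example.com", "anyone"))); // second
    }
}
```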