Reducer storing the same object in an ArrayList

Consider a use case: find the maximum temperature for each year in the given data. We also want to list all temperatures exceeding 35 (the years are left out for simplicity).

1901 20
1901 20
1902 30
1902 40
1901 89
1902 23

Let's say the mapper emits (year, temperatureValue) for each of these records.

In the reduce phase we will have 2 input groups, i.e. (1901, [20, 20, 89]) and (1902, [30, 40, 23]).
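To make the grouping concrete, here is a minimal plain-Java sketch that imitates what the shuffle does with the sample records. It uses no Hadoop classes; the names GroupingSketch and groupByYear are made up for this illustration. Real Hadoop makes no promise about the order of values inside a group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; no Hadoop classes are involved.
// GroupingSketch and groupByYear are hypothetical names.
class GroupingSketch {

    // Groups "year temperature" records by year, similar to how each
    // reduce() call receives one key together with all of its values.
    static Map<Integer, List<Integer>> groupByYear(String[] records) {
        Map<Integer, List<Integer>> groups = new TreeMap<>();
        for (String record : records) {
            String[] parts = record.trim().split("\\s+");
            int year = Integer.parseInt(parts[0]);
            int temperature = Integer.parseInt(parts[1]);
            groups.computeIfAbsent(year, y -> new ArrayList<>()).add(temperature);
        }
        return groups;
    }

    public static void main(String[] args) {
        String[] records = {"1901 20", "1901 20", "1902 30", "1902 40", "1901 89", "1902 23"};
        // prints {1901=[20, 20, 89], 1902=[30, 40, 23]}
        System.out.println(groupByYear(records));
    }
}
```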

Let's say we also want to write all the temperatures exceeding 35 at the end of the result file. A sample reducer would be:

import java.io.IOException;
import java.util.ArrayList;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReducer extends Reducer<IntWritable, IntWritable, IntWritable, Writable> {

    // Temperatures above 35, collected across all reduce() calls.
    private final ArrayList<IntWritable> tempList = new ArrayList<IntWritable>();

    @Override
    protected void reduce(IntWritable year, Iterable<IntWritable> tempIt, Context context)
            throws IOException, InterruptedException {
        // Find the max for each year using a for loop, iterator, etc.
        for (IntWritable temp : tempIt) {
            // ...some logic to find the max temperature
            if (temp.get() > 35) { // the year is ignored, for simplicity
                tempList.add(temp); // stores a reference to the object Hadoop hands us
            }
        }
        // Write the max temperature of that year to the context object.
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        for (IntWritable temp : tempList) {
            context.write(temp, NullWritable.get()); // only interested in the temperature
        }
    }
}

Note:- Even though you have done your best to collect every temperature > 35 in tempList, it will not work as expected. The ArrayList will have the expected size, but when you read its elements back and write them to the context, you will see that they all contain the same temperature.

Reason:-

While you iterate over the values, Hadoop uses the same object to hold each upcoming value: the next value is read into that one instance instead of a new object being created. Storing it in an ArrayList (or any other collection) therefore stores the same reference again and again, so every element of the list shows the latest value that was read. Hadoop reuses the object on purpose, to avoid creating one object per value; if a single object were very large (say 10 GB), a list holding a separate object for every value would grow out of control.
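The reuse described above can be imitated without Hadoop. The sketch below is only an illustration under stated assumptions: MutableInt is a hypothetical stand-in for a mutable Writable such as IntWritable, and reusingValues is a made-up helper that hands back one shared instance on every step, overwriting its contents each time.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a mutable Writable such as IntWritable.
class MutableInt {
    private int value;
    void set(int value) { this.value = value; }
    int get() { return value; }
}

class ReuseSketch {

    // Returns an Iterable whose iterator hands back the SAME MutableInt
    // on every step, overwriting its contents with the next raw value.
    static Iterable<MutableInt> reusingValues(int... raw) {
        MutableInt shared = new MutableInt();
        return () -> new Iterator<MutableInt>() {
            private int index = 0;
            @Override public boolean hasNext() { return index < raw.length; }
            @Override public MutableInt next() {
                if (!hasNext()) throw new NoSuchElementException();
                shared.set(raw[index++]);
                return shared;
            }
        };
    }

    // Stores references (the problematic pattern) and reports what the
    // list contains once the iteration has finished.
    static List<Integer> collectByReference(int... raw) {
        List<MutableInt> kept = new ArrayList<>();
        for (MutableInt temp : reusingValues(raw)) {
            if (temp.get() > 35) {
                kept.add(temp); // the same object is added every time
            }
        }
        List<Integer> seen = new ArrayList<>();
        for (MutableInt m : kept) {
            seen.add(m.get());
        }
        return seen;
    }

    public static void main(String[] args) {
        // Two values exceed 35, so the list has size 2, but both entries
        // show the last value read: prints [23, 23]
        System.out.println(collectByReference(40, 89, 23));
    }
}
```

The list size is right (2), yet neither 40 nor 89 survives, which matches the symptom described in the note.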

Solution (Workaround):-

Store a copy of the object in the ArrayList instead of the object itself. For example:

tempList.add(new IntWritable(temp.get())); // instead of tempList.add(temp)

Similarly, to store Text in an ArrayList, use its copy constructor:

myList.add(new Text(name)); //Instead of  myList.add(name);
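The effect of copying can be checked with a small plain-Java sketch. It is an assumption-laden illustration, not Hadoop code: ReusedValue is a hypothetical stand-in for a mutable Writable, and collectByCopy imitates the framework by overwriting one shared object before each check.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a mutable Writable such as IntWritable.
class ReusedValue {
    private int value;
    ReusedValue() { }
    ReusedValue(int value) { this.value = value; }
    void set(int value) { this.value = value; }
    int get() { return value; }
}

class CopySketch {

    // Feeds the raw temperatures through one shared object and stores a
    // fresh copy for every value above the threshold.
    static List<Integer> collectByCopy(int threshold, int... raw) {
        ReusedValue shared = new ReusedValue();
        List<ReusedValue> kept = new ArrayList<>();
        for (int r : raw) {
            shared.set(r); // imitates the framework overwriting its object
            if (shared.get() > threshold) {
                kept.add(new ReusedValue(shared.get())); // independent copy
            }
        }
        List<Integer> seen = new ArrayList<>();
        for (ReusedValue v : kept) {
            seen.add(v.get());
        }
        return seen;
    }

    public static void main(String[] args) {
        // Each stored element keeps its own value: prints [40, 89]
        System.out.println(collectByCopy(35, 40, 89, 23));
    }
}
```

The cost of this approach is one extra object per stored value, so it is best kept for cases where only a small share of the values is retained.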

If you find any bug, please report it to us here.

Prasad!!!
