Posted to user@hbase.apache.org by peterramesh <ra...@gmail.com> on 2009/06/23 15:38:39 UTC

Map Reduce performance

Hi,

I'm playing with a sample program using MapReduce (MR).  All I have is a text
file (685 MB), which I am using to create an HTable.

The testing environment is, 
1. single node cluster
2. 2 GB RAM
3. Hadoop and HBase, both version 0.19.1

Here is the program attached, 
http://www.nabble.com/file/p24166190/MRTest.java MRTest.java 

and the hadoop-site.xml
http://www.nabble.com/file/p24166190/hadoop-site.xml hadoop-site.xml 

and fair scheduler allocation file
http://www.nabble.com/file/p24166190/mapred_fairseheduler_allocation_file.xml
mapred_fairseheduler_allocation_file.xml 
(I used the FairScheduler, since mapred.map.tasks was not being applied in
the cluster instance; the default JobQueueTaskScheduler always runs 2 tasks
at a time.)
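
For anyone reproducing this, enabling the fair scheduler in hadoop-site.xml looks roughly like the following (property names are from the Hadoop fair scheduler documentation; the allocation-file path is a placeholder, not the actual path used here):

```
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<property>
  <name>mapred.fairscheduler.allocation.file</name>
  <value>/path/to/mapred_fairseheduler_allocation_file.xml</value>
</property>
```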

On running the above program with the given configurations, it takes
(13mins, 46sec and 15mins, 3sec respectively - 2 samples) to create the
table.

If I do the same without MR, it takes 18mins, 04sec. So MR gives me a
substantial gain. But I would like to know whether there is a better
optimization to improve the performance, and also whether I am doing this right.

TIA,
Ramesh


-- 
View this message in context: http://www.nabble.com/Map-Reduce-performance-tp24166190p24166190.html
Sent from the HBase User mailing list archive at Nabble.com.


Re: Map Reduce performance

Posted by peterramesh <ra...@gmail.com>.
Hi Eric/Tim,

Thanks for your valuable points.

I have updated the Mapper implementation, removing the HTable instance, as
follows:

public static class InnerMapWithTOF extends MapReduceBase implements
        Mapper<LongWritable, Text, ImmutableBytesWritable, BatchUpdate> {

    public void map(LongWritable key, Text value,
            OutputCollector<ImmutableBytesWritable, BatchUpdate> output,
            Reporter reporter) throws IOException {

        String[] splits = value.toString().split("\t");
        BatchUpdate bu = new BatchUpdate(splits[0]);

        // One put per column. (The original
        // new String(splits[j].getBytes()).getBytes() round-trip was
        // redundant; splits[j].getBytes() yields the same bytes.)
        for (int j = 0; j < HBaseTest.SNP_INFO_COLUMN_NAMES.length; j++) {
            bu.put(HBaseTest.SNP_FAMILY_NAMES[0]
                    + HBaseTest.SNP_INFO_COLUMN_NAMES[j],
                    splits[j].getBytes());
        }

        output.collect(new ImmutableBytesWritable(splits[0].getBytes()), bu);
    }
}

But, in the above code I'm rebuilding the same column name
(HBaseTest.SNP_FAMILY_NAMES[0] + HBaseTest.SNP_INFO_COLUMN_NAMES[j])
for every column of every record.  Is there any way to set it once, in the
JobConf object or somewhere similar?
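
One option (a sketch, not tested against 0.19): since MapReduceBase.configure(JobConf) is called once per task, the fully-qualified column names can be precomputed there instead of being concatenated for every record. The precomputation itself is plain Java; the constant values below are hypothetical stand-ins for those in HBaseTest:

```java
// Sketch: build the family-qualified column names once per task, not per record.
// SNP_FAMILY_NAMES / SNP_INFO_COLUMN_NAMES mirror the HBaseTest constants;
// the values here are made up for illustration.
public class ColumnNameCache {
    static final String[] SNP_FAMILY_NAMES = { "info:" };
    static final String[] SNP_INFO_COLUMN_NAMES = { "chrom", "pos", "allele" };

    // In the real job this would run once in MapReduceBase.configure(JobConf),
    // and map() would index into the cached array.
    static byte[][] buildColumnNames() {
        byte[][] names = new byte[SNP_INFO_COLUMN_NAMES.length][];
        for (int j = 0; j < SNP_INFO_COLUMN_NAMES.length; j++) {
            names[j] = (SNP_FAMILY_NAMES[0] + SNP_INFO_COLUMN_NAMES[j]).getBytes();
        }
        return names;
    }

    public static void main(String[] args) {
        for (byte[] n : buildColumnNames()) {
            System.out.println(new String(n)); // info:chrom, info:pos, info:allele
        }
    }
}
```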

The TableReduce implementation then inserts the records into the HTable, as
follows:

public static class InnerReduceWithTOF extends MapReduceBase implements
        TableReduce<ImmutableBytesWritable, BatchUpdate> {

    public void reduce(ImmutableBytesWritable key,
            Iterator<BatchUpdate> values,
            OutputCollector<ImmutableBytesWritable, BatchUpdate> output,
            Reporter reporter) throws IOException {

        // Identity reduce: pass each BatchUpdate through to the table.
        while (values.hasNext()) {
            output.collect(key, values.next());
        }
    }
}

and here is the configuration:

    JobConf c = new JobConf(getConf(), MapReduceHBaseTest.class);
    c.setJobName("ConfMapReduce2");
    FileInputFormat.setInputPaths(c, new Path("snp.txt"));

    c.setMapperClass(InnerMapWithTOF.class);
    c.setMapOutputKeyClass(ImmutableBytesWritable.class);
    c.setMapOutputValueClass(BatchUpdate.class);

    System.out.println(c.getNumMapTasks());
    System.out.println(c.getNumReduceTasks());

    // initTableReduceJob sets the reducer class, TableOutputFormat, the
    // output table name and the output key/value classes, so setting them
    // again by hand is redundant.
    TableMapReduceUtil.initTableReduceJob("snp", InnerReduceWithTOF.class, c);

    JobClient.runJob(c);


TIA,
Ramesh
-- 


Re: Map Reduce performance

Posted by Erik Holstad <er...@gmail.com>.
Hi Ramesh!
I have to agree with Tim about the size of your cluster; I'm honestly a
little surprised that you are actually seeing MR be faster on a single node,
since there you only get the negative sides of it, setup and so on, but not
the good stuff.
I looked at the code and it looks good: you're not really doing too much in
the job, and it doesn't look like you are doing anything wrong. I do have
some things you can think about, though, when you get a bigger cluster up
and running.
1. You might want to stay away from creating Text objects; we are internally
trying to move away from all usage of Text in HBase and just use
ImmutableBytesWritable or something like that.
2. Getting an HTable is expensive, so you might want to create a pool of
those connections that you can share, so you don't have to get a new one for
every task. I'm not 100% sure about the configure call, but I think it gives
you one per call; might be worth looking into.
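
A minimal sketch of such a pool, built on a plain BlockingQueue (nothing here is HBase-specific; in the real job T would be HTable and create() would do new HTable(conf, tableName)):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of a simple connection pool. Borrowed objects are reused once
// returned, so each task does not have to construct a new one.
public abstract class SimplePool<T> {
    private final BlockingQueue<T> pool = new LinkedBlockingQueue<T>();

    // Subclasses supply the expensive construction (e.g. new HTable(...)).
    protected abstract T create();

    // Reuse an idle object if one is available, otherwise make a new one.
    public T borrow() {
        T t = pool.poll();
        return (t != null) ? t : create();
    }

    // Return the object for later reuse instead of discarding it.
    public void giveBack(T t) {
        pool.offer(t);
    }

    public static void main(String[] args) {
        SimplePool<StringBuilder> p = new SimplePool<StringBuilder>() {
            protected StringBuilder create() { return new StringBuilder(); }
        };
        StringBuilder c1 = p.borrow();   // created fresh
        p.giveBack(c1);
        StringBuilder c2 = p.borrow();   // same instance, reused
        System.out.println(c1 == c2);    // prints true
    }
}
```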

Erik

Re: Map Reduce performance

Posted by tim robertson <ti...@gmail.com>.
Hi Ramesh

I'm not sure it is really meaningful to try to draw conclusions about
performance when running on only one node, as you don't gain any of the
benefits of parallelisation.  You might be better off trying a small cluster
of say 4 nodes on Amazon EC2, then the same with say 8 nodes, and drawing
conclusions about increased cluster size yielding better performance. That
is presumably the proof you are really looking for: that you can grow in
data volume and performance with increased hardware.

I think MR will work much better with more nodes, as you have more clients
doing inserts in parallel onto HBase, so throughput should increase rapidly
as you scale out.

Just my 2 cents...

Tim

