Posted to user@hive.apache.org by Yue Guan <pi...@gmail.com> on 2012/08/01 16:28:55 UTC

mapper is slower than Hive's mapper

Hi, there

I'm writing MapReduce jobs to replace some Hive queries, and I find that my
mapper is slower than Hive's mapper. The Hive query is like:

select sum(column1) from table group by column2, column3;

My MapReduce program looks like this:

     public static class HiveTableMapper
             extends Mapper<BytesWritable, Text, MyKey, DoubleWritable> {

         @Override
         public void map(BytesWritable key, Text value, Context context)
                 throws IOException, InterruptedException {
             String[] sLine = StringUtils.split(value.toString(),
                     StringUtils.ESCAPE_CHAR, HIVE_FIELD_DELIMITER_CHAR);
             context.write(new MyKey(Integer.parseInt(sLine[0]), sLine[1]),
                     new DoubleWritable(Double.parseDouble(sLine[2])));
         }

     }

I assume Hive is doing something similar. Is there any trick in Hive to 
speed this up? Thank you!

Best,


Re: mapper is slower than Hive's mapper

Posted by Bertrand Dechoux <de...@gmail.com>.
If you don't want to manage Hive tables, it doesn't necessarily mean you
need to use vanilla MapReduce.
If your workflow is complex in Hive, it won't be easy to maintain once
everything is implemented directly in MapReduce.
I would recommend looking at libraries such as Cascading (or Crunch, etc.).

Of course, there is a learning curve, but these alternatives are
1) Java APIs
2) less verbose than vanilla MapReduce
3) a way to reuse common patterns (such as map-side aggregation ->
pseudo combiner in Cascading)



-- 
Bertrand Dechoux

Re: mapper is slower than Hive's mapper

Posted by Yue Guan <pi...@gmail.com>.
The story here is that we have a workflow based on Hive queries. It 
takes several stages to get to the final data, and for each stage we have 
a Hive table. We are trying to write the whole workflow in MapReduce. 
Ideally, that will remove all the intermediate stages and take two rounds 
of MapReduce to do the job.

I just tried the buffer-in-mapper approach, and the number of map output 
records now matches Hive's. Thank you.
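For reference, the buffer-in-mapper approach can be sketched outside Hadoop as a plain class with the same lifecycle as a Hadoop Mapper (map() once per record, cleanup() once at the end). The names here are illustrative, not the poster's actual code; in a real job, cleanup(Context) would context.write() each buffered entry:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InMapperCombiner {
    // Partial sums buffered across map() calls, keyed by the group columns.
    private final Map<String, Double> buffer = new HashMap<>();

    // Called once per input record: accumulate instead of emitting.
    public void map(String groupKey, double value) {
        buffer.merge(groupKey, value, Double::sum);
    }

    // Called once when the mapper finishes: one output record per group,
    // which is why the map output record count drops to match Hive's.
    public List<Map.Entry<String, Double>> cleanup() {
        return new ArrayList<>(buffer.entrySet());
    }
}
```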



Re: mapper is slower than Hive's mapper

Posted by Bertrand Dechoux <de...@gmail.com>.
My bad. I wasn't sure; at least I know now. But other solutions may use
other serialization strategies, such as Thrift (which is another
customisation point of Hadoop).

Bertrand




-- 
Bertrand Dechoux

Re: mapper is slower than Hive's mapper

Posted by Edward Capriolo <ed...@gmail.com>.
Hive does not use combiners; it uses map-side aggregation. Hive does
use Writables: sometimes it uses the ones from Hadoop, and sometimes
its own custom Writables for things like timestamps.
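The map-side aggregation Edward describes can be sketched as follows. This is an illustrative stand-in, not Hive's actual code; the field layout mirrors the original mapper (fields 0 and 1 are the group columns, field 2 is the summed value), and the \u0001 separator is Hive's default field delimiter:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapSideAggregation {
    // Accumulate partial sums per group key in an in-memory hash table,
    // in the spirit of Hive's map-side aggregation, and emit one record
    // per group instead of one per input row.
    public static Map<String, Double> partialSums(List<String> lines) {
        Map<String, Double> sums = new HashMap<>();
        for (String line : lines) {
            String[] f = line.split("\u0001");   // Hive's default field delimiter
            String group = f[0] + "\u0001" + f[1];
            sums.merge(group, Double.parseDouble(f[2]), Double::sum);
        }
        return sums;  // each entry would become one map output record
    }
}
```

With millions of input rows but only thousands of distinct groups, the map task emits one record per group rather than one per row, which is consistent with the counters reported elsewhere in the thread.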


Re: mapper is slower than Hive's mapper

Posted by Bertrand Dechoux <de...@gmail.com>.
I am not sure about Hive, but if you look at Cascading, it uses a pseudo
combiner instead of the standard (I mean Hadoop's) combiner.
I guess Hive has a similar strategy.

The point is that when you use a compiler, the compiler does smart things
that you don't need to think about (like loop unwinding).
The result is that your code is still readable but optimized, and in most
cases the compiler will do better than you.

Even your naive implementation of the Mapper (without the Reducer and the
configuration) is more complicated than the whole Hive query.

Like Chuck said, Hive is basically a MapReduce compiler. It is fun to look
at how it works, but it is often best to let the compiler work for you
instead of trying to beat it.

For simple cases, like a 'select', Hive (or any other same-level
alternative) is helpful. And for complex cases, with multiple joins,
you will want something like Hive too, because with the vanilla
MapReduce API it can become quite hard to grasp everything.
Basically, two reasons: faster to express and cheaper to maintain.

One reason not to use Hive is if your approach is more programmatic, for
example if you want to do machine learning, which will require a highly
specific workflow and user-defined functions.

It would be interesting to know your goal: are you trying to benchmark
Hive (and yourself)? Or do you have other reasons?

Bertrand





-- 
Bertrand Dechoux

Re: mapper is slower than Hive's mapper

Posted by Edward Capriolo <ed...@gmail.com>.
As mentioned, if you avoid using new, by reusing objects and possibly
using buffer objects, you may be able to match or beat the speed. But in
the general case, Hive saves you time by letting you not worry
about low-level details like this.
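The reuse pattern described above can be sketched with a stand-in for a mutable Hadoop Writable. The names are illustrative; a real mapper would reuse an org.apache.hadoop.io.DoubleWritable field the same way, which is safe because the framework serializes the value during context.write():

```java
public class ReusingMapper {
    // Minimal stand-in for a mutable Writable such as DoubleWritable.
    public static final class MutableDouble {
        private double value;
        public void set(double v) { this.value = v; }
        public double get() { return value; }
    }

    // Allocated once per mapper instance, not once per input record,
    // so processing N lines creates no per-record garbage for the value.
    private final MutableDouble outValue = new MutableDouble();

    // Parse one line and refill the reused value object; a real map()
    // would follow this with context.write(outKey, outValue).
    public double map(String line) {
        String[] fields = line.split("\u0001");  // Hive's default delimiter
        outValue.set(Double.parseDouble(fields[2]));
        return outValue.get();
    }
}
```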


Re: mapper is slower than Hive's mapper

Posted by Yue Guan <pi...@gmail.com>.
Hive doesn't use Writable?! Could you please give me a pointer to the 
Hive code so I can see how it does the job?

I checked the map output records and found this:

my case:
  total mapper input records:      23091348
  total mapper output records:     23091348
  avg mapper output bytes/record:  34.819994
  total combiner output records:   27298

hive:
  total mapper input records:      23091348
  total mapper output records:     13164
  avg mapper output bytes/record:  36.199407
  total combiner output records:   0

Does Hive actually reduce in the mapper? How does that work?





Re: mapper is slower than Hive's mapper

Posted by Bertrand Dechoux <de...@gmail.com>.
One hint would be to reduce the number of Writable instances you need:
create the object once and reuse it.
By the way, Hive does not use Writable. ;)

Bertrand



-- 
Bertrand Dechoux

RE: mapper is slower than Hive's mapper

Posted by "Connell, Chuck" <Ch...@nuance.com>.
This is actually not surprising. Hive is essentially a MapReduce compiler, and it is common for regular compilers (C, C#, Fortran) to emit faster assembler code than you would write yourself. Compilers know the tricks of their target language.

Chuck Connell
Nuance R&D Data Team
Burlington, MA

