Posted to user@hadoop.apache.org by Michael Segel <ms...@hotmail.com> on 2016/10/19 16:00:13 UTC

Bug in ORC file code? (OrcSerde)?

Hi, 
Since I am not on the ORC mailing list… and since the ORC Java code is in the Hive APIs… this seems like a good place to start. ;-)


So… 

Ran into a little problem… 

One of my developers was writing a map/reduce job to read records from a source and, after some filtering, write the result set to an ORC file. 
There’s an example of how to do this at:
http://hadoopcraft.blogspot.com/2014/07/generating-orc-files-using-mapreduce.html

So far, so good. 
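
For context, the map-only pattern from that post boils down to roughly the sketch below. To be clear, MyRow, the tab-separated parsing and the field names are placeholders I made up for this sketch, not the actual job:

// Sketch of the map-only pattern: the mapper hands OrcSerde's Writable
// straight to the ORC OutputFormat, so no shuffle is involved.
import java.io.IOException;
import org.apache.hadoop.hive.ql.io.orc.OrcSerde;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;

public class OrcWriteMapper extends Mapper<LongWritable, Text, NullWritable, Writable> {

  // Simple POJO describing one output row (placeholder).
  public static class MyRow {
    String name;
    int count;
    MyRow(String name, int count) { this.name = name; this.count = count; }
  }

  private final OrcSerde serde = new OrcSerde();
  private final ObjectInspector inspector =
      ObjectInspectorFactory.getReflectionObjectInspector(
          MyRow.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] parts = value.toString().split("\t");
    if (parts.length < 2) {
      return; // the "filter" step
    }
    MyRow row = new MyRow(parts[0], Integer.parseInt(parts[1]));
    // serialize() returns an OrcSerdeRow; that is fine here because it goes
    // straight to the OutputFormat on the map side.
    context.write(NullWritable.get(), serde.serialize(row, inspector));
  }
}

(I wrote this with a NullWritable key; the blog post may use Text, but the key is not the interesting part here.)
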
But now here’s the problem… a large source data set means many mappers, and with the filter the output rows are only a small fraction of the input. 
So we want to write through a single reducer (an identity reducer) so that we get only a single file. 

Here’s the snag. 

We were using the OrcSerde class to serialize the data and generate an ORC row, which we then wrote to the file. 

Looking at the source code for OrcSerde, OrcSerde.serialize() returns an OrcSerdeRow.
see: http://grepcode.com/file/repo1.maven.org/maven2/co.cask.cdap/hive-exec/0.13.0/org/apache/hadoop/hive/ql/io/orc/OrcSerde.java

OrcSerdeRow implements Writable, and as we can see in the example code… for a map-only job… context.write(Text, Writable) works. 

However… if we attempt to make this into a map/reduce job, we run into a problem at runtime: context.write() throws the following exception:
"Error: java.io.IOException: Type mismatch in value from map: expected org.apache.hadoop.io.Writable, received org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow"


The goal was to send the ORC rows to the reducer and write them out there. 

I’m curious as to why the context.write() fails. 
The error is a bit cryptic, since OrcSerdeRow implements Writable… so the error message doesn’t seem to make sense. 


Now the quick fix is to borrow ArrayListWritable from Giraph, put the list of fields into an ArrayListWritable, and pass that to the reducer, which will then use it to generate the ORC file. 
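
Roughly what that looks like, sketched here with a small ArrayWritable subclass instead of Giraph's ArrayListWritable (same idea: a concrete Writable the shuffle can instantiate). MyRow, the field layout and the parsing are again just placeholders:

import java.io.IOException;
import org.apache.hadoop.hive.ql.io.orc.OrcSerde;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class OrcViaReducer {

  // Shuffle-friendly value type: the subclass gives the framework a no-arg
  // constructor and a fixed element class (the same trick Giraph's
  // ArrayListWritable relies on).
  public static class TextArrayWritable extends ArrayWritable {
    public TextArrayWritable() { super(Text.class); }
    public TextArrayWritable(Text[] values) { super(Text.class, values); }
  }

  // Placeholder POJO for one ORC row.
  public static class MyRow {
    String name;
    int count;
    MyRow(String name, int count) { this.name = name; this.count = count; }
  }

  // The mapper only filters and ships raw fields; no ORC classes here.
  public static class FilterMapper
      extends Mapper<LongWritable, Text, Text, TextArrayWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split("\t");
      if (parts.length < 2) {
        return; // the filter
      }
      context.write(new Text(parts[0]), new TextArrayWritable(
          new Text[] { new Text(parts[0]), new Text(parts[1]) }));
    }
  }

  // The (single) reducer does the ORC build, so OrcSerdeRow never has to
  // survive a shuffle.
  public static class OrcWriteReducer
      extends Reducer<Text, TextArrayWritable, NullWritable, Writable> {
    private final OrcSerde serde = new OrcSerde();
    private final ObjectInspector inspector =
        ObjectInspectorFactory.getReflectionObjectInspector(
            MyRow.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);

    @Override
    protected void reduce(Text key, Iterable<TextArrayWritable> values, Context context)
        throws IOException, InterruptedException {
      for (TextArrayWritable fields : values) {
        String[] f = fields.toStrings();
        MyRow row = new MyRow(f[0], Integer.parseInt(f[1]));
        context.write(NullWritable.get(), serde.serialize(row, inspector));
      }
    }
  }
}

The driver then just wires it up: job.setNumReduceTasks(1), job.setMapOutputKeyClass(Text.class), job.setMapOutputValueClass(TextArrayWritable.class), and OrcNewOutputFormat (NullWritable key, Writable value) as the output format.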

Still trying to figure out why the context.write() fails when sending to a reducer, while it works if it’s a map-side write.

The documentation on the ORC site is… well… to be polite… lacking. ;-) 

I have some ideas about why it doesn’t work, but I would like to confirm my suspicions. 
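
For the record, my main suspicion: the check that throws this isn’t an instanceof test at all. With a reducer in the job, the map output goes through the intermediate (shuffle) serialization path, and if I’m reading Hadoop’s MapTask collector correctly, it compares the runtime class of the value against the exact class configured via job.setMapOutputValueClass(). Something roughly like this (paraphrased, not a verbatim copy of the Hadoop source):

import java.io.IOException;

static void checkValue(Object value, Class<?> valClass) throws IOException {
  // Paraphrase of the exact-class check in the map output collector:
  // merely being a Writable is not enough, it must be the configured class.
  if (value.getClass() != valClass) {
    throw new IOException("Type mismatch in value from map: expected "
        + valClass.getName() + ", received " + value.getClass().getName());
  }
}

So "implements Writable" doesn’t help; the value has to be the configured class itself. And since OrcSerdeRow is a non-public nested class (the OrcSerde$OrcSerdeRow in the message), you can’t even name it in setMapOutputValueClass(). If I recall the Hive source correctly, its write()/readFields() methods just throw UnsupportedOperationException, so it was never meant to cross a shuffle at all, only to be handed directly to the ORC OutputFormat on the map side. Which matches exactly what we’re seeing.
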

Thx

-Mike



Re: Bug in ORC file code? (OrcSerde)?

Posted by Ravi Prakash <ra...@gmail.com>.
Michael!

Although there is a little overlap in the communities, I strongly suggest
you email user@orc.apache.org ( https://orc.apache.org/help/ ). I don't know
whether you have to be subscribed to the mailing list to get replies to your
email address.

Ravi



On Wed, Oct 19, 2016 at 11:29 AM, Michael Segel <ms...@hotmail.com>
wrote:

> Just to follow up…
>
> This appears to be a bug in the Hive version of the code… fixed in the
> ORC library… NOTE: there are two different libraries.
>
> Documentation is a bit lax… but in terms of design…
>
> It’s better to do the build completely in the reducer, which keeps the
> mapper code cleaner.
>


Re: Bug in ORC file code? (OrcSerde)?

Posted by Michael Segel <ms...@hotmail.com>.
Just to follow up… 

This appears to be a bug in the Hive version of the code… fixed in the ORC library… NOTE: there are two different libraries. 

Documentation is a bit lax… but in terms of design… 

It’s better to do the build completely in the reducer, which keeps the mapper code cleaner. 
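
For anyone who trips over the same thing: with the standalone ORC library (org.apache.orc:orc-mapreduce) rather than the hive-exec classes, the reducer-side build looks roughly like the sketch below. The schema and field names are made up for the sketch; the OrcStruct/OrcOutputFormat classes and the OrcConf schema setting are from the ORC MapReduce bindings as I understand them:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.orc.OrcConf;
import org.apache.orc.TypeDescription;
import org.apache.orc.mapred.OrcStruct;
import org.apache.orc.mapreduce.OrcOutputFormat;

public class OrcLibReducer extends Reducer<Text, IntWritable, NullWritable, OrcStruct> {

  private static final TypeDescription SCHEMA =
      TypeDescription.fromString("struct<name:string,count:int>");

  // OrcStruct implements WritableComparable, so it is usable as the
  // reducer output value for OrcOutputFormat.
  private final OrcStruct row = new OrcStruct(SCHEMA);

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // Identity-style pass-through: one ORC row per incoming value.
    for (IntWritable v : values) {
      row.setFieldValue("name", new Text(key.toString()));
      row.setFieldValue("count", new IntWritable(v.get()));
      context.write(NullWritable.get(), row);
    }
  }

  // Relevant driver wiring (sketch): tell ORC the output schema and use a
  // single reducer so we end up with one file.
  static void configure(Job job) {
    OrcConf.MAPRED_OUTPUT_SCHEMA.setString(job.getConfiguration(), SCHEMA.toString());
    job.setOutputFormatClass(OrcOutputFormat.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(OrcStruct.class);
    job.setNumReduceTasks(1);
  }
}

The mapper then only ever emits plain Text/IntWritable (or whatever Writables fit the data), which is what keeps it clean.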


