Posted to user@orc.apache.org by Jeff Evans <je...@gmail.com> on 2018/01/17 17:16:17 UTC

Orc core (Java) dependency on hadoop-common

Hi,

I am a software engineer with StreamSets, and am working on a project
to incorporate ORC support into our product.  The first phase of this
will be to support Avro to ORC conversion. (I saw a post on this topic
to this list a couple months ago, before I joined.  Would be happy to
share more details/code for scrutiny once it's closer to completion.)

One issue I'm running into is the dependency of orc-core on
hadoop-common.  Our product can be deployed in a variety of Hadoop
distributions from different vendors, and also standalone (i.e. not in
Hadoop at all).  Therefore, this dependency makes it difficult for us
to incorporate orc-core in a central way in our codebase (since the
vendor typically provides this jar in their installation).  Besides
that, hadoop-common also brings in a number of other problematic
dependencies for us (the deprecated com.sun.jersey group for Jersey
and zookeeper, to name a couple).

Does anyone have suggestions for how to work around this?  It seems
the only actual classes I reference are the same ones referenced in
the core-java tutorial (org.apache.hadoop.conf.Configuration and
org.apache.hadoop.fs.Path), although obviously the library may be
making use of more itself.  Are there any plans to remove the
dependency on Hadoop down the line, or should I accommodate this by
shuffling our dependencies such that our code only lives in a
Hadoop-provided packaging configuration?  Any insight is appreciated.
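
For reference, our only touch points look roughly like this (a minimal
sketch along the lines of the core-java tutorial; the schema and the
output path are just illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.orc.OrcFile;
    import org.apache.orc.TypeDescription;
    import org.apache.orc.Writer;

    public class OrcTouchPoints {
      public static void main(String[] args) throws Exception {
        // The only two Hadoop classes our own code references directly:
        Configuration conf = new Configuration();
        Path path = new Path("/tmp/example.orc");

        // Everything else goes through the ORC API itself.
        TypeDescription schema =
            TypeDescription.fromString("struct<x:int,y:string>");
        Writer writer = OrcFile.createWriter(path,
            OrcFile.writerOptions(conf).setSchema(schema));
        writer.close();
      }
    }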

Re: Orc core (Java) dependency on hadoop-common

Posted by Owen O'Malley <ow...@gmail.com>.
Which version of ORC are you using? We've been pretty aggressive about
keeping the dependencies down as much as possible. In the upcoming 1.5,
we've made the support more flexible: we define a minimum and a desired
version of Hadoop, and will dynamically use features from the newer
versions when the version you are running supports them.

There have been some users talking about trying to remove it, but it is
pretty complicated.

The hard bits:
* Configuration
* FileSystem
* zlib compression codec
* HDFS (in 1.5 for controlling the variable length blocks via shims)
* KeyProvider (for upcoming column encryption via shims)

You should probably look at the shims module on the master branch,
which was refactored in ORC-234 and ORC-91. We made the non-shims
modules depend only on Hadoop 2.2, while the shims module depends on
Hadoop 2.7. That way we can ensure that the core unit tests run with
Hadoop 2.2 and yet have access to the features that were only added in
Hadoop 2.7+.

So would that level of version flexibility be enough, or do you need more?

.. Owen

Re: Orc core (Java) dependency on hadoop-common

Posted by Jeff Evans <je...@gmail.com>.
To close the loop on this admittedly old thread, we now have some code
that performs this conversion as part of our open source product.  I'm
mentioning it here in case anyone else finds it useful, or has any
feedback on implementation, bugs, invalid assumptions, etc.

This class converts an Avro schema to an ORC schema:
https://github.com/streamsets/datacollector/blob/master/mapreduce-protolib/src/main/java/com/streamsets/pipeline/lib/util/avroorc/AvroToOrcSchemaConverter.java

This class converts an Avro file to an ORC file (using a schema built
using the above):
https://github.com/streamsets/datacollector/blob/master/mapreduce-protolib/src/main/java/com/streamsets/pipeline/lib/util/avroorc/AvroToOrcRecordConverter.java

Both make use of a utility class:
https://github.com/streamsets/datacollector/blob/master/commonlib/src/main/java/com/streamsets/pipeline/lib/util/AvroTypeUtil.java

There are some test cases here:
https://github.com/streamsets/datacollector/tree/master/mapreduce-protolib/src/test/java/com/streamsets/pipeline/lib/util/avroorc
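
For a flavor of the schema mapping, here is a toy sketch using only the
public Avro and ORC APIs (this is not the converter code linked above,
which also handles unions, logical types, defaults, etc.):

    import org.apache.avro.Schema;
    import org.apache.orc.TypeDescription;

    public class TinyAvroToOrcSchema {
      // Map a handful of Avro types to ORC equivalents, recursing into
      // record fields to build a struct.
      static TypeDescription toOrc(Schema avro) {
        switch (avro.getType()) {
          case BOOLEAN: return TypeDescription.createBoolean();
          case INT:     return TypeDescription.createInt();
          case LONG:    return TypeDescription.createLong();
          case DOUBLE:  return TypeDescription.createDouble();
          case STRING:  return TypeDescription.createString();
          case RECORD:
            TypeDescription struct = TypeDescription.createStruct();
            for (Schema.Field field : avro.getFields()) {
              struct.addField(field.name(), toOrc(field.schema()));
            }
            return struct;
          default:
            throw new IllegalArgumentException(
                "Unhandled Avro type: " + avro.getType());
        }
      }
    }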

Thanks for all the info earlier, and any feedback/critiques are welcome!

Re: Orc core (Java) dependency on hadoop-common

Posted by Jeff Evans <je...@gmail.com>.
Thanks, István and Owen!

I appreciate the input.  At the moment, I'm developing against
orc-core 1.4.1.  I think I will go the route of excluding
hadoop-common from the orc-core dependency and explicitly scoping it
as provided.  For modules that will ultimately live in one of our Hadoop
deployments, this should work fine.  Moreover, we already adopt this
sort of packaging strategy in our project, so it wouldn't be too much
of a stretch.  For the "standalone" operation, I will probably just
create a separate module that explicitly declares hadoop-common as a
compile dependency, so those not on Hadoop can simply bring in the
same version that orc itself specifies.
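
Concretely, I'm thinking of something like this on the Maven side (a
sketch; the hadoop-common version shown is just illustrative):

    <dependency>
      <groupId>org.apache.orc</groupId>
      <artifactId>orc-core</artifactId>
      <version>1.4.1</version>
      <exclusions>
        <exclusion>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-common</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
    <!-- Supplied by the Hadoop distribution at runtime -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>2.7.3</version>
      <scope>provided</scope>
    </dependency>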

I think the longer term approach you describe makes good sense,
István.  Unfortunately, given other priorities I wouldn't be able to
devote any time to it in the near future.  As far as Hadoop
versioning, I think the minimum/desired approach outlined in Owen's
message would work fine.

Re: Orc core (Java) dependency on hadoop-common

Posted by István <le...@gmail.com>.
Hi Jeff,

A few months back I was wondering about the same topic. Unfortunately,
dependency management and importing libraries are not the strong suit
of Hadoop-related libraries, and that includes ORC. Our project got to
the point where we considered forking ORC and creating our own version
of it, because we wanted to use it outside Hadoop. Unfortunately,
Hadoop-related code is all over the place, so we decided to just
exclude a bunch of libraries, and we ended up with a pom.xml like this:

https://gist.github.com/l1x/0c00fe69bdcb6db305e0bffae042817c

Keep in mind this is an older version of ORC, the one included in the
Hive 1.2.1 release. I also started working on a project to make dealing
with Hadoop dependencies easier, but we dropped that project altogether.

I think what would be reasonable is to have libraries like ORC at the
bottom of the dependency stack (orc-core), and to create a library that
provides an interface for Hadoop, or for any project that wants to use
this file format (orc-hadoop, orc-something, etc.), so that we don't
have the dependency hell you can see in projects like ORC today. I am
not sure who else is interested in such a project, but if you are, I
think I could contribute some development time.
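
To illustrate the layering I have in mind (names and interfaces are
hypothetical, just to show the idea, not any existing ORC API):

    // orc-core would program against a tiny I/O abstraction of its own...
    interface OrcDataSink {
      java.io.OutputStream create(String path) throws java.io.IOException;
    }

    // ...and a separate orc-hadoop module would adapt Hadoop's
    // FileSystem to it, keeping hadoop-common out of orc-core entirely.
    class HadoopDataSink implements OrcDataSink {
      private final org.apache.hadoop.fs.FileSystem fs;

      HadoopDataSink(org.apache.hadoop.fs.FileSystem fs) { this.fs = fs; }

      @Override
      public java.io.OutputStream create(String path)
          throws java.io.IOException {
        // FileSystem.create returns an FSDataOutputStream, which is a
        // plain java.io.OutputStream as far as orc-core would care.
        return fs.create(new org.apache.hadoop.fs.Path(path));
      }
    }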

Owen was really helpful with the efforts. See more here:
https://issues.apache.org/jira/browse/ORC-151
https://github.com/apache/orc/pull/96

Thanks,
Istvan

-- 
the sun shines for all