Posted to user@hive.apache.org by Nathan Bamford <na...@redpoint.net> on 2014/09/03 02:26:12 UTC

Reading and Writing with Hive 0.13 from a Yarn application

Hi,

  My company has been working on a Yarn application for a couple of years-- we essentially take the place of MapReduce and split our data and processing ourselves.

  One of the things we've been working to support is Hive access, and the HCatalog interfaces and API seemed perfect. Using this information: https://hive.apache.org/javadocs/hcat-r0.5.0/readerwriter.html and TestReaderWriter.java from the source code, I was able to create and use HCatSplits to allow balanced, data-local parallel reading (using the size and locations methods available from each HCatSplit).
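
  For context, the flow I built against 0.12 looked roughly like this (a from-memory sketch following TestReaderWriter.java; the table name and config map are placeholders):

import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hive.hcatalog.data.transfer.DataTransferFactory;
import org.apache.hive.hcatalog.data.transfer.HCatReader;
import org.apache.hive.hcatalog.data.transfer.ReadEntity;
import org.apache.hive.hcatalog.data.transfer.ReaderContext;

public class MasterSidePlan {
  public static void main(String[] args) throws Exception {
    // Master: describe the table and prepare the read.
    ReadEntity entity = new ReadEntity.Builder().withTable("mytbl").build();
    Map<String, String> config = new HashMap<String, String>();
    HCatReader masterReader = DataTransferFactory.getHCatReader(entity, config);
    ReaderContext context = masterReader.prepareRead();

    // 0.12 exposed the splits themselves, so the master could sort and
    // assign them by size and preferred host before shipping them out.
    for (InputSplit split : context.getSplits()) {
      long size = split.getLength();          // bytes in this split
      String[] hosts = split.getLocations();  // data-local candidates
      // ...feed (size, hosts) into our balancing; the slave later calls
      // DataTransferFactory.getHCatReader(split, context.getConf()).read()
    }
  }
}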

  Much to my dismay, 0.13 removes a lot of that functionality. The ReaderContext class is now an interface that only exposes numSplits, whereas all of the other methods are in the inaccessible (package-private) ReaderContextImpl class.

  Since I no longer have access to the actual HCatSplits from the ReaderContext, I am unable to process them and send them to our yarn app on the data-local nodes. My only choice seems to be to partition out the splits to slave nodes more or less at random.
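
  In other words, the most I can see how to do against 0.13 is a blind assignment over split numbers, something like this sketch (table name and slave wiring are placeholders; the read would really run in the slave process):

import java.util.HashMap;
import java.util.Iterator;
import org.apache.hive.hcatalog.data.HCatRecord;
import org.apache.hive.hcatalog.data.transfer.DataTransferFactory;
import org.apache.hive.hcatalog.data.transfer.HCatReader;
import org.apache.hive.hcatalog.data.transfer.ReadEntity;
import org.apache.hive.hcatalog.data.transfer.ReaderContext;

public class BlindAssignment {
  public static void main(String[] args) throws Exception {
    // Master: prepare the read; 0.13 only reports how many splits exist.
    ReadEntity entity = new ReadEntity.Builder().withTable("mytbl").build();
    HCatReader masterReader =
        DataTransferFactory.getHCatReader(entity, new HashMap<String, String>());
    ReaderContext context = masterReader.prepareRead();

    for (int i = 0; i < context.numSplits(); i++) {
      // No locations or sizes here, so split i goes to an arbitrary node.
      // The serialized context plus a split number is all a slave needs:
      HCatReader slaveReader = DataTransferFactory.getHCatReader(context, i);
      Iterator<HCatRecord> records = slaveReader.read();
      while (records.hasNext()) {
        HCatRecord record = records.next();
        // ...process record (on the slave, in the real app)...
      }
    }
  }
}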

  Does anyone know if, as of 0.13, this is the intended way to interface with Hive from non-Hadoop yarn applications? Is the underlying HCatSplit now intended only for internal use?


Thanks,


Nathan Bamford

RE: Reading and Writing with Hive 0.13 from a Yarn application

Posted by Nathan Bamford <na...@redpoint.net>.
That's very helpful, Ashutosh, thank you! I will file the jira.

Re: Reading and Writing with Hive 0.13 from a Yarn application

Posted by Ashutosh Chauhan <ha...@apache.org>.
This API is designed exactly for use cases like yours, so I would say the
API is failing if it cannot service what you are trying to do with it. I
encourage you to keep using this API and to treat the current shortcoming
as a missing feature.
Feel free to file a jira requesting the addition of these methods to
ReaderContext. Patches are welcome too :)

Hope it helps,
Ashutosh


RE: Reading and Writing with Hive 0.13 from a Yarn application

Posted by Nathan Bamford <na...@redpoint.net>.
Hi Ashutosh,

  Thanks for the reply!

  Well, we are a yarn app that essentially does the same things MapReduce does. For regular files in Hadoop, we get the block locations and sizes and perform some internal sorting and load balancing on the master, which then creates the slave yarn apps on individual nodes for reading. We strive for data locality as much as possible.
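
  For concreteness, the per-file information that balancing step consumes comes from the standard Hadoop FileSystem API, roughly like this sketch (the path is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path("/data/input/part-00000"));
    BlockLocation[] blocks =
        fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      long length = block.getLength();    // bytes in this block
      String[] hosts = block.getHosts();  // datanodes holding a replica
      // ...feed (length, hosts) into the master's sorting/load balancing...
    }
  }
}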

  To interface with Hive, the HCatalog API seemed like the appropriate choice. It does a lot of what we want via the ReadEntity, allowing us to query and read Hive tables at a high level.

  I used the readerwriter example (from Hive 0.12) to get things running, treating HCatSplit just like our internal split classes: I retrieved the splits from the ReaderContext, ran them through the same sorting algorithms, then serialized them and sent them to the individual yarn apps, and so on.
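
  Shipping them around was simple because HCatSplit is a Writable; roughly this (a sketch assuming the public no-arg constructor Writables normally have):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import org.apache.hive.hcatalog.mapreduce.HCatSplit;

public class SplitShipping {
  // Master: flatten a split into bytes to send to a slave container.
  static byte[] serialize(HCatSplit split) throws Exception {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    split.write(new DataOutputStream(bytes));
    return bytes.toByteArray();
  }

  // Slave: rebuild the split it was handed before reading from it.
  static HCatSplit deserialize(byte[] payload) throws Exception {
    HCatSplit split = new HCatSplit();  // assumes the no-arg constructor
    split.readFields(new DataInputStream(new ByteArrayInputStream(payload)));
    return split;
  }
}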

  I understand the rationale for the smaller API, which is why I wondered if there's another avenue I should be pursuing as a yarn app (going straight to the metastore rather than through HCatalog, for instance).

  All that being said :), the ability to get the block locations (and sizes, if possible) would certainly solve my problems.


Thanks,


Nathan



Re: Reading and Writing with Hive 0.13 from a Yarn application

Posted by Ashutosh Chauhan <ha...@apache.org>.
Hi Nathan,

This was done in https://issues.apache.org/jira/browse/HIVE-6248. The
reasoning was to minimize the API surface area exposed to users, so that
they are immune to incompatible changes in internal classes; that makes
the API easier to consume without worrying about version upgrades. It
seems that some of the functionality went away in the process.
Which info are you looking for, exactly? Is it a String[]
getBlockLocations(), the equivalent of InputSplit's getLocations()? If so,
we can consider adding that to ReaderContext, since it need not expose any
Hadoop or Hive classes.
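
Something like this, perhaps (a hypothetical shape only, to show the idea):

public interface ReaderContext {
  // Existing: how many splits the prepared read produced.
  int numSplits();

  // Hypothetical addition: preferred hosts for a given split number,
  // mirroring InputSplit.getLocations() without exposing Hadoop types.
  String[] getBlockLocations(int splitNumber);
}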

Thanks,
Ashutosh

