Posted to user@hive.apache.org by Elliot West <te...@gmail.com> on 2016/01/07 13:17:57 UTC

Hive ExIm from on-premise HDP to Amazon EMR

Hello,

Following on from my earlier post concerning syncing Hive data from an
on-premise cluster to the cloud, I've been experimenting with the
IMPORT/EXPORT functionality to move data from an on-premise HDP cluster to
Amazon EMR. I started out with some simple exports and imports, as these are
the core operations on which replication is founded. This worked fine between
on-premise clusters running HDP-2.2.4.


// on cluster 1

EXPORT TABLE my_table PARTITION (year_month='2015-12')
TO '/exports/my_table'
FOR REPLICATION ('1');

// Copy from cluster1:/exports/my_table to cluster2:/staging/my_table

// on cluster 2

IMPORT FROM '/staging/my_table'
LOCATION '/warehouse/my_table';

// Table created, partition created, data relocated to
/warehouse/my_table/year_month=2015-12


I next tried the same approach from HDP-2.2.4 → EMR (4.2.0), like so:

// On-premise HDP-2.2.4
SET hiveconf:hive.exim.uri.scheme.whitelist=hdfs,pfile,s3n;

EXPORT TABLE my_table PARTITION (year_month='2015-12')
TO 's3n://API_KEY:SECRET_KEY@exports-bucket/my_table';

// on EMR
SET hiveconf:hive.exim.uri.scheme.whitelist=hdfs,pfile,s3n;

IMPORT FROM 's3n://exports-bucket/my_table'
LOCATION 's3n://hive-warehouse-bucket/my_table';


The IMPORT behaviour I see is bizarre:

   1. Creates the folder 's3n://hive-warehouse-bucket/my_table'
   2. Copies the part file from
   's3n://exports-bucket/my_table/year_month=2015-12' to
   's3n://exports-bucket/my_table' (i.e. to the parent)
   3. Fails with: "ERROR exec.Task: Failed with exception checkPaths:
   s3n://exports-bucket/my_table has nested directory
   s3n://exports-bucket/my_table/year_month=2015-12"

It is as if it is attempting to set the final partition location to
's3n://exports-bucket/my_table' and not
's3n://hive-warehouse-bucket/my_table/year_month=2015-12' as happens with
HDP → HDP.

I've tried variations (specifying the partition on import, excluding the
location; see the sketch below), all with the same result. Any thoughts or
assistance would be appreciated.
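
For reference, the variations were along these lines (a sketch only; the
exact statements are reconstructed from Hive's documented IMPORT syntax):

// Variation: name the table and partition explicitly on import
IMPORT TABLE my_table PARTITION (year_month='2015-12')
FROM 's3n://exports-bucket/my_table'
LOCATION 's3n://hive-warehouse-bucket/my_table';

// Variation: omit LOCATION and let Hive choose the default warehouse path
IMPORT FROM 's3n://exports-bucket/my_table';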

Thanks - Elliot.

Re: Hive ExIm from on-premise HDP to Amazon EMR

Posted by Elliot West <te...@gmail.com>.
Yes, we do use Falcon, but only a small fraction of the datasets we wish to
replicate are defined in this way. Could I perhaps just declare the feeds in
Falcon and not the processes that create them? Also, doesn't Falcon use Hive
ExIm/Replication to achieve this internally, and might I therefore still
encounter the same bug I am seeing now?

Thanks for your response.

On Sunday, 24 January 2016, Artem Ervits <db...@gmail.com> wrote:

> Have you looked at Apache Falcon?

Re: Hive ExIm from on-premise HDP to Amazon EMR

Posted by Artem Ervits <db...@gmail.com>.
Have you looked at Apache Falcon?

Re: Hive ExIm from on-premise HDP to Amazon EMR

Posted by Elliot West <te...@gmail.com>.
Further investigation appears to show this going wrong in a copy phase of
the plan. The correctly functioning HDFS → HDFS import copy stage looks
like this:

STAGE PLANS:
  Stage: Stage-1
    Copy
      source: hdfs://host:8020/staging/my_table/year_month=2015-12
      destination: hdfs://host:8020/tmp/hive/hadoop/4f155e62-cec1-4b35-95e5-647ab5a74d3d/hive_2016-01-07_17-27-48_864_1838369633925145253-1/-ext-10000


Whereas the S3 → S3 import copy stage shows an unexpected destination,
which was presumably meant to be a temporary location on the source file
system but is in fact simply the parent directory:


STAGE PLANS:
  Stage: Stage-1
    Copy
      source: s3n://exports-bucket/my_table/year_month=2015-12
      destination: s3n://exports-bucket/my_table


These stage plans were obtained using:

EXPLAIN
IMPORT FROM 'source'
LOCATION 'destination';
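
For the failing S3 case above, that was (same paths as earlier):

EXPLAIN
IMPORT FROM 's3n://exports-bucket/my_table'
LOCATION 's3n://hive-warehouse-bucket/my_table';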


I'm beginning to think that this is a bug and not something I can work
around, which is unfortunate as I'm not really in a position to deploy a
fixed version in the short term. That said, if you confirm that this is not
the intended behaviour, I'll raise a JIRA and possibly work on a fix.

Thanks - Elliot.



Re: Hive ExIm from on-premise HDP to Amazon EMR

Posted by Elliot West <te...@gmail.com>.
More information: this works if I move the export into EMR's HDFS and then
import from there to a new location in HDFS (a sketch of that working path
follows the list below). It does not work across FileSystems:

   - Import from S3 → EMR HDFS (fails in a similar manner to S3 → S3)
   - Import from EMR HDFS → S3 (complains that HDFS FileSystem was expected
   as the destination. Presumably the same FileSystem instance is used for
   the source and destination).
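
A sketch of the working HDFS → HDFS path on EMR, assuming the export has
first been copied from S3 into the cluster's HDFS (e.g. with hadoop distcp;
the staging paths here are illustrative):

// on EMR, after staging the export in HDFS

IMPORT FROM '/staging/my_table'
LOCATION '/warehouse/my_table';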


