Posted to user@gobblin.apache.org by Rohit Kalhans <ro...@gmail.com> on 2018/03/14 20:47:04 UTC

Queries regarding Hive DISTCP

Hi,
I have a couple of queries regarding Gobblin Hive DISTCP.

1. We have a use case with a source cluster A and a destination
cluster B. We are using Hive-to-Hive replication with Gobblin. The
destination cluster B sees a lot of long-running queries.
Let's say a Hive query has calculated its splits and started the MapReduce
job, but Hive DISTCP runs in between and updates those partitions
(e.g. deletes partitions). I believe that if the old files are removed
from HDFS, we would get a split-not-found error, and the query would
fail. I was wondering whether we would face this issue with Gobblin's
atomic DISTCP.

2. We have a use case where we need different partitioning in the source
and destination, e.g. the source has date/hour/minute partitions, but in
the destination we need only date partitions. Is there a way to achieve
this?

-- 
Cheerio!

*Rohit*

Re: Queries regarding Hive DISTCP

Posted by Abhishek Tiwari <ab...@apache.org>.

Hi Rohit,

Replies inline.

Abhishek

On Wed, Mar 14, 2018 at 1:47 PM, Rohit Kalhans <ro...@gmail.com>
wrote:

>
> Hi,
> I have a couple of queries regarding Gobblin Hive DISTCP.
>
> 1. We have a use case with a source cluster A and a destination
> cluster B. We are using Hive-to-Hive replication with Gobblin. The
> destination cluster B sees a lot of long-running queries.
> Let's say a Hive query has calculated its splits and started the MapReduce
> job, but Hive DISTCP runs in between and updates those partitions
> (e.g. deletes partitions). I believe that if the old files are removed
> from HDFS, we would get a split-not-found error, and the query would
> fail. I was wondering whether we would face this issue with Gobblin's
> atomic DISTCP.
>
Yes, you will still end up with failed queries, because if the underlying
files change, the splits will not resolve. So, atomic move or not, you need
to keep the older data around for queries not to fail. We solve this in a
few of our pipelines (non-distcp) by keeping versioned data around and
cleaning it up with a k-latest retention policy. However, in your model, if
the changes are too frequent, that means frequent full copies, which might
be very inefficient.
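
The versioned-data idea above can be illustrated with a minimal sketch. It
is not Gobblin's retention implementation: the function names, the
directory layout (one directory per version, named so that lexical order
matches age) and the use of a local filesystem instead of HDFS are all
assumptions made for illustration. Each copy lands in a new version
directory, so files that running queries already resolved splits against
stay in place until they fall outside the k newest versions.

```python
import shutil
import tempfile
from pathlib import Path


def publish_version(table_root: Path, version: str) -> Path:
    """Create a new version directory; existing versions are left untouched."""
    version_dir = table_root / version
    version_dir.mkdir(parents=True, exist_ok=False)
    return version_dir


def prune_versions(table_root: Path, keep: int) -> list[str]:
    """Delete all but the `keep` newest versions; return the deleted names.

    Assumes version names sort lexically from oldest to newest
    (hypothetical naming scheme, e.g. yyyyMMdd timestamps).
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    versions = sorted(p.name for p in table_root.iterdir() if p.is_dir())
    stale = versions[:-keep]
    for name in stale:
        shutil.rmtree(table_root / name)
    return stale


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "events"
        for v in ["20180310", "20180311", "20180312", "20180313", "20180314"]:
            publish_version(root, v)
        print("deleted:", prune_versions(root, keep=2))
        print("kept:", sorted(p.name for p in root.iterdir()))
```

The value of k would have to cover the longest-running query: a query that
started against a version must finish before that version is pruned.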

>
> 2. We have a use case where we need different partitioning in the source
> and destination, e.g. the source has date/hour/minute partitions, but in
> the destination we need only date partitions. Is there a way to achieve
> this?
>
Assuming you can differentiate between the two using a regex, you can use
the whitelist configs. If not, I think you will need an extension of
HiveSource that also inspects the underlying path and/or metadata (Hive
table props) to check that.
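
As a rough illustration of regex-based selection, the sketch below filters
names against a whitelist pattern. It does not use Gobblin's whitelist
config syntax or the HiveSource API: the function name, the example
partition names and the pattern are hypothetical.

```python
import re


def select_by_whitelist(names, pattern):
    """Return the names that match the whitelist regex in full."""
    compiled = re.compile(pattern)
    return [name for name in names if compiled.fullmatch(name)]


if __name__ == "__main__":
    # Hypothetical partition names: one date-only, one date/hour/minute.
    partitions = [
        "date=2018-03-14",
        "date=2018-03-14/hour=20/minute=47",
    ]
    date_only = r"date=\d{4}-\d{2}-\d{2}"
    print(select_by_whitelist(partitions, date_only))
```

A full match is used rather than a prefix match so that a date-only
pattern does not also select the finer-grained hour/minute partitions.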

>
> --
> Cheerio!
>
> *Rohit*
>