Posted to users@nifi.apache.org by Naga Vijay <na...@gmail.com> on 2015/12/02 04:36:04 UTC

DistCp from Amazon S3 to HDFS

Hello,

Is there a processor to DistCp from Amazon S3 to HDFS, or do I need to
write a processor for it?

Thanks
Naga

Re: DistCp from Amazon S3 to HDFS

Posted by Joe Witt <jo...@gmail.com>.
Naga,

I like your idea/flow very much.  We should definitely put this up as
an example template with documentation on why/how it works.

Joe

On Thu, Dec 3, 2015 at 11:33 AM, Naga Vijay <na...@gmail.com> wrote:
> Mark, JoeS, JoeW,
>
> I have gone through Mark's comment in
> https://issues.apache.org/jira/browse/NIFI-25 and tend to agree ... I am
> also trying to see how AWS Lambda can fit into the picture ...
>
> --
>
> I'm not sure about the ListS3. I can definitely see the value of it.
> However, it requires that the processor maintain a significant amount of
> state about what it has seen. This is not cluster friendly at all. It also
> requires continually pulling a potentially huge listing to see if anything
> has changed.
>
> I think we should instead push users to configure S3 to add a notification
> to SQS when a new object is placed in an S3 bucket. We can then have a
> GetSQS processor to detect that an item was added and then fetch the
> contents via GetS3/FetchS3/RetrieveS3. This is a much more scalable approach
> and handles backpressure well.
>
> --
>
> I notice https://issues.apache.org/jira/browse/NIFI-840 (Create ListS3
> processor) has been around for some time.  Let me know your thoughts on when
> we can have ListS3 and/or if any help is needed.
>
> Naga Vijayapuram
>
>
> On Wed, Dec 2, 2015 at 12:31 PM, Naga Vijay <na...@gmail.com> wrote:
>>
>> Mark,
>>
>> Thanks for the pointer on SQS.
>>
>> I am thinking that it would help in having a higher level processor for
>> distcp to cover both HDFS and S3 as source/sink.
>>
>> Naga Vijayapuram
>>
>>
>> On Wed, Dec 2, 2015 at 9:48 AM, Mark Payne <ma...@hotmail.com> wrote:
>>>
>>> We certainly can do the reverse case - sync S3 with HDFS. With S3, as Joe
>>> S mentioned, we really should have a ListS3
>>> but currently do not (We do have a ListHDFS though). Typically the use
>>> case that I've used with S3 is to set up S3 to notify
>>> when an object arrives via SQS. Then have GetSQS get that notification
>>> and then pull the data via FetchS3Object.
>>> So you could fairly easily set up a GetSQS -> EvaluateJsonPath ->
>>> FetchS3Object -> PutHDFS. That would require that SQS be set up though to
>>> notify you when new objects arrive.
>>>
>>> On Dec 2, 2015, at 12:24 PM, Naga Vijay <na...@gmail.com> wrote:
>>>
>>> Joe Witt & Joe Skora,
>>>
>>> Thanks for thinking about this.  Yes, it would serve as a great
>>> example/template (as would the reverse case).
>>>
>>> Naga Vijayapuram
>>>
>>>
>>> On Tue, Dec 1, 2015 at 11:05 PM, Joe Skora <js...@gmail.com> wrote:
>>>>
>>>> @JoeW,
>>>>
>>>> It looks like we need to add a ListS3 processor in addition to the
>>>> Multipart Upload management that I'm looking into now.  Extending
>>>> ListFileTransfer for S3 shouldn't be too bad.
>>>>
>>>> JoeS
>>>>
>>>> On Wed, Dec 2, 2015 at 12:04 AM, Joe Witt <jo...@gmail.com> wrote:
>>>>>
>>>>> Hello
>>>>>
>>>>> So we have FetchS3 and PutHDFS and a series of interesting in between
>>>>> processes to help.  So that would get you most of the way there.  How
>>>>> to get the listing/know what to pull from S3?  That part I'm not sure
>>>>> about.
>>>>>
>>>>> This would make for a great example/template for us to post (as would
>>>>> the reverse case).
>>>>>
>>>>> Thanks
>>>>> Joe
>>>>>
>>>>> On Tue, Dec 1, 2015 at 10:36 PM, Naga Vijay <na...@gmail.com> wrote:
>>>>> > Hello,
>>>>> >
>>>>> > Is there a processor to DistCp from Amazon S3 to HDFS, or do I need
>>>>> > to write
>>>>> > a processor for it?
>>>>> >
>>>>> > Thanks
>>>>> > Naga
>>>>
>>>>
>>>
>>>
>>
>

Re: DistCp from Amazon S3 to HDFS

Posted by Naga Vijay <na...@gmail.com>.
Mark, JoeS, JoeW,

I have gone through Mark's comment in
https://issues.apache.org/jira/browse/NIFI-25 and tend to agree ... I am
also trying to see how AWS Lambda can fit into the picture ...

--

I'm not sure about the ListS3. I can definitely see the value of it.
However, it requires that the processor maintain a significant amount of
state about what it has seen. This is not cluster friendly at all. It also
requires continually pulling a potentially huge listing to see if anything
has changed.

I think we should instead push users to configure S3 to add a notification
to SQS when a new object is placed in an S3 bucket. We can then have a
GetSQS processor to detect that an item was added and then fetch the
contents via GetS3/FetchS3/RetrieveS3. This is a much more scalable
approach and handles backpressure well.

--
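
To make the state concern in the comment above concrete, here is a minimal Python sketch of what a polling lister has to do. It is not NiFi code: `poll_new_keys`, the `seen` set, and the key-to-ETag dictionary standing in for a bucket listing are all hypothetical names chosen for illustration.

```python
from typing import Dict, List, Set, Tuple


def poll_new_keys(listing: Dict[str, str], seen: Set[Tuple[str, str]]) -> List[str]:
    """Compare a full listing (key -> ETag) against everything remembered so far.

    Hypothetical helper: every poll walks the whole listing, and `seen`
    grows with the number of objects ever observed.
    """
    new_keys = []
    for key, etag in sorted(listing.items()):
        if (key, etag) not in seen:
            seen.add((key, etag))
            new_keys.append(key)
    return new_keys


# State that must survive restarts and be shared by every node in a cluster.
seen: Set[Tuple[str, str]] = set()

first = poll_new_keys({"a.csv": "e1", "b.csv": "e2"}, seen)
second = poll_new_keys({"a.csv": "e1", "b.csv": "e2", "c.csv": "e3"}, seen)

print(first)   # ['a.csv', 'b.csv']
print(second)  # ['c.csv']
```

The second poll still has to read all three entries to discover one new object, which is the cost the notification-based approach avoids.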

I notice https://issues.apache.org/jira/browse/NIFI-840 (Create ListS3
processor) has been around for some time.  Let me know your thoughts on when
we can have ListS3 and/or if any help is needed.

Naga Vijayapuram


On Wed, Dec 2, 2015 at 12:31 PM, Naga Vijay <na...@gmail.com> wrote:

> Mark,
>
> Thanks for the pointer on SQS.
>
> I am thinking that it would help in having a higher level processor for
> distcp to cover both HDFS and S3 as source/sink.
>
> Naga Vijayapuram
>
>
> On Wed, Dec 2, 2015 at 9:48 AM, Mark Payne <ma...@hotmail.com> wrote:
>
>> We certainly can do the reverse case - sync S3 with HDFS. With S3, as Joe
>> S mentioned, we really should have a ListS3
>> but currently do not (We do have a ListHDFS though). Typically the use
>> case that I've used with S3 is to set up S3 to notify
>> when an object arrives via SQS. Then have GetSQS get that notification
>> and then pull the data via FetchS3Object.
>> So you could fairly easily set up a GetSQS -> EvaluateJsonPath ->
>> FetchS3Object -> PutHDFS. That would require that SQS be set up though to
>> notify you when new objects arrive.
>>
>> On Dec 2, 2015, at 12:24 PM, Naga Vijay <na...@gmail.com> wrote:
>>
>> Joe Witt & Joe Skora,
>>
>> Thanks for thinking about this.  Yes, it would serve as a great
>> example/template (as would the reverse case).
>>
>> Naga Vijayapuram
>>
>>
>> On Tue, Dec 1, 2015 at 11:05 PM, Joe Skora <js...@gmail.com> wrote:
>>
>>> @JoeW,
>>>
>>> It looks like we need to add a ListS3 processor in addition to the
>>> Multipart Upload management that I'm looking into now.  Extending
>>> ListFileTransfer for S3 shouldn't be too bad.
>>>
>>> JoeS
>>>
>>> On Wed, Dec 2, 2015 at 12:04 AM, Joe Witt <jo...@gmail.com> wrote:
>>>
>>>> Hello
>>>>
>>>> So we have FetchS3 and PutHDFS and a series of interesting in between
>>>> processes to help.  So that would get you most of the way there.  How
>>>> to get the listing/know what to pull from S3?  That part I'm not sure
>>>> about.
>>>>
>>>> This would make for a great example/template for us to post (as would
>>>> the reverse case).
>>>>
>>>> Thanks
>>>> Joe
>>>>
>>>> On Tue, Dec 1, 2015 at 10:36 PM, Naga Vijay <na...@gmail.com> wrote:
>>>> > Hello,
>>>> >
>>>> > Is there a processor to DistCp from Amazon S3 to HDFS, or do I need
>>>> > to write
>>>> > a processor for it?
>>>> >
>>>> > Thanks
>>>> > Naga
>>>>
>>>
>>>
>>
>>
>

Re: DistCp from Amazon S3 to HDFS

Posted by Naga Vijay <na...@gmail.com>.
Mark,

Thanks for the pointer on SQS.

I am thinking that it would help to have a higher-level processor for
distcp that covers both HDFS and S3 as source/sink.

Naga Vijayapuram


On Wed, Dec 2, 2015 at 9:48 AM, Mark Payne <ma...@hotmail.com> wrote:

> We certainly can do the reverse case - sync S3 with HDFS. With S3, as Joe
> S mentioned, we really should have a ListS3
> but currently do not (We do have a ListHDFS though). Typically the use
> case that I've used with S3 is to set up S3 to notify
> when an object arrives via SQS. Then have GetSQS get that notification and
> then pull the data via FetchS3Object.
> So you could fairly easily set up a GetSQS -> EvaluateJsonPath ->
> FetchS3Object -> PutHDFS. That would require that SQS be set up though to
> notify you when new objects arrive.
>
> On Dec 2, 2015, at 12:24 PM, Naga Vijay <na...@gmail.com> wrote:
>
> Joe Witt & Joe Skora,
>
> Thanks for thinking about this.  Yes, it would serve as a great
> example/template (as would the reverse case).
>
> Naga Vijayapuram
>
>
> On Tue, Dec 1, 2015 at 11:05 PM, Joe Skora <js...@gmail.com> wrote:
>
>> @JoeW,
>>
>> It looks like we need to add a ListS3 processor in addition to the
>> Multipart Upload management that I'm looking into now.  Extending
>> ListFileTransfer for S3 shouldn't be too bad.
>>
>> JoeS
>>
>> On Wed, Dec 2, 2015 at 12:04 AM, Joe Witt <jo...@gmail.com> wrote:
>>
>>> Hello
>>>
>>> So we have FetchS3 and PutHDFS and a series of interesting in between
>>> processes to help.  So that would get you most of the way there.  How
>>> to get the listing/know what to pull from S3?  That part I'm not sure
>>> about.
>>>
>>> This would make for a great example/template for us to post (as would
>>> the reverse case).
>>>
>>> Thanks
>>> Joe
>>>
>>> On Tue, Dec 1, 2015 at 10:36 PM, Naga Vijay <na...@gmail.com> wrote:
>>> > Hello,
>>> >
>>> > Is there a processor to DistCp from Amazon S3 to HDFS, or do I need to
>>> > write
>>> > a processor for it?
>>> >
>>> > Thanks
>>> > Naga
>>>
>>
>>
>
>

Re: DistCp from Amazon S3 to HDFS

Posted by Mark Payne <ma...@hotmail.com>.
We certainly can do the reverse case - sync S3 with HDFS. With S3, as Joe S mentioned, we really should have a ListS3
but currently do not (we do have a ListHDFS, though). The approach I've typically used with S3 is to set up S3 to notify
via SQS when an object arrives, then have GetSQS receive that notification and pull the data via FetchS3Object.
So you could fairly easily set up a GetSQS -> EvaluateJsonPath -> FetchS3Object -> PutHDFS flow. That does require that SQS be set up to
notify you when new objects arrive.
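
As a rough illustration of the EvaluateJsonPath step in that flow, the Python sketch below pulls the bucket name and object key out of an S3 event notification body, which is the information FetchS3Object would need. It assumes S3 publishes directly to SQS, so the message body is the notification JSON itself. The function name `extract_s3_objects` and the sample bucket and key are made up for the example; in NiFi the equivalent would be JSONPath expressions along the lines of `$.Records[0].s3.bucket.name` and `$.Records[0].s3.object.key`.

```python
import json
from typing import List, Tuple
from urllib.parse import unquote_plus


def extract_s3_objects(message_body: str) -> List[Tuple[str, str]]:
    """Return (bucket, key) pairs from one S3 event notification body."""
    event = json.loads(message_body)
    objects = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications (space becomes '+').
        key = unquote_plus(s3["object"]["key"])
        objects.append((bucket, key))
    return objects


# Hypothetical message body, trimmed to the fields used here.
sample = json.dumps({
    "Records": [{
        "eventName": "ObjectCreated:Put",
        "s3": {
            "bucket": {"name": "example-bucket"},
            "object": {"key": "incoming/my+file.csv", "size": 1024},
        },
    }]
})

print(extract_s3_objects(sample))  # [('example-bucket', 'incoming/my file.csv')]
```

If the queue is fed through SNS instead, the notification is wrapped in an outer envelope and would need one more level of unwrapping first.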

> On Dec 2, 2015, at 12:24 PM, Naga Vijay <na...@gmail.com> wrote:
> 
> Joe Witt & Joe Skora,
> 
> Thanks for thinking about this.  Yes, it would serve as a great example/template (as would the reverse case).
> 
> Naga Vijayapuram
> 
> 
> On Tue, Dec 1, 2015 at 11:05 PM, Joe Skora <js...@gmail.com> wrote:
> @JoeW,
> 
> It looks like we need to add a ListS3 processor in addition to the Multipart Upload management that I'm looking into now.  Extending ListFileTransfer for S3 shouldn't be too bad.
> 
> JoeS
> 
> On Wed, Dec 2, 2015 at 12:04 AM, Joe Witt <jo...@gmail.com> wrote:
> Hello
> 
> So we have FetchS3 and PutHDFS and a series of interesting in between
> processes to help.  So that would get you most of the way there.  How
> to get the listing/know what to pull from S3?  That part I'm not sure
> about.
> 
> This would make for a great example/template for us to post (as would
> the reverse case).
> 
> Thanks
> Joe
> 
> On Tue, Dec 1, 2015 at 10:36 PM, Naga Vijay <na...@gmail.com> wrote:
> > Hello,
> >
> > Is there a processor to DistCp from Amazon S3 to HDFS, or do I need to write
> > a processor for it?
> >
> > Thanks
> > Naga
> 
> 


Re: DistCp from Amazon S3 to HDFS

Posted by Naga Vijay <na...@gmail.com>.
Joe Witt & Joe Skora,

Thanks for thinking about this.  Yes, it would serve as a great
example/template (as would the reverse case).

Naga Vijayapuram


On Tue, Dec 1, 2015 at 11:05 PM, Joe Skora <js...@gmail.com> wrote:

> @JoeW,
>
> It looks like we need to add a ListS3 processor in addition to the
> Multipart Upload management that I'm looking into now.  Extending
> ListFileTransfer for S3 shouldn't be too bad.
>
> JoeS
>
> On Wed, Dec 2, 2015 at 12:04 AM, Joe Witt <jo...@gmail.com> wrote:
>
>> Hello
>>
>> So we have FetchS3 and PutHDFS and a series of interesting in between
>> processes to help.  So that would get you most of the way there.  How
>> to get the listing/know what to pull from S3?  That part I'm not sure
>> about.
>>
>> This would make for a great example/template for us to post (as would
>> the reverse case).
>>
>> Thanks
>> Joe
>>
>> On Tue, Dec 1, 2015 at 10:36 PM, Naga Vijay <na...@gmail.com> wrote:
>> > Hello,
>> >
>> > Is there a processor to DistCp from Amazon S3 to HDFS, or do I need to
>> > write
>> > a processor for it?
>> >
>> > Thanks
>> > Naga
>>
>
>

Re: DistCp from Amazon S3 to HDFS

Posted by Joe Skora <js...@gmail.com>.
@JoeW,

It looks like we need to add a ListS3 processor in addition to the
Multipart Upload management that I'm looking into now.  Extending
ListFileTransfer for S3 shouldn't be too bad.

JoeS

On Wed, Dec 2, 2015 at 12:04 AM, Joe Witt <jo...@gmail.com> wrote:

> Hello
>
> So we have FetchS3 and PutHDFS and a series of interesting in between
> processes to help.  So that would get you most of the way there.  How
> to get the listing/know what to pull from S3?  That part I'm not sure
> about.
>
> This would make for a great example/template for us to post (as would
> the reverse case).
>
> Thanks
> Joe
>
> On Tue, Dec 1, 2015 at 10:36 PM, Naga Vijay <na...@gmail.com> wrote:
> > Hello,
> >
> > Is there a processor to DistCp from Amazon S3 to HDFS, or do I need to
> > write
> > a processor for it?
> >
> > Thanks
> > Naga
>

Re: DistCp from Amazon S3 to HDFS

Posted by Joe Witt <jo...@gmail.com>.
Hello

So we have FetchS3 and PutHDFS and a series of interesting in-between
processes to help.  So that would get you most of the way there.  How
to get the listing/know what to pull from S3?  That part I'm not sure
about.

This would make for a great example/template for us to post (as would
the reverse case).

Thanks
Joe

On Tue, Dec 1, 2015 at 10:36 PM, Naga Vijay <na...@gmail.com> wrote:
> Hello,
>
> Is there a processor to DistCp from Amazon S3 to HDFS, or do I need to write
> a processor for it?
>
> Thanks
> Naga