Posted to users@nifi.apache.org by James McMahon <js...@gmail.com> on 2017/03/17 17:31:35 UTC

How to reject S3 Writes if folder does not exist?

Good afternoon. In my workflow I build an S3 output target from metadata
attributes. The vast majority of the time the output target exists, so in
my PutS3Object processor I set Object Key to ${outputTarget}/${filename}
and my file is written to the right place in my S3 bucket.

On rare occasions the output target may not exist. Perhaps someone made an
HTTP request with malformed incoming attributes. In this case I want to
feed the Failure from my first PutS3Object to a second PutS3Object that
hard-wires the Object Key to an existing for_review folder in my bucket.

My problem is that the first PutS3Object appears to force the creation of
the malformed outputTarget named folder, and I can't get the error case to
cascade to the second S3 output processor. Is there a means to do this? Is
there a processor I can use prior to the S3 output processor to check for
the existence of the S3 folder, and output to either outputTarget (if it
exists), or to for_review (if it does not)?

Thanks in advance for your help. -Jim

Re: How to reject S3 Writes if folder does not exist?

Posted by Joe Skora <js...@gmail.com>.
Just to clarify, PutS3Object does not force the creation of a directory; it
just uploads to the requested "Object Key". Technically, there are no
"directories" in S3. It is a flat object store in which buckets hold
objects. This is why PutS3Object has properties for "Bucket" and "Object
Key" but not for a path or directory.

The notion of a hierarchical directory structure is superimposed by the S3
web GUI (and possibly other tools) such that a directory "projectX/" will
be shown if there is any "Object Key" stored that equals or begins with
"projectX/", such as "projectX/file1.txt".  In fact, if
"projectX/file1.txt" exists as a text file it is still possible to upload
another document as "projectX/file1.txt/logo.png" even though that violates
the rules of most hierarchical file systems since that implies a directory
and file with the same path.
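
As a rough illustration of the paragraph above, here is a minimal Python
sketch in which a plain set stands in for a bucket's flat key namespace. It
is not NiFi or AWS code. The helper names list_folder and folder_appears and
the keys other than the two "projectX/" examples are made up for this sketch.

```python
# Illustration only: a plain set stands in for the flat key namespace of a bucket.
keys = {
    "projectX/file1.txt",
    "projectX/file1.txt/logo.png",  # allowed: keys are just strings, not paths
    "projectY/report.csv",          # made-up key for the example
}


def list_folder(all_keys, prefix):
    """Return the keys a folder view would show under 'prefix' (a prefix match)."""
    return sorted(k for k in all_keys if k.startswith(prefix))


def folder_appears(all_keys, folder):
    """A 'folder' shows up as soon as any key equals or begins with it."""
    return any(k.startswith(folder) for k in all_keys)


print(list_folder(keys, "projectX/"))
print(folder_appears(keys, "projectZ/"))  # nothing stored under this prefix yet
keys.add("projectZ/typo.txt")             # storing a key never needs a folder first
print(folder_appears(keys, "projectZ/"))  # the "folder" now appears
```

In this model a write under a mistyped prefix simply succeeds, which matches
the behaviour described in the original question.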

Adding logic to confirm the existence of the directory structure would
create artificial constraints not required by S3, add complexity, and
require S3 requests and possibly state storage that are otherwise not
needed to store the object.

I hope that helps.

Regards,
Joe

On Fri, Mar 17, 2017 at 9:55 PM, James McMahon <js...@gmail.com> wrote:

> Thank you Adam and James. This has been very helpful, and gives me a
> number of options to explore. I am all set, thanks again for your help! -Jim
>
> On Fri, Mar 17, 2017 at 5:33 PM, Adam Lamar <ad...@gmail.com> wrote:
>
>> Jim,
>>
>> Absolutely that's one way. Depending on how many directories you have,
>> you can also do it directly with RouteOnAttribute and the expression
>> language:
>>
>> Property name: s3exists
>> Property value: ${outputTarget:equals('foo'):or(${outputTarget:equals('bar')})}
>>
>> Then route the s3exists relationship to PutS3Object.
>>
>> The python script strategy you mentioned may be good for a small to
>> medium number of directories.
>>
>> The ListS3 strategy mentioned by James might be a better fit if the list
>> is too large to easily maintain by hand.
>>
>> Hope that helps,
>> Adam
>>
>>
>> On Fri, Mar 17, 2017 at 3:07 PM, James McMahon <js...@gmail.com>
>> wrote:
>>
>>> So keep my list in a python script dictionary called by an ExecuteScript
>>> processor, and toss my outputTarget value against that. Set a new attribute
>>> s3exists to true or false in my script based on that result, and then use
>>> RouteAttribute to direct the output. Is that what you have in mind? -Jim
>>>
>>> On Fri, Mar 17, 2017 at 4:59 PM, Adam Lamar <ad...@gmail.com>
>>> wrote:
>>>
>>>> Jim,
>>>>
>>>> Also keep in mind that as an object store, S3 uses "directories" only
>>>> as a grouping concept, and not as a hierarchal storage mechanism. That's
>>>> why the initial PutS3Object doesn't fail with a new "directory". See
>>>> http://docs.aws.amazon.com/AmazonS3/latest/UG/FolderOperations.html
>>>>
>>>> I think James' advice is spot on - to accomplish what you need, you'll
>>>> likely want to keep a list of known outputTargets in NiFi.
>>>>
>>>> Cheers,
>>>> Adam
>>>>
>>>
>>>
>>
>

Re: How to reject S3 Writes if folder does not exist?

Posted by James McMahon <js...@gmail.com>.
Thank you Adam and James. This has been very helpful, and gives me a number
of options to explore. I am all set, thanks again for your help! -Jim

On Fri, Mar 17, 2017 at 5:33 PM, Adam Lamar <ad...@gmail.com> wrote:

> Jim,
>
> Absolutely that's one way. Depending on how many directories you have, you
> can also do it directly with RouteOnAttribute and the expression language:
>
> Property name: s3exists
> Property value: ${outputTarget:equals('foo'):or(${outputTarget:equals('bar')})}
>
> Then route the s3exists relationship to PutS3Object.
>
> The python script strategy you mentioned may be good for a small to medium
> number of directories.
>
> The ListS3 strategy mentioned by James might be a better fit if the list
> is too large to easily maintain by hand.
>
> Hope that helps,
> Adam
>
>
> On Fri, Mar 17, 2017 at 3:07 PM, James McMahon <js...@gmail.com>
> wrote:
>
>> So keep my list in a python script dictionary called by an ExecuteScript
>> processor, and toss my outputTarget value against that. Set a new attribute
>> s3exists to true or false in my script based on that result, and then use
>> RouteAttribute to direct the output. Is that what you have in mind? -Jim
>>
>> On Fri, Mar 17, 2017 at 4:59 PM, Adam Lamar <ad...@gmail.com> wrote:
>>
>>> Jim,
>>>
>>> Also keep in mind that as an object store, S3 uses "directories" only as
>>> a grouping concept, and not as a hierarchal storage mechanism. That's why
>>> the initial PutS3Object doesn't fail with a new "directory". See
>>> http://docs.aws.amazon.com/AmazonS3/latest/UG/FolderOperations.html
>>>
>>> I think James' advice is spot on - to accomplish what you need, you'll
>>> likely want to keep a list of known outputTargets in NiFi.
>>>
>>> Cheers,
>>> Adam
>>>
>>
>>
>

Re: How to reject S3 Writes if folder does not exist?

Posted by Adam Lamar <ad...@gmail.com>.
Jim,

Absolutely that's one way. Depending on how many directories you have, you
can also do it directly with RouteOnAttribute and the expression language:

Property name: s3exists
Property value: ${outputTarget:equals('foo'):or(${outputTarget:equals('bar')})}

Then route the s3exists relationship to PutS3Object.

The Python script strategy you mentioned may be good for a small to medium
number of directories.

The ListS3 strategy mentioned by James might be a better fit if the list is
too large to easily maintain by hand.

Hope that helps,
Adam


On Fri, Mar 17, 2017 at 3:07 PM, James McMahon <js...@gmail.com> wrote:

> So keep my list in a python script dictionary called by an ExecuteScript
> processor, and toss my outputTarget value against that. Set a new attribute
> s3exists to true or false in my script based on that result, and then use
> RouteAttribute to direct the output. Is that what you have in mind? -Jim
>
> On Fri, Mar 17, 2017 at 4:59 PM, Adam Lamar <ad...@gmail.com> wrote:
>
>> Jim,
>>
>> Also keep in mind that as an object store, S3 uses "directories" only as
>> a grouping concept, and not as a hierarchal storage mechanism. That's why
>> the initial PutS3Object doesn't fail with a new "directory". See
>> http://docs.aws.amazon.com/AmazonS3/latest/UG/FolderOperations.html
>>
>> I think James' advice is spot on - to accomplish what you need, you'll
>> likely want to keep a list of known outputTargets in NiFi.
>>
>> Cheers,
>> Adam
>>
>
>

Re: How to reject S3 Writes if folder does not exist?

Posted by James McMahon <js...@gmail.com>.
So keep my list in a Python script dictionary called by an ExecuteScript
processor, and check my outputTarget value against that. Set a new attribute
s3exists to true or false in my script based on that result, and then use
RouteOnAttribute to direct the output. Is that what you have in mind? -Jim
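
A minimal sketch of that lookup, written as a plain Python function so it can
be run outside NiFi. The wiring into ExecuteScript is left out. The names
KNOWN_TARGETS, REVIEW_FOLDER, route and the objectKey attribute are
assumptions for illustration, and 'foo' and 'bar' are only placeholder
targets. A set is used here instead of a dictionary since only membership
matters.

```python
# Hypothetical list of valid targets; in practice it would mirror the
# folders that are supposed to exist in the bucket.
KNOWN_TARGETS = {"foo", "bar"}
REVIEW_FOLDER = "for_review"


def route(attributes):
    """Return a copy of the attributes with 's3exists' and 'objectKey' added."""
    target = attributes.get("outputTarget", "")
    exists = target in KNOWN_TARGETS
    folder = target if exists else REVIEW_FOLDER
    updated = dict(attributes)
    # Attribute values are kept as strings, so store "true"/"false" text.
    updated["s3exists"] = "true" if exists else "false"
    updated["objectKey"] = folder + "/" + attributes["filename"]
    return updated


print(route({"outputTarget": "foo", "filename": "a.txt"}))
print(route({"outputTarget": "fo0", "filename": "a.txt"}))  # malformed target
```

With something like this, a downstream router only has to test whether
s3exists equals "true", and unknown targets end up under for_review.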

On Fri, Mar 17, 2017 at 4:59 PM, Adam Lamar <ad...@gmail.com> wrote:

> Jim,
>
> Also keep in mind that as an object store, S3 uses "directories" only as a
> grouping concept, and not as a hierarchal storage mechanism. That's why the
> initial PutS3Object doesn't fail with a new "directory". See
> http://docs.aws.amazon.com/AmazonS3/latest/UG/FolderOperations.html
>
> I think James' advice is spot on - to accomplish what you need, you'll
> likely want to keep a list of known outputTargets in NiFi.
>
> Cheers,
> Adam
>

Re: How to reject S3 Writes if folder does not exist?

Posted by Adam Lamar <ad...@gmail.com>.
Jim,

Also keep in mind that as an object store, S3 uses "directories" only as a
grouping concept, and not as a hierarchical storage mechanism. That's why the
initial PutS3Object doesn't fail with a new "directory". See
http://docs.aws.amazon.com/AmazonS3/latest/UG/FolderOperations.html

I think James' advice is spot on - to accomplish what you need, you'll
likely want to keep a list of known outputTargets in NiFi.

Cheers,
Adam

Re: How to reject S3 Writes if folder does not exist?

Posted by James McMahon <js...@gmail.com>.
Hmmm. Thank you James - I'm certainly willing to give this a try. Do you
have a link to an example that does a lookup against a DistributedMapCache
and takes two different workflow paths depending on the outcome? -Jim

On Fri, Mar 17, 2017 at 2:42 PM, James Wing <jv...@gmail.com> wrote:

> Jim,
>
> You could use ListS3 to get existing S3 keys, then parse out the
> 'directories', and put the directories in a key/value store for a lookup
> (like DistributedMapCache).  But you might also be able to maintain the
> lookup just with your metadata attributes in NiFi alone.
>
>
> Thanks,
>
> James
>
> On Fri, Mar 17, 2017 at 10:31 AM, James McMahon <js...@gmail.com>
> wrote:
>
>> Good afternoon. In my workflow I build an S3 output target from metadata
>> attributes. The vast majority of the time, the output target exists, and so
>> in my PutS3Object processor I set Objectkey to ${outputTarget}/${filename},
>> the target output folder exists, and my file is written to the right place
>> in my S3 bucket.
>>
>> On rare occasions the output target may not exist. Perhaps someone did an
>> Http Request with malformed incoming attributes. In this case I want to
>> feed the Failure from my first PutS3Object to a second PutS3Object that
>> hard wires the Objectkey to an existing for_review folder in my bucket.
>>
>> My problem is that the first PutS3Object appears to force the creation of
>> the malformed outputTarget named folder, and I can't get the error case to
>> cascade to the second S3 output processor. Is there a means to do this? Is
>> there a processor I can use prior to the S3 output processor to check for
>> the *existence *of the S3 folder, and output to either outputTarget (if
>> it exists), or to for_review (if it does not)?
>>
>> Thanks in advance for your help. -Jim
>>
>
>

Re: How to reject S3 Writes if folder does not exist?

Posted by James Wing <jv...@gmail.com>.
Jim,

You could use ListS3 to get existing S3 keys, then parse out the
'directories', and put the directories in a key/value store for a lookup
(like DistributedMapCache).  But you might also be able to maintain the
lookup just with your metadata attributes in NiFi alone.
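
The "parse out the directories" step could look roughly like the Python
sketch below. It assumes the keys have already been listed, so a hard-coded,
made-up list stands in for ListS3 output, and an ordinary dict stands in for
the key/value store. The function names parent_prefixes and
build_directory_cache are hypothetical.

```python
def parent_prefixes(key):
    """Yield every 'directory' prefix of a key, e.g. 'a/b/c.txt' -> 'a', 'a/b'."""
    parts = key.split("/")[:-1]  # drop the final segment (the object name)
    for i in range(1, len(parts) + 1):
        yield "/".join(parts[:i])


def build_directory_cache(keys):
    """Map each directory prefix found in the listing to a marker value."""
    cache = {}
    for key in keys:
        for prefix in parent_prefixes(key):
            cache[prefix] = "1"
    return cache


# Made-up listing; a real one would come from the bucket.
listed_keys = ["foo/a.txt", "bar/2017/03/b.txt", "toplevel.txt"]
cache = build_directory_cache(listed_keys)
print(sorted(cache))
print("foo" in cache, "baz" in cache)
```

A key with no "/" contributes no prefix, so objects at the top level of the
bucket do not create a directory entry in the lookup.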


Thanks,

James

On Fri, Mar 17, 2017 at 10:31 AM, James McMahon <js...@gmail.com>
wrote:

> Good afternoon. In my workflow I build an S3 output target from metadata
> attributes. The vast majority of the time, the output target exists, and so
> in my PutS3Object processor I set Objectkey to ${outputTarget}/${filename},
> the target output folder exists, and my file is written to the right place
> in my S3 bucket.
>
> On rare occasions the output target may not exist. Perhaps someone did an
> Http Request with malformed incoming attributes. In this case I want to
> feed the Failure from my first PutS3Object to a second PutS3Object that
> hard wires the Objectkey to an existing for_review folder in my bucket.
>
> My problem is that the first PutS3Object appears to force the creation of
> the malformed outputTarget named folder, and I can't get the error case to
> cascade to the second S3 output processor. Is there a means to do this? Is
> there a processor I can use prior to the S3 output processor to check for
> the *existence *of the S3 folder, and output to either outputTarget (if
> it exists), or to for_review (if it does not)?
>
> Thanks in advance for your help. -Jim
>