Posted to users@nifi.apache.org by Mark Petronic <ma...@gmail.com> on 2015/10/25 04:08:14 UTC

ExecuteStreamCommand processor for "tail -n +2" not working as expected

Just starting to use NiFi and built a flow that implements the following:

unzip -p my.zip *LMTD* | tail -n +2 | gzip --fast | hdfs dfs -put - /some/hdfs/file

I used the following processor flow:

ExecuteProcess(unzip -p) -> ExecuteStreamCommand(tail -n +2) ->
CompressContent(gzip) -> PutHDFS

Couple questions/observations:

1. I got hung up for a while on the ExecuteStreamCommand(tail -n +2)
part. I need that to strip the header line off of CSV files. I did not
see a simple way using a specific processor to strip off the first
line of a flow file. Is there a better way? I also noticed a very
odd behavior of this command. If I configured the command arguments as
"-n +2" (without the quotes, with a space between the two parts), the
command behaved like "tail -n2". So, instead of giving
me all EXCEPT the first line, I only got the last 2 lines. However,
using "-n+2" (without the quotes and with the space REMOVED) it worked as
expected. I believe this is confusing to the user. Both forms work
perfectly from the bash command line, but only one works in NiFi.
Anyone care to comment on this? Should there be an enhancement to
remove this sort of inconsistent behavior?

2. Regarding my need to unzip ONLY one specific file from the zip
files (the one that matches *LMTD*), I did not see a way to do that
using the UnpackContent processor. Seems like it will only unzip the
whole zip file and provide me index numbers for each file unpacked.
This would be quite inefficient in my case because there are a number
of large files inside the zip file and I only need one. So, seems like
I am doing this the preferred way but, being new to NiFi, just wanted
to see if there are any other ideas on how to do this?

Thanks in advance for thoughts on this
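
For reference, the two behaviors described in point 1 can be reproduced
outside NiFi. This is a minimal Python sketch, assuming the flow file
content is newline-terminated text; the function names and the sample
data are made up for illustration and are not part of NiFi.

```python
import io
import itertools
from collections import deque


def from_second_line(stream):
    """Like `tail -n +2`: yield every line starting with line 2."""
    return itertools.islice(stream, 1, None)


def last_two_lines(stream):
    """Like `tail -n 2`: keep only the final two lines."""
    return list(deque(stream, maxlen=2))


# Hypothetical CSV content with a header row.
sample = "id,name\n1,alpha\n2,beta\n3,gamma\n"

print(list(from_second_line(io.StringIO(sample))))  # header dropped
print(last_two_lines(io.StringIO(sample)))          # only the last two rows
```

Both helpers read the stream line by line, so neither needs the whole
file in memory, which matters for CSV files of the size discussed here.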

Re: ExecuteStreamCommand processor for "tail -n +2" not working as expected

Posted by Mark Petronic <ma...@gmail.com>.
Thanks, Adam. I can't believe I missed that note about delimiting with
a semicolon. I guess I was using the same format specified in the
ExecuteProcess processor, which says args are space delimited. Hmmm,
maybe there should be a change to make args handling consistent across
processors? Anyway, I tried various combinations and it still only
worked using no spaces: "-n+2". I have since opted for a different
approach and am no longer using tail.


Re: ExecuteStreamCommand processor for "tail -n +2" not working as expected

Posted by Mark Payne <ma...@hotmail.com>.
Joe,

Ultimately, we couldn't change the behavior without breaking backward compatibility.

We do have a ticket [1] to add an "Argument Delimiter" property that is completed and
will be included in 0.4.0. It will default to semi-colon in order to maintain backward compatibility
but it can be changed to a space. It will at least make it more obvious that there's a funky
delimiter being used.

Thanks
-Mark


[1] https://issues.apache.org/jira/browse/NIFI-604




Re: ExecuteStreamCommand processor for "tail -n +2" not working as expected

Posted by Joe Witt <jo...@gmail.com>.
Mark

Ok, understood.  I think ultimately in the case of ZIP the IO is
happening anyway, but if we can avoid writing uninteresting items to
our repositories at all, then great.  Do you mind filing a JIRA for
that?

And yes, you are absolutely right that you should be able to expect/get
a consistent behavior between executecommand/script processors.  We
have discussed this before.  I didn't find a JIRA.  Anyone else know
the status of this?

Thanks
Joe


Re: ExecuteStreamCommand processor for "tail -n +2" not working as expected

Posted by Mark Petronic <ma...@gmail.com>.
Joe, yes, I wanted to be able to selectively unzip a specific file
from a zip archive. For example, I have this zip archive and want to
just pull all files that match *LMTD* from it to standard out as a
stream to feed into HDFS as a file put. Since there are a bunch of big
files there, it is really wasteful of network I/O to have to stream
the whole file just to throw away most of the bits in a later
filter stage and end up with only part of them. I like
efficiency where it makes sense and there is already a lot of I/O from
Hadoop - no need to add more unnecessary stuff that could be easily
avoided. :)

unzip -l /import/nms/prod/stats/Terminal/GW12/ConsolidatedTermStats_20151022021503.zip
Archive:  /import/nms/prod/stats/Terminal/GW12/ConsolidatedTermStats_20151022021503.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
 73166261  10-22-2015 02:17   Consolidated_LMTD_001_20151022021503.csv
 80864628  10-22-2015 02:17   Consolidated_MODC_001_20151022021503.csv
 14033836  10-22-2015 02:17   Consolidated_SYMC_001_20151022021503.csv
   120463  10-22-2015 02:17   Consolidated_XPRT_001_20151022021503.csv
---------                     -------
168185188                     4 files
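
To illustrate the kind of selective extraction being asked for, here is
a rough Python sketch. It is a standalone illustration using the
standard zipfile module, not how UnpackContent works. The function name
extract_matching, its parameters, and the tiny in-memory archive are
assumptions made up for the example. Only entries whose names match the
pattern are decompressed; the header line is dropped and the rest is
gzipped at the fastest level, roughly like "gzip --fast".

```python
import fnmatch
import gzip
import io
import shutil
import zipfile


def extract_matching(zip_source, pattern, out_stream, skip_header=True):
    """Gzip the entries whose names match `pattern` into `out_stream`.

    Entries that do not match are never decompressed. Returns the names
    of the entries that were written.
    """
    written = []
    with zipfile.ZipFile(zip_source) as archive:
        with gzip.GzipFile(fileobj=out_stream, mode="wb", compresslevel=1) as gz:
            for info in archive.infolist():
                if not fnmatch.fnmatchcase(info.filename, pattern):
                    continue
                with archive.open(info) as member:
                    if skip_header:
                        member.readline()  # discard the CSV header line
                    shutil.copyfileobj(member, gz)
                written.append(info.filename)
    return written


# Build a small in-memory archive with made-up contents for the demo.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as z:
    z.writestr("Consolidated_LMTD_001.csv", "h1,h2\n1,2\n3,4\n")
    z.writestr("Consolidated_MODC_001.csv", "h1,h2\n9,9\n")
demo.seek(0)

out = io.BytesIO()
names = extract_matching(demo, "*LMTD*", out)
print(names)
print(gzip.decompress(out.getvalue()))
```

In a real flow the gzipped stream would go on to HDFS rather than into
a BytesIO buffer.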


Re: ExecuteStreamCommand processor for "tail -n +2" not working as expected

Posted by Joe Witt <jo...@gmail.com>.
Hello

For the unpacking portion, are you saying you have a single archive
(let's say in zip format) that contains multiple objects, and you'd
like to be able to use UnpackContent but tell it to skip or include
specific items based on a regex or something against the names?

That seems reasonable to do but just wanted to make sure I understood.
For now you can put a RouteOnAttribute processor after Unpack and just
route to throw away unbundled items you don't care about.  You can
create a property on that processor called 'stuff-i-dont-want' and the
value would be something like
${filename:matches('.*stuff-i-dont-want.*')}.
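
A note on the pattern syntax, with a small sketch. As far as I
understand the Expression Language guide, matches() takes a regular
expression and requires it to match the entire attribute value, so a
shell-style glob such as *LMTD* needs to be written as .*LMTD.*
instead. The Python below only approximates that behavior with
re.fullmatch (NiFi uses Java regular expressions, which differ in some
details); the helper name matches is an assumption for illustration,
and the file names are taken from the unzip listing in this thread.

```python
import re


def matches(subject, regex):
    """Return True only if `regex` matches the whole of `subject`."""
    return re.fullmatch(regex, subject) is not None


filenames = [
    "Consolidated_LMTD_001_20151022021503.csv",
    "Consolidated_MODC_001_20151022021503.csv",
    "Consolidated_SYMC_001_20151022021503.csv",
    "Consolidated_XPRT_001_20151022021503.csv",
]

wanted = [name for name in filenames if matches(name, r".*LMTD.*")]
unwanted = [name for name in filenames if not matches(name, r".*LMTD.*")]

# A shell-style glob is not a regular expression: a leading "*" has
# nothing to repeat, so compiling it fails.
try:
    re.compile("*LMTD*")
    glob_is_valid_regex = True
except re.error:
    glob_is_valid_regex = False

print(wanted)
print(len(unwanted), glob_is_valid_regex)
```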

Thanks
Joe


Re: ExecuteStreamCommand processor for "tail -n +2" not working as expected

Posted by Adam Lamar <ad...@gmail.com>.
Mark,

> If I configured the command arguments as
> "-n +2" (without the quotes and space between the two parts), the
> command would result in a "tail -n2" behavior.

If you look at the tooltip for the Command Arguments property in
ExecuteStreamCommand, you'll see that the arguments need to be delimited
by a semicolon. Maybe try "-n;+2" instead? I'm not sure of the exact
rules in NiFi, but I've seen similar behavior with regard to spaces in
libraries that execute processes with command line arguments.
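
To make the difference concrete, here is a small Python sketch. It
assumes the processor splits the property value on the delimiter only
and does no whitespace handling of its own; split_arguments is a
hypothetical stand-in for illustration, not NiFi's code. A shell, by
contrast, splits on whitespace, which is what shlex.split mimics.

```python
import shlex


def split_arguments(raw, delimiter=";"):
    """Split a raw argument string on `delimiter` only; spaces are kept."""
    return raw.split(delimiter)


print(split_arguments("-n +2"))  # ['-n +2']    one argument containing a space
print(split_arguments("-n;+2"))  # ['-n', '+2'] two separate arguments
print(shlex.split("-n +2"))      # ['-n', '+2'] what a shell would pass to tail
```

With "-n +2" the command receives a single argument, so the value tail
sees for -n starts with a space rather than a plus sign, and how that is
interpreted is up to the tail implementation.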

There probably is a better way to process the CSV, but I'm afraid 
someone else will need to comment on that.

> Seems like it will only unzip the
> whole zip file and provide me index numbers for each file unpacked.

A quick look at the UnpackContent source [1] suggests that there is no 
way to filter the filenames inside the zipfile prior to extraction. I 
agree that would be a useful feature. Maybe one of the NiFi devs will 
comment on the possibility of including it as a feature in the future.

Cheers,
Adam


[1] 
https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/UnpackContent.java#L304

