Posted to dev@nifi.apache.org by jamesgreen <ja...@standardbank.co.za> on 2016/04/06 13:12:51 UTC

Compression of Data in HDFS

I am trying to compress a whole lot of files from my HDFS and write them to
another folder on the HDFS.
My folder structure is as follows:
\landing\databasename\prodeiw_arc\tablename\_SUCCESS
\landing\databasename\prodeiw_arc\tablename\part-m-00000

\landing\databasename\prodeiw_arc\tablename2\_SUCCESS
\landing\databasename\prodeiw_arc\tablename2\part-m-00000

I am trying to compress to the following
\landing\compressed\prodeiw_arc\tablename\_SUCCESS
\landing\compressed\prodeiw_arc\tablename\part-m-00000

\landing\compressed\prodeiw_arc\tablename2\_SUCCESS
\landing\compressed\prodeiw_arc\tablename2\part-m-00000

I have found that it compresses to 
\landing\compressed\prodeiw_arc\_SUCCESS
\landing\compressed\prodeiw_arc\tablename\part-m-00000

It will then continue to overwrite. Is there any way I can keep the directory
structure when doing a PutHDFS?

Thanks and Regards




--
View this message in context: http://apache-nifi-developer-list.39713.n7.nabble.com/Compression-of-Data-in-HDFS-tp8821.html
Sent from the Apache NiFi Developer List mailing list archive at Nabble.com.

Re: Compression of Data in HDFS

Posted by jamesgreen <ja...@standardbank.co.za>.
Hi Bryan
Thanks for your input; I did get it to work now. Sorry for the delayed
response.

Just to confirm: if it reads from a certain file, compresses it, and writes
the compressed file to the target directory, how does NiFi know that it has
already read from that file?
Or does it continue to read from random files?

Thanks 

James



--
View this message in context: http://apache-nifi-developer-list.39713.n7.nabble.com/Compression-of-Data-in-HDFS-tp8821p9061.html
Sent from the Apache NiFi Developer List mailing list archive at Nabble.com.

Re: Compression of Data in HDFS

Posted by Bryan Bende <bb...@gmail.com>.
Hi James,

It looks like there may be a typo in what I wrote... I had $[path} but it
should be ${path}

Sorry about that; can you let us know if that worked?

Thanks,

Bryan

On Thu, Apr 7, 2016 at 3:05 AM, jamesgreen <ja...@standardbank.co.za>
wrote:

> Hi Bryan
>
> I tried what you suggested but it just creates a path called
> "/landing/teradata/compressed/prodeiw_arc/$[path}" ?
>
>
>
> --
> View this message in context:
> http://apache-nifi-developer-list.39713.n7.nabble.com/Compression-of-Data-in-HDFS-tp8821p8861.html
> Sent from the Apache NiFi Developer List mailing list archive at
> Nabble.com.
>

Re: Compression of Data in HDFS

Posted by jamesgreen <ja...@standardbank.co.za>.
Hi Bryan

I tried what you suggested, but it just creates a path called
"/landing/teradata/compressed/prodeiw_arc/$[path}"?



--
View this message in context: http://apache-nifi-developer-list.39713.n7.nabble.com/Compression-of-Data-in-HDFS-tp8821p8861.html
Sent from the Apache NiFi Developer List mailing list archive at Nabble.com.

Re: Compression of Data in HDFS

Posted by Bryan Bende <bb...@gmail.com>.
Ok one more question...

On GetHDFS are you setting the Directory to
"\landing\databasename\prodeiw_arc\"
and then setting Recurse Sub-Directories to true to have it go into each
table's directory?

The reason I ask is because the FlowFiles coming out of GetHDFS have an
attribute on them called Path, the documentation says:

The path is set to the relative path of the file's directory on HDFS. For
example, if the Directory property is set to /tmp, then files picked up
from /tmp will have the path attribute set to "./". If the Recurse
Subdirectories property is set to true and a file is picked up from
/tmp/abc/1/2/3, then the path attribute will be set to "abc/1/2/3"

So theoretically if you were pointing to "\landing\databasename\prodeiw_arc\"
and then it recursed into "\landing\databasename\prodeiw_arc\tablename",
the path attribute would end up being "tablename".

You could then reference this in your PutHDFS processor by setting the
Directory to "/landing/teradata/compressed/prodeiw_arc/$[path}"
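With the brace typo fixed (as corrected elsewhere in this thread, the expression should read ${path}), the behavior described above can be sketched in Python. The directory names are taken from the thread; the helpers are purely illustrative and are not NiFi's actual implementation:

```python
import os

# Sketch: how GetHDFS derives the "path" attribute (the relative path of the
# file's directory under the configured Directory), and how a PutHDFS
# Directory of "/landing/teradata/compressed/prodeiw_arc/${path}" would then
# resolve per FlowFile.
SOURCE_ROOT = "/landing/databasename/prodeiw_arc"
TARGET_ROOT = "/landing/teradata/compressed/prodeiw_arc"

def path_attribute(file_dir: str, source_root: str) -> str:
    """Relative path of the file's directory, per the GetHDFS documentation."""
    rel = os.path.relpath(file_dir, source_root)
    return "./" if rel == "." else rel

def resolve_target(file_dir: str) -> str:
    """Substitute the path attribute into the Directory expression."""
    return os.path.join(TARGET_ROOT, path_attribute(file_dir, SOURCE_ROOT))

print(path_attribute(SOURCE_ROOT + "/tablename", SOURCE_ROOT))
# tablename
print(resolve_target(SOURCE_ROOT + "/tablename2"))
# /landing/teradata/compressed/prodeiw_arc/tablename2
```

Each table's files keep their own sub-directory in the target, which is exactly the structure the original question asked to preserve.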



On Wed, Apr 6, 2016 at 8:46 AM, jamesgreen <ja...@standardbank.co.za>
wrote:

> Hi Brian, Thanks for the help!
>
> I have tried two ways
> a.
> 1.      I use GetHDFS to retrieve data from the HDFS , I then use putHDFS
> and set
> the compression to GZIP.
> 2.      In the Directory I am putting the complete path i.e
> /landing/teradata/compressed/prodeiw_arc
> b.
> 1.       I use GetHDFS to retrieve data from the HDFS, I then use Compress
> Content to apply the compression and then use PutHDFS
> 2.      In the Directory I am putting the complete path i.e
> /landing/teradata/compressed/prodeiw_arc
>
>
>
>
> --
> View this message in context:
> http://apache-nifi-developer-list.39713.n7.nabble.com/Compression-of-Data-in-HDFS-tp8821p8825.html
> Sent from the Apache NiFi Developer List mailing list archive at
> Nabble.com.
>

Re: Compression of Data in HDFS

Posted by jamesgreen <ja...@standardbank.co.za>.
Hi Bryan, thanks for the help!

I have tried two ways:
a.
1.	I use GetHDFS to retrieve data from the HDFS; I then use PutHDFS and set
the compression to GZIP.
2.	In the Directory I am putting the complete path, i.e.
/landing/teradata/compressed/prodeiw_arc
b.
1.	I use GetHDFS to retrieve data from the HDFS; I then use CompressContent
to apply the compression and then use PutHDFS.
2.	In the Directory I am putting the complete path, i.e.
/landing/teradata/compressed/prodeiw_arc
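A minimal sketch of why both of these attempts flatten the structure: with a fixed Directory, every FlowFile lands at Directory plus its filename, so identically named files (part-m-00000, _SUCCESS) from different table directories map to the same target. The paths below come from the thread; the snippet only illustrates the collision, not PutHDFS itself:

```python
import os

# With a fixed target Directory, each incoming file is written to
# TARGET_DIR + "/" + filename, so files that share a name across the
# per-table source directories end up at the same target path.
TARGET_DIR = "/landing/teradata/compressed/prodeiw_arc"

sources = [
    "/landing/databasename/prodeiw_arc/tablename/part-m-00000",
    "/landing/databasename/prodeiw_arc/tablename2/part-m-00000",
]

targets = [os.path.join(TARGET_DIR, os.path.basename(s)) for s in sources]
print(targets)
# both entries are .../prodeiw_arc/part-m-00000
print(len(set(targets)))
# 1 -> the second write overwrites the first
```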




--
View this message in context: http://apache-nifi-developer-list.39713.n7.nabble.com/Compression-of-Data-in-HDFS-tp8821p8825.html
Sent from the Apache NiFi Developer List mailing list archive at Nabble.com.

Re: Compression of Data in HDFS

Posted by Bryan Bende <bb...@gmail.com>.
Hello,

Can you describe your flow a bit more?

Are you using ListHDFS + FetchHDFS to retrieve the data from HDFS?

What value do you have for the Directory property in PutHDFS?

Thanks,

Bryan

On Wed, Apr 6, 2016 at 7:12 AM, jamesgreen <ja...@standardbank.co.za>
wrote:

> I am trying to compress a whole lot of files from my HDFS and write to
> another folder on the HDFS
> My Folder Structure is as follows:
> \landing\databasename\prodeiw_arc\tablename\_SUCCESS
> \landing\databasename\prodeiw_arc\tablename\part-m-00000
>
> \landing\databasename\prodeiw_arc\tablename2\_SUCCESS
> \landing\databasename\prodeiw_arc\tablename2\part-m-00000
>
> I am trying to compress to the following
> \landing\compressed\prodeiw_arc\tablename\_SUCCESS
> \landing\compressed\prodeiw_arc\tablename\part-m-00000
>
> \landing\compressed\prodeiw_arc\tablename2\_SUCCESS
> \landing\compressed\prodeiw_arc\tablename2\part-m-00000
>
> I have found that it compresses to
> \landing\compressed\prodeiw_arc\_SUCCESS
> \landing\compressed\prodeiw_arc\tablename\part-m-00000
>
> it will then continue to overwrite. Is there anyway I can keep the
> directory
> structure when doing a PutHDFS?
>
> Thanks and Regards
>
>
>
>
> --
> View this message in context:
> http://apache-nifi-developer-list.39713.n7.nabble.com/Compression-of-Data-in-HDFS-tp8821.html
> Sent from the Apache NiFi Developer List mailing list archive at
> Nabble.com.
>