Posted to hdfs-user@hadoop.apache.org by Georgi Ivanov <iv...@vesseltracker.com> on 2014/09/22 16:40:54 UTC

Bzip2 files as an input to MR job

Hi guys,
I would like to compress the files on HDFS to save some storage.

As far as I can see, bzip2 is the only format that is splittable (and it is slow).

The actual files are Avro.

So in my driver class I have:

job.setInputFormatClass(AvroKeyInputFormat.class);

I have a number of jobs processing Avro files, so I would like to
keep the code changes to a minimum.

Is it possible to compress these Avro files with bzip2 and keep the code
of the MR jobs the same (or with little change)?
If it is, please give me some hints, as so far I haven't found any
good resources on the Internet.


Georgi

RE: Bzip2 files as an input to MR job

Posted by java8964 <ja...@hotmail.com>.
Georgi:
I think you misunderstood the original answer.
If you already use the Avro format, then the files are splittable. If you want to add compression on top of that, feel free to go ahead.
If you read the Avro DataFileWriter API:
http://avro.apache.org/docs/1.7.6/api/java/org/apache/avro/file/DataFileWriter.html#setCodec(org.apache.avro.file.CodecFactory)
you will see there is a setCodec method, which allows you to specify the codec used to compress your data.
Avro compresses the data block by block rather than the file as a whole, which is why the file stays splittable.
You can use bzip2, deflate (the algorithm gzip uses), snappy, or any other codec Avro supports. You just need to use the above API and make sure the compression codec is available on all your task nodes.
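A minimal sketch of the setCodec call described above, assuming the Avro Java library is on the classpath. The schema, class name, and field values are hypothetical and only there for illustration. It writes a file with the deflate codec and then reads it back without naming a codec, since the reader picks it up from the "avro.codec" entry in the file header:

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

class AvroCodecSketch {
    // Hypothetical schema, not taken from the thread.
    static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Position\",\"fields\":["
        + "{\"name\":\"vessel\",\"type\":\"string\"},"
        + "{\"name\":\"lat\",\"type\":\"double\"}]}");

    /** Writes `count` records into a block-compressed Avro container file. */
    public static void write(File file, int count) {
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(SCHEMA))) {
            // The codec has to be set before create() opens the file.
            writer.setCodec(CodecFactory.deflateCodec(6));
            writer.create(SCHEMA, file);
            for (int i = 0; i < count; i++) {
                GenericRecord record = new GenericData.Record(SCHEMA);
                record.put("vessel", "vessel-" + i);
                record.put("lat", 53.5);
                writer.append(record);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the codec name stored in the file header. */
    public static String readCodec(File file) {
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<GenericRecord>(file, new GenericDatumReader<GenericRecord>())) {
            return reader.getMetaString("avro.codec");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Counts records; the reader is never told which codec was used. */
    public static int count(File file) {
        int n = 0;
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<GenericRecord>(file, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord ignored : reader) {
                n++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return n;
    }
}
```

Swapping in CodecFactory.snappyCodec() or CodecFactory.bzip2Codec() changes only the writer side; those codecs need their supporting libraries on every node that reads or writes the files.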
Whether the compression codec itself is splittable does not matter in this case, as you are using Avro, which is splittable.
What you need to choose is which compression fits your application's use case better.
In our production environment we use snappy, as it gives us a good balance between compression ratio, read/decompression speed, and CPU usage.
Different compression codecs have different trade-offs. You need to compare them based on your case.
Yong

Date: Mon, 22 Sep 2014 17:21:29 +0200
From: ivanov@vesseltracker.com
To: user@hadoop.apache.org
Subject: Re: Bzip2 files as an input to MR job

> Hi Niels,
> Thanks for the reply.
> Changing the Avro files is not really an option for me, as it would
> require a lot of time (I have a lot).
> The Avro files themselves are compressed a bit.
> But bzip2 still gives 50% compression on one Avro file.
>
> So what I want is to use bzip2-compressed files as input to my MR jobs.
> Bzip2 is splittable.
> It should be possible somehow, but I can't seem to find out how at the moment.
>
> On 22.09.2014 17:13, Niels Basjes wrote:
> > Hi,
> >
> > You can use GZip inside the Avro files and still have splittable
> > Avro files.
> > This has to do with the fact that there is a block structure inside
> > the Avro file, and these blocks are gzipped.
> >
> > I suggest you simply try it.
> >
> > Niels
> >
> > On Mon, Sep 22, 2014 at 4:40 PM, Georgi Ivanov <iv...@vesseltracker.com> wrote:
> > > Hi guys,
> > > I would like to compress the files on HDFS to save some storage.
> > >
> > > As far as I can see, bzip2 is the only format that is splittable
> > > (and it is slow).
> > >
> > > The actual files are Avro.
> > >
> > > So in my driver class I have:
> > >
> > > job.setInputFormatClass(AvroKeyInputFormat.class);
> > >
> > > I have a number of jobs processing Avro files, so I would like to
> > > keep the code changes to a minimum.
> > >
> > > Is it possible to compress these Avro files with bzip2 and keep the
> > > code of the MR jobs the same (or with little change)?
> > > If it is, please give me some hints, as so far I haven't found any
> > > good resources on the Internet.
> > >
> > > Georgi
> >
> > --
> > Best regards / Met vriendelijke groeten,
> >
> > Niels Basjes

Re: Bzip2 files as an input to MR job

Posted by Georgi Ivanov <iv...@vesseltracker.com>.
Hi Niels,
Thanks for the reply.
Changing the Avro files is not really an option for me, as it would
require a lot of time (I have a lot).
The Avro files themselves are compressed a bit.
But bzip2 still gives 50% compression on one Avro file.

So what I want is to use bzip2-compressed files as input to my MR jobs.
Bzip2 is splittable.
It should be possible somehow, but I can't seem to find out how at the moment.

On 22.09.2014 17:13, Niels Basjes wrote:
> Hi,
>
> You can use GZip inside the Avro files and still have splittable
> Avro files.
> This has to do with the fact that there is a block structure inside
> the Avro file, and these blocks are gzipped.
>
> I suggest you simply try it.
>
> Niels
>
>
> On Mon, Sep 22, 2014 at 4:40 PM, Georgi Ivanov
> <ivanov@vesseltracker.com> wrote:
>
>     Hi guys,
>     I would like to compress the files on HDFS to save some storage.
>
>     As far as I can see, bzip2 is the only format that is splittable
>     (and it is slow).
>
>     The actual files are Avro.
>
>     So in my driver class I have:
>
>     job.setInputFormatClass(AvroKeyInputFormat.class);
>
>     I have a number of jobs processing Avro files, so I would
>     like to keep the code changes to a minimum.
>
>     Is it possible to compress these Avro files with bzip2 and keep
>     the code of the MR jobs the same (or with little change)?
>     If it is, please give me some hints, as so far I haven't
>     found any good resources on the Internet.
>
>
>     Georgi
>
>
> --
> Best regards / Met vriendelijke groeten,
>
> Niels Basjes


Re: Bzip2 files as an input to MR job

Posted by Niels Basjes <Ni...@basjes.nl>.
Hi,

You can use GZip inside the Avro files and still have splittable Avro
files.
This has to do with the fact that there is a block structure inside the
Avro file, and these blocks are gzipped.
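
To illustrate why block-level compression keeps a file splittable, here is a toy sketch using only the JDK. It is not Avro's actual file format (real Avro files also carry a header and sync markers between blocks), and the class and method names are made up for this example. Each block is compressed on its own, so a reader can decompress any one block without touching the blocks before it:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class BlockContainerSketch {

    /** Groups records into blocks and deflate-compresses each block independently. */
    public static List<byte[]> writeBlocks(List<String> records, int recordsPerBlock) {
        List<byte[]> blocks = new ArrayList<>();
        for (int i = 0; i < records.size(); i += recordsPerBlock) {
            int end = Math.min(i + recordsPerBlock, records.size());
            String joined = String.join("\n", records.subList(i, end));
            blocks.add(compress(joined.getBytes(StandardCharsets.UTF_8)));
        }
        return blocks;
    }

    /** Decompresses one block; no other block is needed to do so. */
    public static List<String> readBlock(byte[] block) {
        String text = new String(decompress(block), StandardCharsets.UTF_8);
        return Arrays.asList(text.split("\n"));
    }

    private static byte[] compress(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    private static byte[] decompress(byte[] data) {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && inflater.needsInput()) {
                    throw new IllegalStateException("truncated block");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt block", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}
```

A file that is gzipped as one whole stream has no such independent units, so a task would have to decompress it from the first byte. With independent blocks, each map task can start at a block boundary inside its split.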

I suggest you simply try it.

Niels


On Mon, Sep 22, 2014 at 4:40 PM, Georgi Ivanov <iv...@vesseltracker.com>
wrote:

> Hi guys,
> I would like to compress the files on HDFS to save some storage.
>
> As far as I can see, bzip2 is the only format that is splittable (and it is slow).
>
> The actual files are Avro.
>
> So in my driver class I have:
>
> job.setInputFormatClass(AvroKeyInputFormat.class);
>
> I have a number of jobs processing Avro files, so I would like to
> keep the code changes to a minimum.
>
> Is it possible to compress these Avro files with bzip2 and keep the code
> of the MR jobs the same (or with little change)?
> If it is, please give me some hints, as so far I haven't found any
> good resources on the Internet.
>
>
> Georgi
>



-- 
Best regards / Met vriendelijke groeten,

Niels Basjes
