Posted to hdfs-user@hadoop.apache.org by Ramasubramanian Narayanan <ra...@gmail.com> on 2012/11/07 13:22:31 UTC

Doubts on compressed file

Hi,

If a gzip file is loaded into HDFS, will it get split into blocks
and stored in HDFS?

I understand that a single mapper can work with gzip as it reads the entire
file from beginning to end... In that case, if the gzip file is larger
than 128 MB, will it get split into blocks and stored in HDFS?

regards,
Rams

Re: Doubts on compressed file

Posted by Niels Basjes <Ni...@basjes.nl>.
Hi,

> If a gzip file is loaded into HDFS, will it get split into blocks and
> stored in HDFS?

Yes.

> I understand that a single mapper can work with gzip as it reads the entire
> file from beginning to end... In that case, if the gzip file is larger
> than 128 MB, will it get split into blocks and stored in HDFS?

Yes, and then the mapper will read the other parts of the file over the network.
So what I do is upload such files with a bigger HDFS block size, so that
the mapper has "the entire file" locally.
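
For illustration, a minimal sketch of that upload using the FileSystem API,
assuming hypothetical paths and a 1 GiB target block size (the client-side
property name depends on the release: "dfs.blocksize" on newer ones,
"dfs.block.size" on older ones):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadWithLargeBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Ask for a 1 GiB block size for files this client creates
        // (property name is an assumption; adjust it to your release).
        conf.setLong("dfs.blocksize", 1024L * 1024L * 1024L);

        FileSystem fs = FileSystem.get(conf);
        // Hypothetical paths: the copy is written with the block size set
        // above, so a gzip file under 1 GiB lands in a single HDFS block.
        fs.copyFromLocalFile(new Path("file:///data/big.gz"),
                             new Path("/user/rams/big.gz"));
        fs.close();
    }
}

The trade-off is that a very large block concentrates the whole read on
whichever node runs the single mapper, which is exactly the locality being
aimed for here.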

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes

Re: Doubts on compressed file

Posted by Harsh J <ha...@cloudera.com>.
Hi,

Yes, all files are split into block-size chunks in HDFS. HDFS is
agnostic to what the file's content is and to its attributes (such as
compression); that is left to the file-reader logic to handle.

When a gzip reader initializes, it reads the whole file length, across
all the blocks the file may have, which HDFS lets you do transparently:
you simply request however much data you want to read. HDFS ends up
reading the blocks serially for you, and your application only has to
take care of reading the actual gzip data, without worrying about block
boundaries.
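
As a rough sketch of what that looks like from the application side (the
path below is made up), Hadoop's codec factory plus a plain fs.open() is
enough; the returned stream hides the block boundaries entirely:

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class ReadGzipFromHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/rams/big.gz");  // hypothetical path

        // Resolve the codec from the file extension (.gz -> GzipCodec).
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(path);

        // fs.open() returns one continuous stream over all of the file's
        // blocks; HDFS fetches each block (local or remote) as the read
        // position advances, so the decompressor never sees a boundary.
        InputStream in = codec.createInputStream(fs.open(path));
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                // process one decompressed line
            }
        } finally {
            reader.close();
        }
    }
}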

On Wed, Nov 7, 2012 at 5:52 PM, Ramasubramanian Narayanan
<ra...@gmail.com> wrote:
> Hi,
>
> If a gzip file is loaded into HDFS, will it get split into blocks and
> stored in HDFS?
>
> I understand that a single mapper can work with gzip as it reads the entire
> file from beginning to end... In that case, if the gzip file is larger
> than 128 MB, will it get split into blocks and stored in HDFS?
>
> regards,
> Rams



-- 
Harsh J

RE: Doubts on compressed file

Posted by Jim Neofotistos <ji...@oracle.com>.
Gzip is decently fast, but it cannot take advantage of Hadoop's natural map
splits because it is impossible to start decompressing a gzip stream at an
arbitrary offset in the file.

LZO is a wonderful compression scheme to use with Hadoop because it is incredibly fast and, with a bit of work, splittable: LZO's block format makes it possible to start decompressing at certain specific offsets of the file, namely those that start new LZO blocks.
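
As a rough illustration, assuming a Hadoop release that ships the
SplittableCompressionCodec interface: the built-in input formats decide
whether a compressed input may be split by checking whether its codec
implements that interface. The gzip codec does not, while the bzip2 codec,
for example, does; splittable LZO relies on the separate hadoop-lzo
project's index files and LZO-aware input format rather than on this
interface.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplittabilityCheck {
    public static void main(String[] args) {
        CompressionCodecFactory factory =
                new CompressionCodecFactory(new Configuration());

        // The same instanceof test TextInputFormat.isSplitable() performs.
        // GzipCodec does not implement SplittableCompressionCodec, so a
        // .gz input is handed to a single mapper as one split.
        CompressionCodec gzip = factory.getCodec(new Path("logs.gz"));
        System.out.println("gzip splittable:  "
                + (gzip instanceof SplittableCompressionCodec));   // false

        // BZip2Codec does implement it, so a .bz2 input can be split.
        CompressionCodec bzip2 = factory.getCodec(new Path("logs.bz2"));
        System.out.println("bzip2 splittable: "
                + (bzip2 instanceof SplittableCompressionCodec));  // true
    }
}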

James Neofotistos
Senior Sales Consultant
Emerging Markets East
Phone: 1-781-565-1890 | Mobile: 1-603-759-7889
Email: jim.neofotistos@oracle.com

From: Ramasubramanian Narayanan [mailto:ramasubramanian.narayanan@gmail.com] 
Sent: Wednesday, November 07, 2012 7:23 AM
To: user@hadoop.apache.org
Subject: Doubts on compressed file

Hi,

If a gzip file is loaded into HDFS, will it get split into blocks and stored in HDFS?

I understand that a single mapper can work with gzip as it reads the entire file from beginning to end... In that case, if the gzip file is larger than 128 MB, will it get split into blocks and stored in HDFS?

regards,
Rams
