Posted to mapreduce-user@hadoop.apache.org by Mapred Learn <ma...@gmail.com> on 2011/06/20 06:02:02 UTC

How to split a big file in HDFS by size

Hi,
I am trying to upload text files that are 60 GB or more in size.
I want to split these files into smaller files of, say, 1 GB each so that I can run further map-red jobs on them.

Does anybody have an idea how I can do this?
Thanks a lot in advance! Any ideas are greatly appreciated!

-JJ

Re: AW: How to split a big file in HDFS by size

Posted by Niels Basjes <Ni...@basjes.nl>.
Hi,

On Tue, Jun 21, 2011 at 16:14, Mapred Learn <ma...@gmail.com> wrote:
> The problem is that when one text file goes onto HDFS as a 60 GB file, one mapper takes
> more than an hour to convert it to a sequence file and finally fails.
>
> I was thinking about how to split it on the client box before uploading to HDFS.

Have a look at this:

http://stackoverflow.com/questions/3960651/splitting-gzipped-logfiles-without-storing-the-ungzipped-splits-on-disk
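
The same idea in Java terms, as a rough and untested sketch (the local .gz path
and the HDFS target directory below are placeholder arguments): decompress the
local gzip as a stream, cut it at line boundaries into roughly 1 GB parts, and
re-gzip each part straight into HDFS, so the uncompressed splits never land on
local disk.

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GzipSplitUpload {
  public static void main(String[] args) throws Exception {
    final long targetBytes = 1L << 30;        // ~1 GB of uncompressed text per part
    FileSystem fs = FileSystem.get(new Configuration());

    BufferedReader in = new BufferedReader(new InputStreamReader(
        new GZIPInputStream(new FileInputStream(args[0])), "UTF-8"));

    int part = 0;
    long written = 0;
    GZIPOutputStream out = null;
    String line;
    while ((line = in.readLine()) != null) {
      if (out == null || written >= targetBytes) {
        if (out != null) out.close();         // finish the previous part
        Path p = new Path(args[1], String.format("part-%05d.gz", part++));
        out = new GZIPOutputStream(fs.create(p));  // re-compress straight into HDFS
        written = 0;
      }
      byte[] bytes = (line + "\n").getBytes("UTF-8");
      out.write(bytes);                       // split only at line boundaries
      written += bytes.length;
    }
    if (out != null) out.close();
    in.close();
  }
}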


> If I read the file and split it with FileStream.Read() based on size, it takes
> 2 hours to process one 60 GB file and upload it to HDFS as 120 500 MB files.
> Sent from my iPhone
> On Jun 21, 2011, at 2:57 AM, Evert Lammerts <Ev...@sara.nl> wrote:
>
> What we did was on non-Hadoop hardware. We streamed the file from a storage
> cluster to a single machine and cut it up while streaming the pieces back to
> the storage cluster. That will probably not work for you, unless you have
> the hardware for it. But even then it is inefficient.
>
>
>
> You should be able to unzip your file in a MR job. If you still want to use
> compression, you can install LZO and re-zip the file from within the same job.
> (LZO uses block-compression, which allows Hadoop to process all blocks in
> parallel.) Note that you’ll need enough storage capacity. I don’t have
> example code, but I’m guessing Google can help.
>
>
>
>
>
>
>
> From: Mapred Learn [mailto:mapred.learn@gmail.com]
> Sent: Monday, 20 June 2011 18:09
> To: Niels Basjes; Evert Lammerts
> Subject: Re: AW: How to split a big file in HDFS by size
>
>
>
> Thanks for sharing.
>
>
>
> Could you guys share how you are dividing your 2.7 TB into files of 10 GB each
> on HDFS? That would be helpful for me!
>
>
>
>
>
>
>
> On Mon, Jun 20, 2011 at 8:39 AM, Marcos Ortiz <ml...@uci.cu> wrote:
>
> Evert Lammerts at Sara.nl did something similar to your problem, splitting a
> big 2.7 TB file into chunks of 10 GB.
> This work was presented at the BioAssist Programmers' Day in January of this
> year, and its name was
> "Large-Scale Data Storage and Processing for Scientist in The Netherlands"
>
> http://www.slideshare.net/evertlammerts
>
> P.S.: I sent the message with a copy to him
>
> On 6/20/2011 10:38 AM, Niels Basjes wrote:
>
>
>
> Hi,
>
> On Mon, Jun 20, 2011 at 16:13, Mapred Learn<ma...@gmail.com>  wrote:
>
>
> But this file is a gzipped text file. In this case, it will only go to one
> mapper, unlike the case where it is
> split into 60 1 GB files, which would make the map-red job finish earlier than one
> 60 GB file, since it would
> have 60 mappers running in parallel. Isn't that so?
>
>
> Yes, that is very true.
>
>
>
> --
>
> Marcos Luís Ortíz Valmaseda
>  Software Engineer (UCI)
>  http://marcosluis2186.posterous.com
>  http://twitter.com/marcosluis2186
>
>
>



-- 
Best regards / Met vriendelijke groeten,

Niels Basjes

Re: AW: How to split a big file in HDFS by size

Posted by Marcos Ortiz <ml...@uci.cu>.
Evert Lammerts at Sara.nl did something similar to your problem, splitting
a big 2.7 TB file into chunks of 10 GB.
This work was presented at the BioAssist Programmers' Day in January of
this year, and its name was
"Large-Scale Data Storage and Processing for Scientist in The Netherlands"

http://www.slideshare.net/evertlammerts

P.S.: I sent the message with a copy to him

On 6/20/2011 10:38 AM, Niels Basjes wrote:
> Hi,
>
> On Mon, Jun 20, 2011 at 16:13, Mapred Learn<ma...@gmail.com>  wrote:
>    
>> But this file is a gzipped text file. In this case, it will only go to one mapper, unlike the case where it is
>> split into 60 1 GB files, which would make the map-red job finish earlier than one 60 GB file, since it would
>> have 60 mappers running in parallel. Isn't that so?
>>      
> Yes, that is very true.
>
>    

-- 
Marcos Luís Ortíz Valmaseda
  Software Engineer (UCI)
  http://marcosluis2186.posterous.com
  http://twitter.com/marcosluis2186
   


Re: AW: How to split a big file in HDFS by size

Posted by Niels Basjes <Ni...@basjes.nl>.
Hi,

On Mon, Jun 20, 2011 at 16:13, Mapred Learn <ma...@gmail.com> wrote:
> But this file is a gzipped text file. In this case, it will only go to one mapper, unlike the case where it is
> split into 60 1 GB files, which would make the map-red job finish earlier than one 60 GB file, since it would
> have 60 mappers running in parallel. Isn't that so?

Yes, that is very true.
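
The underlying reason is that gzip is not a splittable codec, so the whole .gz
file becomes one input split and therefore one mapper. A quick, rough sketch
that checks which codec a file maps to and whether it is splittable (assuming a
configured Hadoop client and a release that ships SplittableCompressionCodec;
the path argument is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class CheckSplittable {
  public static void main(String[] args) {
    CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
    Path p = new Path(args[0]);                       // e.g. /data/input/big.gz
    CompressionCodec codec = factory.getCodec(p);     // matched by file extension
    if (codec == null) {
      System.out.println(p + ": no codec, plain text is splittable");
    } else {
      // GzipCodec does not implement SplittableCompressionCodec, so the whole
      // file ends up in a single input split (= a single mapper).
      boolean splittable = codec instanceof SplittableCompressionCodec;
      System.out.println(p + ": " + codec.getClass().getSimpleName()
          + (splittable ? " is splittable" : " is NOT splittable"));
    }
  }
}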

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes

AW: AW: How to split a big file in HDFS by size

Posted by Christoph Schmitz <Ch...@1und1.de>.
You need to figure out where your actual computation takes place. I would recommend proceeding in two steps:

1. Import your 60 GB file into HDFS. Transform it into a splittable format (e.g. a SequenceFile).
2. Do computation on that file in HDFS.

That way, step 2 is parallelized, no matter what your file looks like in HDFS. Your problem would then be step 1 - that is the part I was talking about.

In theory, you could skip step 1 and try to work directly on your non-splittable 60 GB file, but then no parallelization could take place.
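
For step 1, one possible approach is to do the import and the conversion in a
single pass from the client machine. A rough, untested sketch (the local path
and the HDFS target path below are placeholders): read the local text file line
by line and write it directly into HDFS as a block-compressed SequenceFile,
which later jobs can split across many mappers.

import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class TextToSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Block compression keeps the SequenceFile splittable for later MR jobs.
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
        new Path(args[1]), LongWritable.class, Text.class,
        SequenceFile.CompressionType.BLOCK, new DefaultCodec());

    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    LongWritable key = new LongWritable();
    Text value = new Text();
    long lineNo = 0;
    String line;
    while ((line = in.readLine()) != null) {
      key.set(lineNo++);       // key = line number, value = line text
      value.set(line);
      writer.append(key, value);
    }
    in.close();
    writer.close();
  }
}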

Regards,
Christoph

-----Original Message-----
From: Mapred Learn [mailto:mapred.learn@gmail.com]
Sent: Monday, 20 June 2011 16:14
To: mapreduce-user@hadoop.apache.org
Cc: mapreduce-user@hadoop.apache.org
Subject: Re: AW: How to split a big file in HDFS by size

But this file is a gzipped text file. In this case, it will only go to one mapper, unlike the case where it is split into 60 1 GB files, which would make the map-red job finish earlier than one 60 GB file, since it would have 60 mappers running in parallel. Isn't that so?

Sent from my iPhone

On Jun 20, 2011, at 12:59 AM, Christoph Schmitz <Ch...@1und1.de> wrote:

> Simple answer: don't. The Hadoop framework will take care of that for you and split the file. The logical 60 GB file you see in the HDFS actually *is* split into smaller chunks (default size is 64 MB) and physically distributed across the cluster.
> 
> Regards,
> Christoph
> 
> -----Original Message-----
> From: Mapred Learn [mailto:mapred.learn@gmail.com]
> Sent: Monday, 20 June 2011 08:36
> To: mapreduce-user@hadoop.apache.org
> Subject: Re: How to split a big file in HDFS by size
> 
> Hi Christoph,
> If I get all 60 GB onto HDFS, can I then split it into 60 1 GB files and then run a map-red job on those 60 fixed-length text files? If yes, do you have any idea how to do this?
> 
> 
> 
> 
> On Sun, Jun 19, 2011 at 11:28 PM, Christoph Schmitz <Ch...@1und1.de> wrote:
> 
> 
>    JJ,
>    
>    uploading 60 GB single-threaded (i.e. hadoop fs -copyFromLocal etc.) will be slow. If possible, try to get the files in smaller chunks where they are created, and upload them in parallel with a simple MapReduce job that only passes the data through (i.e. uses the standard Mapper and Reducer classes). This job should read from your local input directory and output into the HDFS.
>    
>    If you cannot split the 60 GB where they are created, IMHO there is not much you can do. If you have a file format with, say, fixed length records, you could try to create your own InputFormat that splits the file logically without creating the actual splits locally (which would be too costly, I assume).
>    
>    The performance of reading in parallel, though, will depend to a large extent on the nature of your local storage. If you have a single hard drive, reading in parallel might actually be slower than reading serially because it means a lot of random disk accesses.
>    
>    Regards,
>    Christoph
>    
>    -----Original Message-----
>    From: Mapred Learn [mailto:mapred.learn@gmail.com]
>    Sent: Monday, 20 June 2011 06:02
>    To: mapreduce-user@hadoop.apache.org; cdh-user@cloudera.org
>    Subject: How to split a big file in HDFS by size
>    
> 
>    Hi,
>    I am trying to upload text files that are 60 GB or more in size.
>    I want to split these files into smaller files of, say, 1 GB each so that I can run further map-red jobs on them.
>
>    Does anybody have an idea how I can do this?
>    Thanks a lot in advance! Any ideas are greatly appreciated!
>    
>    -JJ
>    
> 
> 

Re: AW: How to split a big file in HDFS by size

Posted by Mapred Learn <ma...@gmail.com>.
But this file is a gzipped text file. In this case, it will only go to one mapper, unlike the case where it is split into 60 1 GB files, which would make the map-red job finish earlier than one 60 GB file, since it would have 60 mappers running in parallel. Isn't that so?

Sent from my iPhone

On Jun 20, 2011, at 12:59 AM, Christoph Schmitz <Ch...@1und1.de> wrote:

> Simple answer: don't. The Hadoop framework will take care of that for you and split the file. The logical 60 GB file you see in the HDFS actually *is* split into smaller chunks (default size is 64 MB) and physically distributed across the cluster.
> 
> Regards,
> Christoph
> 
> -----Original Message-----
> From: Mapred Learn [mailto:mapred.learn@gmail.com]
> Sent: Monday, 20 June 2011 08:36
> To: mapreduce-user@hadoop.apache.org
> Subject: Re: How to split a big file in HDFS by size
> 
> Hi Christoph,
> If I get all 60 GB onto HDFS, can I then split it into 60 1 GB files and then run a map-red job on those 60 fixed-length text files? If yes, do you have any idea how to do this?
> 
> 
> 
> 
> On Sun, Jun 19, 2011 at 11:28 PM, Christoph Schmitz <Ch...@1und1.de> wrote:
> 
> 
>    JJ,
>    
>    uploading 60 GB single-threaded (i.e. hadoop fs -copyFromLocal etc.) will be slow. If possible, try to get the files in smaller chunks where they are created, and upload them in parallel with a simple MapReduce job that only passes the data through (i.e. uses the standard Mapper and Reducer classes). This job should read from your local input directory and output into the HDFS.
>    
>    If you cannot split the 60 GB where they are created, IMHO there is not much you can do. If you have a file format with, say, fixed length records, you could try to create your own InputFormat that splits the file logically without creating the actual splits locally (which would be too costly, I assume).
>    
>    The performance of reading in parallel, though, will depend to a large extent on the nature of your local storage. If you have a single hard drive, reading in parallel might actually be slower than reading serially because it means a lot of random disk accesses.
>    
>    Regards,
>    Christoph
>    
>    -----Original Message-----
>    From: Mapred Learn [mailto:mapred.learn@gmail.com]
>    Sent: Monday, 20 June 2011 06:02
>    To: mapreduce-user@hadoop.apache.org; cdh-user@cloudera.org
>    Subject: How to split a big file in HDFS by size
>    
> 
>    Hi,
>    I am trying to upload text files that are 60 GB or more in size.
>    I want to split these files into smaller files of, say, 1 GB each so that I can run further map-red jobs on them.
>
>    Does anybody have an idea how I can do this?
>    Thanks a lot in advance! Any ideas are greatly appreciated!
>    
>    -JJ
>    
> 
> 

AW: How to split a big file in HDFS by size

Posted by Christoph Schmitz <Ch...@1und1.de>.
Simple answer: don't. The Hadoop framework will take care of that for you and split the file. The logical 60 GB file you see in the HDFS actually *is* split into smaller chunks (default size is 64 MB) and physically distributed across the cluster.
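
You can verify this yourself by asking the NameNode for the block locations of
the file you uploaded. A small sketch, assuming a configured Hadoop client (the
path argument is a placeholder):

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus stat = fs.getFileStatus(new Path(args[0]));   // e.g. /data/big_file.txt
    BlockLocation[] blocks = fs.getFileBlockLocations(stat, 0, stat.getLen());
    System.out.println(args[0] + ": " + stat.getLen() + " bytes in " + blocks.length
        + " blocks (HDFS block size " + stat.getBlockSize() + " bytes)");
    for (BlockLocation b : blocks) {
      // Each entry is one physical chunk, replicated on the listed datanodes.
      System.out.println("  offset " + b.getOffset() + ", length " + b.getLength()
          + ", hosts " + Arrays.toString(b.getHosts()));
    }
  }
}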

Regards,
Christoph

-----Original Message-----
From: Mapred Learn [mailto:mapred.learn@gmail.com]
Sent: Monday, 20 June 2011 08:36
To: mapreduce-user@hadoop.apache.org
Subject: Re: How to split a big file in HDFS by size

Hi Christoph,
If I get all 60 GB onto HDFS, can I then split it into 60 1 GB files and then run a map-red job on those 60 fixed-length text files? If yes, do you have any idea how to do this?
 


 
On Sun, Jun 19, 2011 at 11:28 PM, Christoph Schmitz <Ch...@1und1.de> wrote:


	JJ,
	
	uploading 60 GB single-threaded (i.e. hadoop fs -copyFromLocal etc.) will be slow. If possible, try to get the files in smaller chunks where they are created, and upload them in parallel with a simple MapReduce job that only passes the data through (i.e. uses the standard Mapper and Reducer classes). This job should read from your local input directory and output into the HDFS.
	
	If you cannot split the 60 GB where they are created, IMHO there is not much you can do. If you have a file format with, say, fixed length records, you could try to create your own InputFormat that splits the file logically without creating the actual splits locally (which would be too costly, I assume).
	
	The performance of reading in parallel, though, will depend to a large extent on the nature of your local storage. If you have a single hard drive, reading in parallel might actually be slower than reading serially because it means a lot of random disk accesses.
	
	Regards,
	Christoph
	
	-----Original Message-----
	From: Mapred Learn [mailto:mapred.learn@gmail.com]
	Sent: Monday, 20 June 2011 06:02
	To: mapreduce-user@hadoop.apache.org; cdh-user@cloudera.org
	Subject: How to split a big file in HDFS by size
	

	Hi,
	I am trying to upload text files that are 60 GB or more in size.
	I want to split these files into smaller files of, say, 1 GB each so that I can run further map-red jobs on them.

	Does anybody have an idea how I can do this?
	Thanks a lot in advance! Any ideas are greatly appreciated!
	
	-JJ
	



Re: How to split a big file in HDFS by size

Posted by Mapred Learn <ma...@gmail.com>.
Hi Christoph,
If I get all 60 GB onto HDFS, can I then split it into 60 1 GB files and then
run a map-red job on those 60 fixed-length text files? If yes, do you have
any idea how to do this?




On Sun, Jun 19, 2011 at 11:28 PM, Christoph Schmitz <
Christoph.Schmitz@1und1.de> wrote:

> JJ,
>
> uploading 60 GB single-threaded (i.e. hadoop fs -copyFromLocal etc.) will
> be slow. If possible, try to get the files in smaller chunks where they are
> created, and upload them in parallel with a simple MapReduce job that only
> passes the data through (i.e. uses the standard Mapper and Reducer classes).
> This job should read from your local input directory and output into the
> HDFS.
>
> If you cannot split the 60 GB where they are created, IMHO there is not
> much you can do. If you have a file format with, say, fixed length records,
> you could try to create your own InputFormat that splits the file logically
> without creating the actual splits locally (which would be too costly, I
> assume).
>
> The performance of reading in parallel, though, will depend to a large
> extent on the nature of your local storage. If you have a single hard drive,
> reading in parallel might actually be slower than reading serially because
> it means a lot of random disk accesses.
>
> Regards,
> Christoph
>
> -----Original Message-----
> From: Mapred Learn [mailto:mapred.learn@gmail.com]
> Sent: Monday, 20 June 2011 06:02
> To: mapreduce-user@hadoop.apache.org; cdh-user@cloudera.org
> Subject: How to split a big file in HDFS by size
>
> Hi,
> I am trying to upload text files that are 60 GB or more in size.
> I want to split these files into smaller files of, say, 1 GB each so that I
> can run further map-red jobs on them.
>
> Does anybody have an idea how I can do this?
> Thanks a lot in advance! Any ideas are greatly appreciated!
>
> -JJ
>

AW: How to split a big file in HDFS by size

Posted by Christoph Schmitz <Ch...@1und1.de>.
JJ,

uploading 60 GB single-threaded (i.e. hadoop fs -copyFromLocal etc.) will be slow. If possible, try to get the files in smaller chunks where they are created, and upload them in parallel with a simple MapReduce job that only passes the data through (i.e. uses the standard Mapper and Reducer classes). This job should read from your local input directory and output into the HDFS.
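
A rough sketch of such a pass-through job (untested; the input and output paths
are placeholders, and the input directory has to be reachable from the task
nodes, e.g. a file:// path on shared storage). It uses a one-line mapper instead
of the stock identity Mapper only so that the byte-offset keys are not written
into the output:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PassThroughUpload {

  public static class PassMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(NullWritable.get(), value);   // drop the byte-offset key, keep the line
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "pass-through upload");
    job.setJarByClass(PassThroughUpload.class);
    job.setMapperClass(PassMapper.class);
    job.setNumReduceTasks(0);                  // map-only: output goes straight to the output dir
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. file:///mnt/shared/input
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /data/imported
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}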

If you cannot split the 60 GB where they are created, IMHO there is not much you can do. If you have a file format with, say, fixed length records, you could try to create your own InputFormat that splits the file logically without creating the actual splits locally (which would be too costly, I assume). 

The performance of reading in parallel, though, will depend to a large extent on the nature of your local storage. If you have a single hard drive, reading in parallel might actually be slower than reading serially because it means a lot of random disk accesses.

Regards,
Christoph

-----Original Message-----
From: Mapred Learn [mailto:mapred.learn@gmail.com]
Sent: Monday, 20 June 2011 06:02
To: mapreduce-user@hadoop.apache.org; cdh-user@cloudera.org
Subject: How to split a big file in HDFS by size

Hi,
I am trying to upload text files that are 60 GB or more in size.
I want to split these files into smaller files of, say, 1 GB each so that I can run further map-red jobs on them.

Does anybody have an idea how I can do this?
Thanks a lot in advance! Any ideas are greatly appreciated!

-JJ