Posted to user@pig.apache.org by jagaran das <ja...@yahoo.co.in> on 2011/07/16 08:17:41 UTC

Hadoop Production Issue


Hi,

Due to requirements in our current production CDH3 cluster, we need to copy around 11,520 small files (12 GB total) to the cluster for one application.
We have 20 applications like this that would run in parallel.

So one set would have 11,520 files totalling 12 GB,
and we would have 15 such sets running in parallel.

The total SLA for the pipeline, from copy through Pig aggregation to copy-to-local and SQL load, is 15 minutes.

What we do:

1. Merge files so that we get rid of the small files. This is a huge time hit; do we have any other option?
2. Copy to the cluster
3. Execute the Pig job
4. Copy to local
5. SQL loader

Can we perform the merge and the copy to the cluster from a host other than the NameNode?
We want an out-of-cluster machine running a Java process that would:
1. Run periodically
2. Merge files
3. Copy to the cluster (a sketch of such a process follows below)
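
Roughly, the process we have in mind would look like this. This is only a
sketch: the NameNode URI and paths are placeholders, and it assumes the
small files can simply be concatenated (e.g. line-oriented text):

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    // Merges a directory of small local files into one HDFS file.
    // Meant to run on an edge node that can reach the NameNode.
    public class MergeAndUpload {
        public static void main(String[] args) throws Exception {
            URI nn = URI.create("hdfs://my-nn:8020");   // placeholder NN URI
            File localDir = new File("/data/incoming"); // placeholder dir
            Path dest = new Path("/user/app1/merged/batch-"
                    + System.currentTimeMillis());

            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(nn, conf);

            FSDataOutputStream out = fs.create(dest);
            try {
                for (File f : localDir.listFiles()) {
                    InputStream in = new FileInputStream(f);
                    try {
                        // Stream this file's bytes onto the single HDFS file.
                        IOUtils.copyBytes(in, out, conf, false);
                    } finally {
                        in.close();
                    }
                }
            } finally {
                out.close();
            }
        }
    }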

Secondly, can we append to an existing file in the cluster?

Please provide your thoughts, as maintaining the SLA is becoming tough.

Regards,
Jagaran 

Re: Hadoop Production Issue

Posted by Jeremy Hanna <je...@gmail.com>.
One thing that we use is filecrush to merge small files below a threshold.  It works pretty well.
http://www.jointhegrid.com/hadoop_filecrush/index.jsp



Re: Hadoop Production Issue

Posted by jagaran das <ja...@yahoo.co.in>.
Our config:

72 GB RAM, 4 quad-core processors, 1.8 TB local disk

10-node CDH3 cluster



Re: Hadoop Production Issue

Posted by jagaran das <ja...@yahoo.co.in>.
Thanks Dmitriy.

1. So I can write a Pig job that merges the files.
2. But that Pig job would itself still have to read all the small files; wouldn't that hurt performance?

3. For the copy: if we want to run the copy command, do we need Hadoop installed on that machine, or are you suggesting we use the Java API to invoke the copy?

Thanks a lot.

Regards,
JD



Re: Hadoop Production Issue

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
1) Correct.

2) Copy to the cluster from any machine, just have the config on the
classpath or specify the full path in your copy command
(hdfs://my-nn/path/to/destination).
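
For example (rough sketch; "my-nn" is a placeholder for your NameNode host,
and with the cluster config on the classpath a plain FileSystem.get(conf)
would resolve the default filesystem for you):

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CopyToCluster {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Point straight at the NameNode; only the Hadoop client jars
            // need to be on this machine, not a full cluster install.
            FileSystem fs = FileSystem.get(URI.create("hdfs://my-nn:8020"), conf);
            fs.copyFromLocalFile(new Path("/local/data/merged.dat"),
                                 new Path("/path/to/destination"));
        }
    }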




Re: Hadoop Production Issue

Posted by jagaran das <ja...@yahoo.co.in>.
OK, then:

1. Do we have to write a Pig job for the merging, or does Pig itself merge the inputs so that fewer mappers are invoked?

2. Can we copy to the cluster from a non-cluster machine, using the namespace URI of the NN? We could dedicate some well-provisioned boxes to do the merging, and then copy the result to the cluster over the network.

3. How is the performance of the filecrush tool?

We found that copying 12 GB of data for all 15 apps in parallel took 35 minutes:
we ran 15 copy-from-local jobs, each moving 12 GB of data.

Thanks,
JD



Re: Hadoop Production Issue

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Merging: it doesn't actually speed things up all that much; it reduces load
on the NameNode and speeds up job initialization somewhat. You don't have
to do it on the namenode itself, and you don't have to do the copying on
the NN either. In fact, don't run anything but the NameNode process on the
namenode.

Pig can transparently combine small input files into larger splits, so you
won't be stuck with 11K mappers.
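
The knobs for this, if you want to tune them, are the pig.splitCombination
and pig.maxCombinedSplitSize properties (both in the Pig 0.8 line that CDH3
ships, as far as I recall). A rough sketch of setting them from Java, with
hypothetical paths and query:

    import java.util.Properties;

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class CombinedSplitsJob {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Combine many small input files into each map task's split.
            props.setProperty("pig.splitCombination", "true");
            props.setProperty("pig.maxCombinedSplitSize", "134217728"); // 128 MB

            PigServer pig = new PigServer(ExecType.MAPREDUCE, props);
            pig.registerQuery("raw = LOAD '/user/app1/incoming' USING PigStorage(',');");
            pig.registerQuery("agg = GROUP raw ALL;");
            pig.store("agg", "/user/app1/output"); // runs the job
        }
    }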

Don't copy to local and then run SQL loader. Use Sqoop export, and load
directly from Hadoop.
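
A sketch of that step with Sqoop 1 (the CDH3-era version), driven from
Java; the JDBC URL, credentials, table, and export directory are all
placeholders, and this is equivalent to running the sqoop CLI with the
same arguments:

    import com.cloudera.sqoop.Sqoop;

    public class ExportToSql {
        public static void main(String[] args) {
            String[] sqoopArgs = {
                "export",                                     // tool name
                "--connect", "jdbc:oracle:thin:@db-host:1521:ORCL",
                "--username", "appuser",
                "--password", "secret",                       // placeholder
                "--table", "AGG_RESULTS",
                "--export-dir", "/user/app1/output",
            };
            // Parses the tool name from the first argument and runs it,
            // just as the command-line sqoop script would.
            System.exit(Sqoop.runTool(sqoopArgs));
        }
    }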

You cannot append to a file that already exists in the cluster. This
will be available in one of the coming Hadoop releases. You can
certainly create a new file in a directory, and load whole
directories.
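
In other words, instead of appending, write each new batch as its own
uniquely named file and point your loads at the directory. Rough sketch,
hypothetical paths:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class NewFilePerBatch {
        public static void main(String[] args) throws Exception {
            // Assumes the cluster config is on the classpath.
            FileSystem fs = FileSystem.get(new Configuration());
            // One new file per batch instead of an append to an existing one.
            Path batch = new Path("/user/app1/incoming/batch-"
                    + System.currentTimeMillis());
            FSDataOutputStream out = fs.create(batch);
            try {
                out.write("records for this batch\n".getBytes("UTF-8"));
            } finally {
                out.close();
            }
            // A later LOAD of '/user/app1/incoming' picks up every batch file.
        }
    }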

-D

>