Posted to user@spark.apache.org by innowireless TaeYun Kim <ta...@innowireless.co.kr> on 2014/07/08 04:38:51 UTC

Help for the large number of the input data files

Hi,

 

I need advice on the best practice for implementing this.

The operating environment is as follows:

 

- Log data files arrive irregularly.

- The size of a log data file ranges from 3.9 KB to 8.5 MB; the average is
about 1 MB.

- The number of records in a data file ranges from 13 to 22,000 lines; the
average is about 2,700 lines.

- Each data file must be post-processed before aggregation.

- The post-processing algorithm can change.

- Post-processed files are managed separately from the original data files,
since the post-processing algorithm might change.

- Daily aggregation is performed. Every post-processed data file must be
filtered record by record, and aggregates (average, max, min, ...) are
calculated (see the sketch after this list).

- Since the aggregation is fine-grained, the number of records after
aggregation is not that small; it can be about half the number of the
original records.

- At any given point, there can be about 200,000 post-processed files.

- It must be possible to delete each data file individually.
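
To make the aggregation step concrete, here is a minimal sketch. It assumes
a hypothetical record layout of comma-separated "key,value" lines and uses a
placeholder path and filter; the real format is more involved.

  // Hypothetical record layout: one "key,value" pair per line; path is an assumption.
  val records = sc.textFile("postprocessed/2014-07-07/*")
    .map(_.split(","))
    .filter(_.length == 2)                          // placeholder record filter
    .map(fields => (fields(0), fields(1).toDouble))

  // Per-key count/sum/max/min in one pass, then derive the average.
  val aggregated = records
    .mapValues(v => (1L, v, v, v))                  // (count, sum, max, min)
    .reduceByKey { case ((c1, s1, mx1, mn1), (c2, s2, mx2, mn2)) =>
      (c1 + c2, s1 + s2, math.max(mx1, mx2), math.min(mn1, mn2))
    }
    .mapValues { case (n, sum, mx, mn) => (sum / n, mx, mn) }  // (avg, max, min)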

 

In a test, I tried to process 160,000 post-processed files with Spark,
starting with sc.textFile() and a glob path; it failed with an OutOfMemory
exception in the driver process.
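
For reference, the read looked roughly like the line below (the path is a
made-up example):

  // Hypothetical glob over ~160,000 small files; roughly one partition per small file.
  val postProcessed = sc.textFile("hdfs:///postprocessed/*/*.log")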

 

What is the best practice for handling this kind of data?

Should I use HBase instead of plain files to store the post-processed data?

 

Thank you.

 


Re: Help for the large number of the input data files

Posted by Xiangrui Meng <me...@gmail.com>.
You can either use sc.wholeTextFiles and then a flatMap to reduce the
number of partitions, or give more memory to the driver process by
using --driver-memory 20g and then call RDD.repartition(small number)
after you load the data in. -Xiangrui
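
A minimal sketch of those two options, with hypothetical paths and an
arbitrary target partition count:

  // Option 1: wholeTextFiles packs many small files into each partition and
  // yields (path, content) pairs; flatMap splits each file back into records.
  val records1 = sc.wholeTextFiles("hdfs:///postprocessed/*")  // path is an assumption
    .flatMap { case (_, content) => content.split("\n") }

  // Option 2: keep sc.textFile, launch with a larger driver heap, e.g.
  //   spark-submit --driver-memory 20g ...
  // and collapse to a small number of partitions right after loading.
  val records2 = sc.textFile("hdfs:///postprocessed/*")
    .repartition(200)  // "small number"; 200 is only an example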
