Posted to users@kafka.apache.org by Andy Davidson <An...@SantaCruzIntegration.com> on 2016/07/28 02:10:42 UTC

performance problem when reading lots of small files created by spark streaming.

I have a relatively small data set; however, it is split into many small JSON
files, each roughly 4 KB to 400 KB in size.
This is probably a very common issue for anyone using Spark Streaming. My
streaming app works fine, but my batch application takes several hours
to run.

All I am doing is calling count(). Currently I am reading the files directly
from S3. In the application UI it looks like Spark is blocked, probably on
I/O. Adding more workers and memory does not improve performance.
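
For concreteness, this is roughly what the batch job does (a minimal PySpark
sketch; the bucket path and app name are placeholders, not the real job):

    # Rough sketch of the batch read + count; the S3 path is hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("count-small-json").getOrCreate()

    # With ~200K small input files, Spark ends up with a huge number of
    # tiny partitions (~200K here), so most of the time goes into per-file
    # overhead rather than into the count itself.
    df = spark.read.json("s3a://my-bucket/streaming-output/*.json")
    print(df.count())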

I am able to copy the files from S3 to a worker relatively quickly, so I do
not think raw S3 read time is the problem.

In the past, when I had similar data sets stored on HDFS, I was able to use
coalesce() to reduce the number of partitions from 200K to 30, which made a
big improvement in processing time. However, when I read from S3, coalesce()
does not improve performance.
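
For reference, the coalesce() usage that helped on HDFS looked roughly like
this (a sketch using the same placeholder path; 30 is the partition count
from the HDFS case):

    # Hypothetical sketch: collapse ~200K input partitions down to 30.
    # coalesce() merges partitions without a full shuffle, but every
    # underlying S3 object still has to be opened and read, which may be
    # why it helps less here than it did on HDFS.
    df = spark.read.json("s3a://my-bucket/streaming-output/*.json")
    print(df.coalesce(30).count())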

I tried copying the files to a local file system and then using 'hadoop fs
-put' to copy them to HDFS; however, this has been running for several hours
and is nowhere near completion. It appears HDFS does not deal with small
files well.

I am considering copying the files from S3 to a local file system on one of
my workers, concatenating them into a few much larger files, and then using
'hadoop fs -put' to move those to HDFS. Do you think this would improve the
count() performance?
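
If it helps, here is a rough sketch of the concatenation step in plain Python
(paths and chunk size are made up; this assumes each small file is
newline-delimited JSON ending with a newline, so simple byte concatenation
still yields files Spark can read):

    # Hypothetical sketch: merge many small JSON-lines files into a few
    # larger ones, which can then be pushed to HDFS with 'hadoop fs -put'.
    import glob
    import shutil

    CHUNK = 5000  # number of small files per merged file (needs tuning)
    files = sorted(glob.glob("/data/streaming-output/*.json"))

    for i in range(0, len(files), CHUNK):
        merged = "/data/merged/part-%05d.json" % (i // CHUNK)
        with open(merged, "wb") as out:
            for name in files[i:i + CHUNK]:
                with open(name, "rb") as src:
                    shutil.copyfileobj(src, out)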

Does anyone know of heuristics for determining the number or size of the
concatenated files?

Thanks in advance

Andy




Re: performance problem when reading lots of small files created by spark streaming.

Posted by Andy Davidson <An...@SantaCruzIntegration.com>.
Oops, sorry, wrong mailing list.

From:  Andrew Davidson <An...@SantaCruzIntegration.com>
Date:  Wednesday, July 27, 2016 at 7:10 PM
To:  "users@kafka.apache.org" <us...@kafka.apache.org>
Subject:  performance problem when reading lots of small files created by
spark streaming.
