Posted to user@spark.apache.org by David Rosenstrauch <da...@darose.net> on 2014/07/15 23:58:39 UTC

Spark misconfigured? Small input split sizes in shark query

Got a spark/shark cluster up and running recently, and have been kicking 
the tires on it.  However, I've been wrestling with an issue that I'm 
not quite sure how to solve.  (Or, at least, not quite sure about the 
correct way to solve it.)

I ran a simple Hive query (select count ...) against a dataset of .tsv 
files stored in S3, and then ran the same query on shark for comparison. 
But the shark query took 3x as long.

After a bit of digging, I found out what was happening: with the hive 
query each map task was reading an input split consisting of 2 entire 
files from the dataset (approximately 180MB each), while with shark each 
task was reading an input split consisting of a 64MB chunk of one of the 
files.  That explained the slowdown: since the shark query had to open 
each S3 file 3 separate times (and had to run 3x as many tasks), it made 
sense that it took much longer.
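
(For what it's worth, if I'm reading FileInputFormat right, the split 
size it computes is

     splitSize = max(minSize, min(maxSize, blockSize))

and I believe the s3n filesystem reports a 64MB block size by default, 
which would explain exactly the 64MB chunks I'm seeing.)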

After much experimentation I was finally able to work around this issue 
by overriding the value of mapreduce.input.fileinputformat.split.minsize 
in my hive-site.xml file, bumping it up to 512MB (see the snippet after 
the list below).  However, I don't feel like this is really the "right" 
way to solve the issue:

a) That param is normally set to 1.  It doesn't seem right that I should 
need to override it - or set it to a value as large as 512MB.

b) We only seem to experience this issue on an existing Hadoop cluster 
that we've deployed spark/shark onto.  When we run the same query on a 
new cluster launched via the spark ec2 scripts, the number of splits 
seems to get calculated correctly - without the need for overriding that 
param.  This leads me to believe we may just have something misconfigured 
on our existing cluster.

c) This seems like an error-prone way to overcome the issue.  512MB is 
an arbitrary value, and should I happen to run a query against files 
that are larger than 512MB, I'll run into the chunking issue again.
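
For reference, the override I ended up with looks roughly like this in 
hive-site.xml (536870912 bytes = 512MB; on older Hadoop versions the 
equivalent key is mapred.min.split.size):

   <property>
     <name>mapreduce.input.fileinputformat.split.minsize</name>
     <value>536870912</value>
   </property>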

So my gut tells me there's a better way to solve this issue - i.e., 
somehow configuring spark so that the input splits it generates won't 
chunk the input files.  Anyone know how to accomplish this / what I 
might have misconfigured?
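
(To be concrete about what I mean by "configuring spark": when I read 
the same files with plain Spark, I believe the equivalent knob is the 
Hadoop configuration hanging off the SparkContext.  A minimal sketch, 
e.g. from spark-shell where sc already exists, with the bucket/path as 
a placeholder:

   // Ask FileInputFormat for splits of at least 512MB (536870912 bytes);
   // older Hadoop versions use the key "mapred.min.split.size" instead.
   sc.hadoopConfiguration.set(
     "mapreduce.input.fileinputformat.split.minsize", "536870912")

   // "s3n://my-bucket/my-dataset/" stands in for the real dataset path.
   val count = sc.textFile("s3n://my-bucket/my-dataset/*.tsv").count()

But that still hard-codes a size, so it has the same drawback as the 
hive-site.xml override.)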

Thanks,

DR

Small input split sizes

Posted by David Rosenstrauch <da...@darose.net>.
I'm still bumping up against this issue: spark (and shark) are breaking 
my inputs into 64MB-sized splits.  Anyone know where/how to configure 
spark so that it either doesn't split the inputs, or at least uses a 
much larger split size?  (E.g., 512MB.)
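
(Put differently: I'd like to avoid having to prefix every query with an 
override like the one below - assuming shark honors Hive-style SET 
commands, and with my_tsv_table as a placeholder table name - since 
512MB is still just a guess at the largest file size:

   SET mapreduce.input.fileinputformat.split.minsize=536870912;
   SELECT COUNT(*) FROM my_tsv_table;
)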

Thanks,

DR
