Posted to dev@spark.apache.org by Nezih Yigitbasi <ny...@netflix.com.INVALID> on 2015/11/26 18:43:38 UTC

question about combining small parquet files

Hi Spark people,
I have a Hive table that has a lot of small parquet files and I am
creating a data frame out of it to do some processing, but since I have a
large number of splits/files my job creates a lot of tasks, which I don't
want. Basically what I want is the same functionality that Hive provides,
that is, to combine these small input splits into larger ones by specifying
a max split size setting. Is this currently possible with Spark?

I looked at coalesce(), but with coalesce I can only control the number
of output files, not their sizes. And since the total input dataset size
can vary significantly in my case, I cannot just use a fixed partition
count, as the size of each output file can get very large. I then looked
for a way to get the total input size from an RDD to come up with some
heuristic to set the partition count, but I couldn't find any way to do it
(without modifying the Spark source).
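
Just to make the heuristic concrete, what I'd like to avoid is hand-rolling
something like the sketch below directly against HDFS (the table name, path,
and target partition size are made-up placeholders):

  import org.apache.hadoop.fs.{FileSystem, Path}

  // Made-up table location and target partition size.
  val tablePath = new Path("hdfs:///warehouse/my_db.db/my_table")
  val targetBytesPerPartition = 128L * 1024 * 1024

  // Sum the bytes under the table directory and derive a partition count.
  val fs = FileSystem.get(sc.hadoopConfiguration)
  val totalBytes = fs.getContentSummary(tablePath).getLength
  val numPartitions = math.max(1, (totalBytes / targetBytesPerPartition).toInt)

  val df = sqlContext.table("my_db.my_table").coalesce(numPartitions)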

Any help is appreciated.

Thanks,
Nezih

PS: this email is the same as my previous one; I learned that the earlier
message ended up in spam for many people since I sent it through Nabble.
Sorry for the double post.

Re: question about combining small parquet files

Posted by Sabarish Sasidharan <sa...@manthan.com>.
You could use the number of input files to determine the number of output
partitions. This assumes your input file sizes are deterministic.

Otherwise, you could also persist the RDD and then determine its size using
the APIs.
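
A rough Scala sketch of both ideas, where the table path, the files-per-partition
factor (20), the 128 MB target, and the storage level are all just assumptions:

  import org.apache.hadoop.fs.{FileSystem, Path}
  import org.apache.spark.storage.StorageLevel

  val fs = FileSystem.get(sc.hadoopConfiguration)

  // Idea 1: count the input files and pack a fixed number of them per partition.
  val files = fs.listStatus(new Path("hdfs:///warehouse/my_db.db/my_table"))
    .filter(s => s.isFile && !s.getPath.getName.startsWith("_"))
  val partitionsByFileCount = math.max(1, files.length / 20)  // ~20 files each

  // Idea 2: persist the RDD, materialize it, and read its size from the storage info.
  val rdd = sqlContext.table("my_db.my_table").rdd
    .persist(StorageLevel.MEMORY_AND_DISK)
  rdd.count()
  val cachedBytes = sc.getRDDStorageInfo.find(_.id == rdd.id)
    .map(info => info.memSize + info.diskSize).getOrElse(0L)
  val partitionsBySize = math.max(1, (cachedBytes / (128L * 1024 * 1024)).toInt)

Keep in mind the cached size reflects Spark's in-memory representation, which is
usually larger than the compressed parquet on disk.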

Regards
Sab
On 26-Nov-2015 11:13 pm, "Nezih Yigitbasi" <ny...@netflix.com.invalid>
wrote:

> Hi Spark people,
> I have a Hive table that has a lot of small parquet files and I am
> creating a data frame out of it to do some processing, but since I have a
> large number of splits/files my job creates a lot of tasks, which I don't
> want. Basically what I want is the same functionality that Hive provides,
> that is, to combine these small input splits into larger ones by specifying
> a max split size setting. Is this currently possible with Spark?
>
> I looked at coalesce(), but with coalesce I can only control the number
> of output files, not their sizes. And since the total input dataset size
> can vary significantly in my case, I cannot just use a fixed partition
> count, as the size of each output file can get very large. I then looked
> for a way to get the total input size from an RDD to come up with some
> heuristic to set the partition count, but I couldn't find any way to do it
> (without modifying the Spark source).
>
> Any help is appreciated.
>
> Thanks,
> Nezih
>
> PS: this email is the same as my previous one; I learned that the earlier
> message ended up in spam for many people since I sent it through Nabble.
> Sorry for the double post.
>

Re: question about combining small parquet files

Posted by Nezih Yigitbasi <ny...@netflix.com.INVALID>.
This looks interesting, thanks Ruslan. But compaction with Hive is as
simple as an INSERT OVERWRITE statement, since Hive supports
CombineFileInputFormat. Is it possible to do the same with Spark?
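
The closest thing I've come up with on the Spark side is an explicit rewrite,
where I still have to pick the partition count by hand (the table name, output
path, and the count of 32 below are made up):

  // Hypothetical names; 32 is an arbitrary, hand-picked partition count.
  sqlContext.table("my_db.my_table")
    .coalesce(32)
    .write.mode("overwrite")
    .parquet("hdfs:///warehouse/my_db.db/my_table_compacted")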

On Thu, Nov 26, 2015 at 9:47 AM, Ruslan Dautkhanov <da...@gmail.com>
wrote:

> An interesting approach to compacting small files was discussed recently:
>
> http://blog.cloudera.com/blog/2015/11/how-to-ingest-and-query-fast-data-with-impala-without-kudu/
>
>
> AFAIK Spark supports views too.
>
>
> --
> Ruslan Dautkhanov
>
> On Thu, Nov 26, 2015 at 10:43 AM, Nezih Yigitbasi <
> nyigitbasi@netflix.com.invalid> wrote:
>
>> Hi Spark people,
>> I have a Hive table that has a lot of small parquet files and I am
>> creating a data frame out of it to do some processing, but since I have a
>> large number of splits/files my job creates a lot of tasks, which I don't
>> want. Basically what I want is the same functionality that Hive provides,
>> that is, to combine these small input splits into larger ones by specifying
>> a max split size setting. Is this currently possible with Spark?
>>
>> I looked at coalesce(), but with coalesce I can only control the number
>> of output files, not their sizes. And since the total input dataset size
>> can vary significantly in my case, I cannot just use a fixed partition
>> count, as the size of each output file can get very large. I then looked
>> for a way to get the total input size from an RDD to come up with some
>> heuristic to set the partition count, but I couldn't find any way to do it
>> (without modifying the Spark source).
>>
>> Any help is appreciated.
>>
>> Thanks,
>> Nezih
>>
>> PS: this email is the same as my previous one; I learned that the earlier
>> message ended up in spam for many people since I sent it through Nabble.
>> Sorry for the double post.
>>
>
>

Re: question about combining small parquet files

Posted by Ruslan Dautkhanov <da...@gmail.com>.
An interesting approach to compacting small files was discussed recently:
http://blog.cloudera.com/blog/2015/11/how-to-ingest-and-query-fast-data-with-impala-without-kudu/


AFAIK Spark supports views too.
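
A hedged sketch of that pattern in Spark SQL, with made-up table and view names:
expose a union of the compacted base table and the small "recent" table as one
logical table, and periodically fold the small files into the base.

  // Hypothetical tables following the blog post's view-over-two-tables idea.
  val base = sqlContext.table("my_db.events_base")      // large, compacted parquet files
  val recent = sqlContext.table("my_db.events_recent")  // small, freshly ingested files
  base.unionAll(recent).registerTempTable("events")     // query "events" as a single table

  // Periodic compaction: append the small files' rows into the base table
  // (truncating the recent table afterwards would be a separate step).
  sqlContext.sql("INSERT INTO TABLE my_db.events_base SELECT * FROM my_db.events_recent")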


-- 
Ruslan Dautkhanov

On Thu, Nov 26, 2015 at 10:43 AM, Nezih Yigitbasi <
nyigitbasi@netflix.com.invalid> wrote:

> Hi Spark people,
> I have a Hive table that has a lot of small parquet files and I am
> creating a data frame out of it to do some processing, but since I have a
> large number of splits/files my job creates a lot of tasks, which I don't
> want. Basically what I want is the same functionality that Hive provides,
> that is, to combine these small input splits into larger ones by specifying
> a max split size setting. Is this currently possible with Spark?
>
> I looked at coalesce(), but with coalesce I can only control the number
> of output files, not their sizes. And since the total input dataset size
> can vary significantly in my case, I cannot just use a fixed partition
> count, as the size of each output file can get very large. I then looked
> for a way to get the total input size from an RDD to come up with some
> heuristic to set the partition count, but I couldn't find any way to do it
> (without modifying the Spark source).
>
> Any help is appreciated.
>
> Thanks,
> Nezih
>
> PS: this email is the same as my previous one; I learned that the earlier
> message ended up in spam for many people since I sent it through Nabble.
> Sorry for the double post.
>