Posted to user@spark.apache.org by KhajaAsmath Mohammed <md...@gmail.com> on 2017/10/29 15:03:04 UTC

Spark Streaming Small files in Hive

Hi,

I am using Spark Streaming to write data back into Hive with the below code
snippet:


eventHubsWindowedStream.map(x => EventContent(new String(x)))

      .foreachRDD(rdd => {

        // Get (or reuse) a Hive-enabled SparkSession for this micro-batch
        val sparkSession = SparkSession.builder.enableHiveSupport.getOrCreate

        import sparkSession.implicits._

        // Append this batch to the partitioned Hive table
        rdd.toDS.write.mode(org.apache.spark.sql.SaveMode.Append).insertInto(hiveTableName)

      })

The Hive table is partitioned by year, month, and day, so we end up getting
less data for some days, which in turn results in small files in Hive. Since
the data is being written as small files, there is a big performance hit on
Impala/Hive when reading it. Is there a way to merge files while inserting
data into Hive?

It would also be really helpful if anyone can provide suggestions on how to
design this in a better way. We cannot use HBase/Kudu in the current
scenario due to space issues with our clusters.

Thanks,

Asmath

Re: Spark Streaming Small files in Hive

Posted by Siva Gudavalli <gu...@yahoo.com.INVALID>.
Hello Asmath,

We had a similar challenge recently.

When you write back to Hive, you are creating files on HDFS, and how many depends on your batch window.
If you increase your batch window, let's say from 1 minute to 5 minutes, you will end up creating 5x fewer files.
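
For illustration, a minimal sketch of where the batch interval is set (the application name below is just a placeholder, and the EventHubs receiver setup itself does not change):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

// Each foreachRDD invocation now covers 5 minutes of data instead of 1,
// so each insert into Hive writes roughly 5x fewer, larger files.
val conf = new SparkConf().setAppName("eventhubs-to-hive")  // placeholder name
val ssc  = new StreamingContext(conf, Minutes(5))           // was Minutes(1)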

The other factor is your partitioning. For instance, if your Spark application is working with 5 partitions, you can repartition (or coalesce) to 1; this will again reduce the number of files by 5x.
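
A sketch of how that could look against the snippet from your original mail (coalesce is usually preferred over repartition here because it avoids a full shuffle):

import org.apache.spark.sql.{SaveMode, SparkSession}

eventHubsWindowedStream.map(x => EventContent(new String(x)))
  .foreachRDD { rdd =>
    val sparkSession = SparkSession.builder.enableHiveSupport.getOrCreate()
    import sparkSession.implicits._

    // Collapse the batch to a single Spark partition before writing, so each
    // micro-batch produces one file per Hive partition it touches.
    rdd.toDS
      .coalesce(1)
      .write
      .mode(SaveMode.Append)
      .insertInto(hiveTableName)
  }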

You can also create a staging table to hold the small files and, once a decent amount of data has accumulated, compact it into larger files and load those into your final Hive table.
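
A minimal sketch of that compaction step, assuming a staging table "events_staging" and a final table "events" with the same schema and year/month/day partitioning (both names are hypothetical); this would run as a separate periodic batch job, not inside the stream:

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder.enableHiveSupport.getOrCreate()

// Read everything accumulated in staging, merge it into a small number of
// files, and append it to the final table.
spark.table("events_staging")
  .coalesce(1)
  .write
  .mode(SaveMode.Append)
  .insertInto("events")

// After the load succeeds, drop or truncate the processed staging partitions
// (e.g. ALTER TABLE events_staging DROP PARTITION ...) so they are not loaded twice.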

Hope this helps.

Regards
Shiv


> On Oct 29, 2017, at 11:03 AM, KhajaAsmath Mohammed <md...@gmail.com> wrote:
> 
> [original message quoted above]