Posted to user@spark.apache.org by DanteSama <ch...@sojo.com> on 2014/09/04 19:40:49 UTC

SchemaRDD - Parquet - "insertInto" makes many files

It seems that running insertInto on a SchemaRDD with a ParquetRelation
creates an individual file for each item in the RDD. Sometimes a file
contains multiple rows, and sometimes it contains only the column headers.

My question is: is it possible to have it write the entire RDD as a single
file, but still have it associated and registered as a table? Right now I'm
doing the following:

// Create the Parquet "file"
createParquetFile[T]("hdfs://somewhere/folder").registerAsTable("table")

val rdd: RDD[T] = ??? // the data to insert

// Insert the RDD's items into the table
createSchemaRDD[T](rdd).insertInto("table")

However, this ends up with a single file per row, named
"part-r-${partition + offset}.parquet" (naming snagged from
ParquetTableOperations > AppendingParquetOutputFormat).

I know that I can create a single Parquet file from an RDD by using
SchemaRDD.saveAsParquetFile, but that prevents me from loading the table
once and staying aware of any subsequent changes.

I'm fine with each insertInto call creating a new Parquet file in the table
directory, but a file per row is a little over the top... Perhaps there are
Hadoop configs that I'm missing?





RE: SchemaRDD - Parquet - "insertInto" makes many files

Posted by chutium <te...@gmail.com>.
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark#HiveonSpark-NumberofTasks

It would be great if something like hive.exec.reducers.bytes.per.reducer
could be implemented.

One idea: get the total size of all the target blocks, then set the number
of partitions accordingly, as in the sketch below.
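
A minimal sketch of that idea, assuming the input is a plain directory of
files on HDFS and a fixed target size per partition is acceptable;
numPartitionsFor and targetBytesPerPartition are made-up names for
illustration, not existing Spark or Hive settings:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext

// Hypothetical helper: derive a partition count from the total size of the
// input files, similar in spirit to hive.exec.reducers.bytes.per.reducer.
def numPartitionsFor(sc: SparkContext, inputDir: String,
                     targetBytesPerPartition: Long = 256L * 1024 * 1024): Int = {
  val fs = FileSystem.get(sc.hadoopConfiguration)
  // Sum the sizes of the files directly under the input directory
  val totalBytes = fs.listStatus(new Path(inputDir)).map(_.getLen).sum
  math.max(1, math.ceil(totalBytes.toDouble / targetBytesPerPartition).toInt)
}

An insert could then do rdd.repartition(numPartitionsFor(sc, sourceDir))
before building the SchemaRDD and calling insertInto, which would bound the
number of files written per insert.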





RE: SchemaRDD - Parquet - "insertInto" makes many files

Posted by "Cheng, Hao" <ha...@intel.com>.
Hive can launch a follow-up job with a strategy to merge the small files; we
can probably do that in a future release as well. A hand-rolled version of
such a merge pass is sketched below.
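
A rough sketch of doing such a merge by hand with the Spark SQL APIs of that
era (SQLContext.parquetFile plus the saveAsParquetFile mentioned earlier in
the thread); the method name and the input/output paths are placeholders,
and this is not what Hive's own merge step does:

import org.apache.spark.sql.SQLContext

// Hypothetical manual merge pass: read the fragmented table directory back,
// shrink it to a few partitions, and rewrite it to a new location.
def mergeParquet(sqlContext: SQLContext, inputDir: String,
                 outputDir: String, numFiles: Int = 1): Unit = {
  val merged = sqlContext.parquetFile(inputDir).coalesce(numFiles)
  merged.saveAsParquetFile(outputDir)  // writes one part file per partition
}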

From: Michael Armbrust [mailto:michael@databricks.com]
Sent: Friday, September 05, 2014 8:59 AM
To: DanteSama
Cc: user@spark.incubator.apache.org
Subject: Re: SchemaRDD - Parquet - "insertInto" makes many files

Exactly where the work is done depends on the RDD in question. I believe that
if you do a repartition(1) instead of a coalesce, it will force a shuffle, so
the work will still be done in a distributed fashion, and then a single node
will read the shuffled data and write it out.

If you want to write to a single Parquet file, however, you will at some
point need to block on a single node.

On Thu, Sep 4, 2014 at 2:02 PM, DanteSama <ch...@sojo.com> wrote:
Yep, that worked out. Does this solution have any performance implications
past all the work being done on (probably) 1 node?





Re: SchemaRDD - Parquet - "insertInto" makes many files

Posted by Michael Armbrust <mi...@databricks.com>.
Exactly where the work is done depends on the RDD in question. I believe that
if you do a repartition(1) instead of a coalesce, it will force a shuffle, so
the work will still be done in a distributed fashion, and then a single node
will read the shuffled data and write it out.

If you want to write to a single Parquet file, however, you will at some
point need to block on a single node.
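
A minimal sketch of the repartition(1) variant, reusing the rdd and "table"
names from the original post and assuming, as this suggestion implies, that
repartition on a SchemaRDD still returns a SchemaRDD:

// repartition(1) forces a shuffle: the upstream stages run in parallel, and
// a single post-shuffle task reads the shuffled rows and writes the one file.
createSchemaRDD[T](rdd).repartition(1).insertInto("table")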


On Thu, Sep 4, 2014 at 2:02 PM, DanteSama <ch...@sojo.com> wrote:

> Yep, that worked out. Does this solution have any performance implications
> past all the work being done on (probably) 1 node?

Re: SchemaRDD - Parquet - "insertInto" makes many files

Posted by DanteSama <ch...@sojo.com>.
Yep, that worked out. Does this solution have any performance implications
past all the work being done on (probably) 1 node?





Re: SchemaRDD - Parquet - "insertInto" makes many files

Posted by Michael Armbrust <mi...@databricks.com>.
Try doing coalesce(1) on the RDD before insertInto, as in the sketch below.
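
Applied to the snippet from the original post (same rdd and "table" names;
this assumes coalesce on a SchemaRDD returns a SchemaRDD, which the follow-up
in this thread confirms worked):

// Shrink to a single partition so insertInto writes one part file
// instead of one file per partition.
createSchemaRDD[T](rdd).coalesce(1).insertInto("table")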


On Thu, Sep 4, 2014 at 10:40 AM, DanteSama <ch...@sojo.com> wrote:

> It seems that running insertInto on a SchemaRDD with a ParquetRelation
> creates an individual file for each item in the RDD. Sometimes a file
> contains multiple rows, and sometimes it contains only the column headers.
>
> My question is: is it possible to have it write the entire RDD as a single
> file, but still have it associated and registered as a table? Right now I'm
> doing the following:
>
> // Create the Parquet "file"
> createParquetFile[T]("hdfs://somewhere/folder").registerAsTable("table")
>
> val rdd: RDD[T] = ??? // the data to insert
>
> // Insert the RDD's items into the table
> createSchemaRDD[T](rdd).insertInto("table")
>
> However, this ends up with a single file per row, named
> "part-r-${partition + offset}.parquet" (naming snagged from
> ParquetTableOperations > AppendingParquetOutputFormat).
>
> I know that I can create a single Parquet file from an RDD by using
> SchemaRDD.saveAsParquetFile, but that prevents me from loading the table
> once and staying aware of any subsequent changes.
>
> I'm fine with each insertInto call creating a new Parquet file in the table
> directory, but a file per row is a little over the top... Perhaps there are
> Hadoop configs that I'm missing?