Posted to user@spark.apache.org by ag007 <ag...@mac.com> on 2014/11/03 08:42:40 UTC
Parquet files are only 6-20MB in size?
Hi there,
I have a pySpark job that simply takes a tab-separated CSV and writes it
out as a Parquet file. The code is based on the SQL write-parquet example
(using a different inferred schema, with 35 columns). The input files range
from 100MB to 12GB.
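Roughly, the job looks like the following sketch (the paths and column
names here are made up; the real schema has 35 inferred columns):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="TsvToParquet")
    sqlContext = SQLContext(sc)

    # Read the tab-separated input and split each line into fields.
    lines = sc.textFile("hdfs:///data/input.tsv")
    parts = lines.map(lambda l: l.split("\t"))

    # Dicts let inferSchema() work out the column names and types.
    rows = parts.map(lambda p: {"ts": p[0], "user": p[1], "value": p[2]})

    schemaRdd = sqlContext.inferSchema(rows)
    schemaRdd.saveAsParquetFile("hdfs:///data/output.parquet")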
I have tried different block sizes from 10MB through to 1GB, and I have
tried different parallelism settings. Across the part files, the output
shows about 1:5 compression overall.
I am trying to get large Parquet files. Having this many small files will
cause problems for my NameNode; I already have over 500,000 of them.
Your assistance would be greatly appreciated.
cheers,
Ag
PS: Another option would be a Parquet concat tool, if one exists; I
couldn't find one. I understand that such a tool would have to adjust the
footer.
Re: Parquet files are only 6-20MB in size?
Posted by ag007 <ag...@mac.com>.
Davies, that's exactly what I was after :) Awesome, thanks.
Re: Parquet files are only 6-20MB in size?
Posted by Davies Liu <da...@databricks.com>.
Before saveAsParquetFile(), you can call coalesce(N); then you will have
N files, and it will keep the order as before (repartition() will not).
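For example, continuing the sketch from the original post (path made up):

    # coalesce(N) merges the existing partitions without a shuffle, so
    # row order is preserved and you get N output part files (here N=1).
    schemaRdd.coalesce(1).saveAsParquetFile("hdfs:///data/output.parquet")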
Re: Parquet files are only 6-20MB in size?
Posted by ag007 <ag...@mac.com>.
Thanks Akhil,
Am I right in saying that repartition will spread the data randomly, so I
lose chronological order?
I really just want the CSV converted to Parquet in the same order it came
in. If I set repartition to 1, will the order still be random?
cheers,
Ag
Re: Parquet files are only 6-20MB in size?
Posted by Akhil Das <ak...@sigmoidanalytics.com>.
Before doing saveAsParquetFile, you can call repartition and provide a
decent number; that number will be the total number of output files
generated.
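For example, using the schemaRdd from the original post (the partition
count 8 is just illustrative):

    # repartition(8) shuffles the data into 8 partitions, so the save
    # produces 8 part files. Note that the shuffle does not preserve
    # the original input order.
    schemaRdd.repartition(8).saveAsParquetFile("hdfs:///data/output.parquet")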
Thanks
Best Regards