Posted to dev@spark.apache.org by James Barney <ja...@gmail.com> on 2016/02/23 15:05:31 UTC

ORC file writing hangs in pyspark

I'm trying to write an ORC file after running the FPGrowth algorithm on a
dataset of only around 2 GB. The algorithm performs well, and I can display
results if I take(n) the freqItemsets() of the result after converting it
to a DataFrame.

I'm using Spark 1.5.2 on HDP 2.3.4 and Python 3.4.2 on Yarn.

I get the results by querying a Hive table, also in ORC format, and running
a number of maps, joins, and filters on the data.

When the program attempts to write the files:
    result.write.orc('/data/staged/raw_result')
    size_1_buckets.write.orc('/data/staged/size_1_results')
    filter_size_2_buckets.write.orc('/data/staged/size_2_results')

The first path, /data/staged/raw_result, is created with a _temporary
folder, but the data is never written. The job hangs at this point,
apparently indefinitely.

Additionally, no logs are recorded or available for the jobs on the history
server.

What could be the problem?

Re: ORC file writing hangs in pyspark

Posted by James Barney <ja...@gmail.com>.
Thank you for the suggestions. We looked at the live Spark UI and YARN
application logs and found what we think is the issue: in Spark 1.5.2, the
FPGrowth algorithm doesn't require you to specify the number of partitions
for your input data. Without an explicit value, however, FPGrowth puts all
of its data into a single partition, so only one executor was responsible
for writing the ORC file from the resulting DataFrame. That is what was
causing the job to hang.

After specifying the number of partitions in FPGrowth, the write step
proceeds and finishes quickly.
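
In code, the change looks roughly like the sketch below. This assumes the
Spark 1.5 MLlib API; the names transactions and result and the partition
count of 32 are illustrative, not from our actual job:

    from pyspark.mllib.fpm import FPGrowth

    # transactions: a hypothetical RDD of item lists (one list per
    # transaction). Passing numPartitions explicitly keeps FPGrowth
    # from collapsing its output into a single partition.
    model = FPGrowth.train(transactions, minSupport=0.01, numPartitions=32)

    # Alternatively, repartition the result DataFrame before writing
    # so the ORC write is spread across multiple executors:
    result = result.repartition(32)
    result.write.orc('/data/staged/raw_result')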

Thank you again for the suggestions.

On Tue, Feb 23, 2016 at 9:28 PM, Zhan Zhang <zz...@hortonworks.com> wrote:

> Hi James,
>
> You can try writing with another format, e.g. Parquet, to see whether it
> is an ORC-specific issue or a more generic one.
>
> Thanks.
>
> Zhan Zhang
>
> On Feb 23, 2016, at 6:05 AM, James Barney <ja...@gmail.com> wrote:
>
> I'm trying to write an ORC file after running the FPGrowth algorithm on a
> dataset of only around 2 GB. The algorithm performs well, and I can
> display results if I take(n) the freqItemsets() of the result after
> converting it to a DataFrame.
>
> I'm using Spark 1.5.2 on HDP 2.3.4 and Python 3.4.2 on Yarn.
>
> I get the results by querying a Hive table, also in ORC format, and
> running a number of maps, joins, and filters on the data.
>
> When the program attempts to write the files:
>     result.write.orc('/data/staged/raw_result')
>     size_1_buckets.write.orc('/data/staged/size_1_results')
>     filter_size_2_buckets.write.orc('/data/staged/size_2_results')
>
> The first path, /data/staged/raw_result, is created with a _temporary
> folder, but the data is never written. The job hangs at this point,
> apparently indefinitely.
>
> Additionally, no logs are recorded or available for the jobs on the
> history server.
>
> What could be the problem?
>
>
>

Re: ORC file writing hangs in pyspark

Posted by Zhan Zhang <zz...@hortonworks.com>.
Hi James,

You can try writing with another format, e.g. Parquet, to see whether it is an ORC-specific issue or a more generic one.
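
For example, assuming the result DataFrame from your snippet (the output
path below is only an illustration):

    # Write the same DataFrame as Parquet; if this succeeds where the
    # ORC write hangs, the problem is ORC-specific.
    result.write.parquet('/data/staged/raw_result_parquet')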

Thanks.

Zhan Zhang

On Feb 23, 2016, at 6:05 AM, James Barney <ja...@gmail.com> wrote:

I'm trying to write an ORC file after running the FPGrowth algorithm on a dataset of only around 2 GB. The algorithm performs well, and I can display results if I take(n) the freqItemsets() of the result after converting it to a DataFrame.

I'm using Spark 1.5.2 on HDP 2.3.4 and Python 3.4.2 on Yarn.

I get the results by querying a Hive table, also in ORC format, and running a number of maps, joins, and filters on the data.

When the program attempts to write the files:
    result.write.orc('/data/staged/raw_result')
    size_1_buckets.write.orc('/data/staged/size_1_results')
    filter_size_2_buckets.write.orc('/data/staged/size_2_results')

The first path, /data/staged/raw_result, is created with a _temporary folder, but the data is never written. The job hangs at this point, apparently indefinitely.

Additionally, no logs are recorded or available for the jobs on the history server.

What could be the problem?


Re: ORC file writing hangs in pyspark

Posted by Jeff Zhang <zj...@gmail.com>.
Have you checked the live Spark UI and YARN application logs?
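
A quick sanity check to run alongside the UI, assuming the result
DataFrame from your snippet, is to print its partition count; a single
partition would mean one task is doing the entire write:

    # If this prints 1, the whole ORC write falls on one executor.
    print(result.rdd.getNumPartitions())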

On Tue, Feb 23, 2016 at 10:05 PM, James Barney <ja...@gmail.com>
wrote:

> I'm trying to write an ORC file after running the FPGrowth algorithm on a
> dataset of only around 2 GB. The algorithm performs well, and I can
> display results if I take(n) the freqItemsets() of the result after
> converting it to a DataFrame.
>
> I'm using Spark 1.5.2 on HDP 2.3.4 and Python 3.4.2 on Yarn.
>
> I get the results by querying a Hive table, also in ORC format, and
> running a number of maps, joins, and filters on the data.
>
> When the program attempts to write the files:
>     result.write.orc('/data/staged/raw_result')
>     size_1_buckets.write.orc('/data/staged/size_1_results')
>     filter_size_2_buckets.write.orc('/data/staged/size_2_results')
>
> The first path, /data/staged/raw_result, is created with a _temporary
> folder, but the data is never written. The job hangs at this point,
> apparently indefinitely.
>
> Additionally, no logs are recorded or available for the jobs on the
> history server.
>
> What could be the problem?
>



-- 
Best Regards

Jeff Zhang