Posted to dev@arrow.apache.org by Brian Wylie <br...@gmail.com> on 2017/09/08 18:36:17 UTC
spark error with reading parquet file created via pandas/pyarrow
Apologies if this isn't quite the right place to ask this question, but I
figured Wes/others might know right off the bat :)
Context:
- Mac OSX Laptop
- PySpark: 2.2.0
- PyArrow: 0.6.0
- Pandas: 0.19.2
Issue Explanation:
- I'm converting my Pandas dataframe to a Parquet file with code very
similar to http://wesmckinney.com/blog/python-parquet-update/
- My Pandas DataFrame has a datetime index: http_df.index.dtype = dtype('<M8[ns]')
- When loading the saved parquet file I get the error below
- If I remove that index everything works fine
ERROR:
- Py4JJavaError: An error occurred while calling o34.parquet.
: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0
in stage 0.0 (TID 0, localhost, executor driver):
org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64
(TIMESTAMP_MICROS);
Full Code to reproduce:
- https://github.com/Kitware/bat/blob/master/notebooks/Bro_to_Parquet.ipynb
Thanks in advance, also big fan of all this stuff... "be the chicken" :)
-Brian
Re: spark error with reading parquet file created via pandas/pyarrow
Posted by Wes McKinney <we...@gmail.com>.
The option
pq.write_table(..., flavor='spark')
made it into the 0.7.0 release
- Wes
Re: spark error with reading parquet file created via pandas/pyarrow
Posted by Julien Le Dem <ju...@gmail.com>.
The int96 deprecation is slowly bubbling up the stack. There are still discussions in Spark on how to make the change. So for now, even though it's deprecated, it is still used in some places. This should get resolved in the near future.
Julien
Re: spark error with reading parquet file created via pandas/pyarrow
Posted by Wes McKinney <we...@gmail.com>.
Turning on int96 timestamps is the solution right now. To save
yourself some typing, you could declare
parquet_options = {
    'compression': ...,
    'use_deprecated_int96_timestamps': True
}

pq.write_table(..., **parquet_options)
Re: spark error with reading parquet file created via pandas/pyarrow
Posted by Brian Wylie <br...@gmail.com>.
So, this is certainly good for future versions of Arrow. Do you have any
specific recommendations for a workaround currently?
Saving a Parquet file with datetimes will obviously be a common use case,
and if I'm understanding it correctly, a Parquet file saved with PyArrow
right now will not be readable by Spark. Yes? (I'm asking this as opposed
to stating this.)
-Brian
Re: spark error with reading parquet file created via pandas/pyarrow
Posted by Wes McKinney <we...@gmail.com>.
Indeed, INT96 is deprecated in the Parquet format. There are other
issues with Spark (it places restrictions on table field names, for
example), so it may be worth adding an option like
pq.write_table(table, where, flavor='spark')
or maybe better
pq.write_table(table, where, flavor='spark-2.2')
and this would set the correct options for that version of Spark.
I created https://issues.apache.org/jira/browse/ARROW-1499 as a place
to discuss further
- Wes
Re: spark error with reading parquet file created via pandas/pyarrow
Posted by Brian Wylie <br...@gmail.com>.
Okay,
So after some additional debugging, I can get around this if I set
use_deprecated_int96_timestamps=True
on the pq.write_table(arrow_table, filename, compression=compression,
use_deprecated_int96_timestamps=True) call.
But that just feels SO wrong... as I'm sure it's deprecated for a reason
(i.e. this will bite me later and badly).
I also see this issue (or at least a related issue) referenced in this Jeff
Knupp blog...
https://www.enigma.com/blog/moving-to-parquet-files-as-a-system-of-record
So shrug... any suggestions are greatly appreciated :)
-Brian