Posted to dev@arrow.apache.org by sc...@ravenpack.com on 2018/05/03 06:47:40 UTC

Writing empty strings to parquet files

Hi:

I would like to know if there is any way in PyArrow to write empty string values to a Parquet file.
When I use pyarrow.parquet.write_table, any empty string values in a column end up as None in the Parquet file.
My process depends on these values being written as empty strings in the Parquet files.

To provide some context, my current workflow is the following:

- Read content from JSON files (using pandas.read_json)
- Convert the resulting DataFrame to a PyArrow table (using pyarrow.Table.from_pandas)
- Finally, write the table to a Parquet file (using pyarrow.parquet.write_table)

I have done some checks during the process, and the empty string values are preserved up to the step that writes the Parquet file.

The options of the write_table function don't offer anything specific for this. Is this behavior (writing '' as None) an unavoidable default?
Is there any other way to write the parquet files where I have more options to deal with this?

Any hint or feedback will be greatly appreciated.

Thanks a lot in advance, all the best.

Sergio Carrascoso


Re: Writing empty strings to parquet files

Posted by sc...@ravenpack.com.
Hi Wes:

Thanks for your message.

I would say that both test_pandas_parquet_1_0_rountrip and test_pandas_parquet_2_0_rountrip (in arrow/python/pyarrow/tests/test_parquet.py) already test this.
Sorry I didn’t realize this sooner.

All the best,

Sergio Carrascoso

> On 5 May 2018, at 01:31, Wes McKinney <we...@gmail.com> wrote:
> 
> Thanks Sergio. If we don't have any unit tests explicitly testing
> this, it would be a good idea to add some anyway.
> 
> - Wes


Re: Writing empty strings to parquet files

Posted by Wes McKinney <we...@gmail.com>.
Thanks Sergio. If we don't have any unit tests explicitly testing
this, it would be a good idea to add some anyway.

- Wes

On Fri, May 4, 2018 at 12:26 PM,  <sc...@ravenpack.com> wrote:
> Hi Uwe:
>
> Thanks a lot for your feedback.
>
> While preparing a simple example to reproduce this issue, I was able to get the expected behavior (empty strings properly written as '' in the Parquet file).
> So there is actually no problem with pyarrow.parquet.write_table.
>
> The problem was a bug in my own process: two steps were in the wrong order, so unicode formatting was applied to None values earlier than expected, turning them into the string 'None'.
>
> Again, thank you very much and apologies for the noise.
>
> Best,
>
> Sergio Carrascoso

Re: Writing empty strings to parquet files

Posted by sc...@ravenpack.com.
Hi Uwe:

Thanks a lot for your feedback.

While preparing a simple example to reproduce this issue, I was able to get the expected behavior (empty strings properly written as '' in the Parquet file).
So there is actually no problem with pyarrow.parquet.write_table.

The problem was a bug in my own process: two steps were in the wrong order, so unicode formatting was applied to None values earlier than expected, turning them into the string 'None'.
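
The kind of ordering bug described above can be illustrated with a short sketch. The two helper functions are hypothetical stand-ins for the steps of the process, which the thread does not show, and the sketch assumes the formatting step behaves like Python's built-in str().

```python
def format_as_text(value):
    # Hypothetical formatting step: coerce any value to text.
    # Applied to None, str() yields the four-character string "None".
    return str(value)


def format_preserving_missing(value):
    # Same formatting, but missing values are checked first and passed through.
    return None if value is None else format_as_text(value)


values = ["abc", "", None]

# Formatting applied before missing values are handled: None is lost.
formatted_too_early = [format_as_text(v) for v in values]

# Missing values handled first: None survives, and so does the empty string.
formatted_safely = [format_preserving_missing(v) for v in values]

print(formatted_too_early)
print(formatted_safely)
```

Once a missing value has been turned into text, later steps can no longer tell it apart from real data, which is why the order of the two steps matters.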

Again, thank you very much and apologies for the noise.

Best,

Sergio Carrascoso

> On 4 May 2018, at 10:54, Uwe L. Korn <uw...@xhochy.com> wrote:
> 
> Hello Sergio,
> 
> This is definitely unwanted behaviour. Can you open an issue on https://issues.apache.org/jira/projects/PARQUET and provide a minimal reproducing example? There is definitely a difference between empty strings and null strings. Parquet also supports this distinction, so we should support round-tripping both.
> 
> Uwe


Re: Writing empty strings to parquet files

Posted by "Uwe L. Korn" <uw...@xhochy.com>.
Hello Sergio,

This is definitely unwanted behaviour. Can you open an issue on https://issues.apache.org/jira/projects/PARQUET and provide a minimal reproducing example? There is definitely a difference between empty strings and null strings. Parquet also supports this distinction, so we should support round-tripping both.

Uwe
