Posted to user@arrow.apache.org by Joris Van den Bossche <jo...@gmail.com> on 2021/10/01 15:06:20 UTC

Re: [Python,Parquet] How to specify VARCHAR max_length?

Hi Will,

I am not familiar with Redshift, so take the following with a grain of
salt. But a few notes about Parquet/Arrow:

Parquet has "binary" (or "byte_array") and "fixed_len_byte_array" physical
types (
https://github.com/apache/parquet-format/blob/43c891a4494f85e2fe0e56f4ef408bcc60e8da48/src/main/thrift/parquet.thrift#L39-L40),
and a "string" logical type which annotates the binary physical type (
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#string
).
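
For illustration, you can see that mapping with pyarrow itself. This is a
small untested sketch; the file name is arbitrary:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Write a table with a plain Arrow string column
    table = pa.table({"name": pa.array(["alice", "bob"], type=pa.string())})
    pq.write_table(table, "example.parquet")

    # The Parquet-level schema shows the column stored as the BYTE_ARRAY
    # physical type, annotated with the String logical type
    print(pq.ParquetFile("example.parquet").schema)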

When Arrow writes a Parquet file, it thus creates such "string" logical
typed columns for string data. And you are correct that there is no way to
influence how this is physically stored in the Parquet file (using anything
other than "binary" would also violate the spec, AFAIU), and Parquet also
doesn't know the concept of "VARCHAR(n)": a column is either variable
length (without a configurable max length) or fixed length.
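
The closest thing to a fixed width on the Arrow side is the fixed-size
binary type, but note that is raw bytes rather than a string, and every
value must have exactly the declared length. Again an untested sketch:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # pa.binary(3) is Arrow's fixed-size binary type; it maps to Parquet's
    # FIXED_LEN_BYTE_ARRAY physical type, and each value must be exactly
    # 3 bytes long
    fixed = pa.array([b"abc", b"def"], type=pa.binary(3))
    pq.write_table(pa.table({"code": fixed}), "fixed.parquet")
    print(pq.ParquetFile("fixed.parquet").schema)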

But it would seem strange if Redshift did not support string columns in
Parquet whatsoever. Do you have control over the table's schema into which
the Parquet file is read? E.g., I noticed
https://aws-data-wrangler.readthedocs.io/en/2.4.0-docs/stubs/awswrangler.redshift.copy_from_files.html
has a "varchar_lengths" keyword to control this.

Best,
Joris

On Thu, 30 Sept 2021 at 23:52, William Ayd <wi...@icloud.com> wrote:

> Greetings Arrow Community,
>
> I am working on a project that drops data into Parquet files on an
> Amazon S3 bucket and reads from there into Redshift. From what I can tell,
> AWS Redshift will not properly read a Parquet file that contains a VARCHAR
> data type with no max_length specification.
>
> Is there any way to pass that type information through to the parquet
> serializer? I searched through the documentation but nothing stood out to
> me.
>
> For reference, the best documentation I could find on the AWS side about
> not supporting a blanket VARCHAR without max_length is in the "Invalid
> Column Type Error" from this documentation:
>
>
> https://aws.amazon.com/premiumsupport/knowledge-center/redshift-spectrum-data-errors/
>
> Thanks,
> Will
>
>
>

Re: [Python,Parquet] How to specify VARCHAR max_length?

Posted by Weston Pace <we...@gmail.com>.
My Redshift knowledge is probably both rusty and inadequate, but I
believe Redshift requires a data catalog in addition to your data.
This is basically a dedicated dataset schema.  AWS Glue is the
technology I think they use for that data catalog.  If I read the
link you posted correctly, then it is complaining that the data
catalog should have a parameterized varchar type (and not the data
itself).  So you might look at whatever process you are using to
generate your data catalog (e.g. are you using `CREATE EXTERNAL
TABLE` or are you using "crawlers"?) and see if you can update that
catalog creation process to define the varchar length.
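
For example, a rough, untested sketch of what that could look like with
`CREATE EXTERNAL TABLE` (the schema, table, column, location, and
connection names are all made up):

    import awswrangler as wr

    # The explicit VARCHAR(n) lengths in the external table definition are
    # what end up in the data catalog; all names here are made up
    ddl = """
    CREATE EXTERNAL TABLE spectrum_schema.my_table (
        name VARCHAR(256),
        comment VARCHAR(1024)
    )
    STORED AS PARQUET
    LOCATION 's3://my-bucket/my-prefix/'
    """
    con = wr.redshift.connect("my-glue-connection")
    con.autocommit = True  # external DDL cannot run inside a transaction
    con.cursor().execute(ddl)
    con.close()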



Re: [Python,Parquet] How to specify VARCHAR max_length?

Posted by William Ayd <wi...@icloud.com>.
Thanks Joris for all the great info. I thought VARCHAR(max_length) was part of the Parquet file specification, but there must be some intermediary layer AWS is using to map to that from the logical string type you’ve referenced.

The library you referenced looks promising. I’ll definitely give that a shot. 

Thanks again!
Will

Sent from my iPhone
