You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@arrow.apache.org by Ying Zhou <yz...@gmail.com> on 2021/01/10 16:48:57 UTC

[C++] Shall we modify the ORC reader?

Hi,

While finishing the ORC writer in C++ I found that the ORC reader treats certain types in rather awkward ways. Hence I filed this Jira ticket: https://issues.apache.org/jira/browse/ARROW-11117 <https://issues.apache.org/jira/browse/ARROW-11117>

After starting to work on ORC tickets mostly filed by myself I began to worry that the type mappings in the ORC reader might already be used by users of Arrow. I wonder whether we should grandfather the issues or gradually switch to a new type mapping.

Here are my proposed changes:
1. The ORC STRING type should be converted to the Arrow LARGE_STRING type instead of STRING type since it is large.
2. The ORC LIST type should be converted to the Arrow LARGE_LIST type instead of LIST type since it is large.
3. The ORC MAP type should be converted to the Arrow MAP type instead of list of structs with hardcoded field names as long as 
the offsets fit into int32. Otherwise we shouldn't return OK.

Thanks,
Ying

Re: [C++] Shall we modify the ORC reader?

Posted by Micah Kornfield <em...@gmail.com>.

It probably makes sense to make this option configurable.  I think it is OK
to change the default to use Maps.  My guess is the initial ORC
implementation predated having a Map type in the specification.

On Thu, Jan 28, 2021 at 9:28 AM Ying Zhou <yz...@gmail.com> wrote:

> Hi,
>
> Really thanks Deepak!
>
> I really want to edit the ORC reader to read ORC MAPs as Arrow MAPs now
> and it’s not a serious hassle to do so. Is there anyone who needs the
> read-ORC-maps-as-lists-of-structs functionality? If not I will do it likely
> in my current PR.
>
> Ying
>
> > On Jan 19, 2021, at 8:45 PM, Deepak Majeti <ma...@gmail.com>
> wrote:
> >
> > Hi Ying,
> >
> > I can help review/merge any ORC C++ contributions.
> >
> >
> > On Thu, Jan 14, 2021 at 6:57 PM Ying Zhou <yz...@gmail.com> wrote:
> >
> >> Well, I haven’t found any. Thankfully ORC does work and I can figure out
> >> how it works by testing using simple examples. However I have never
> managed
> >> to contact the ORC community at all. They have never responded to any
> of my
> >> emails to dev@orc.apache.org <ma...@orc.apache.org> I do want to
> add
> >> write Snappy support (which was actually already done 2 years ago by
> >> someone else but due to lack of unit testing it was never merged into
> >> master. I can write the tests.) and maybe Decimal256 to ORC C++ if they
> are
> >> wiling to review and merge them. If anyone has successfully contacted
> the
> >> ORC community please let me know how.
> >>
> >> Best,
> >> Ying
> >>
> >>> On Jan 14, 2021, at 8:39 AM, Antoine Pitrou <an...@python.org>
> wrote:
> >>>
> >>>
> >>> Hi Ying,
> >>>
> >>> Is there a semantic description of the ORC data types somewhere?
> >>> I've read through https://orc.apache.org/docs/types.html and
> >>> https://orc.apache.org/specification/ORCv1/ but those docs don't seem
> >>> to explain the intent and constraints of each of the data types.
> >>>
> >>> Regards
> >>>
> >>> Antoine.
> >>>
> >>>
> >>>
> >>>
> >>> On Mon, 11 Jan 2021 21:15:05 -0500
> >>> Ying Zhou <yz...@gmail.com> wrote:
> >>>> Thanks! What about 3?
> >>>> Shall we convert ORC maps to Arrow maps as opposed to lists of structs
> >> with fields of the structs named ‘key’ and ‘value’?
> >>>>
> >>>>
> >>>>
> >>>>> On Jan 10, 2021, at 6:45 PM, Jacques Nadeau <ja...@apache.org>
> >> wrote:
> >>>>>
> >>>>> I don't think 1 & 2 make sense. I don't think there are a lot of
> users
> >>>>> reading 2gb strings or lists with 2B objects in them. Saying we just
> >> don't
> >>>>> support that pattern seems fine for now. I also believe the string
> and
> >> list
> >>>>> types have better cross-language support than the large variants.
> >>>>>
> >>>>> On Sun, Jan 10, 2021 at 8:49 AM Ying Zhou <yz...@gmail.com>
> wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> While finishing the ORC writer in C++ I found that the ORC reader
> >> treats
> >>>>>> certain types in rather awkward ways. Hence I filed this Jira
> ticket:
> >>>>>> https://issues.apache.org/jira/browse/ARROW-11117 <
> >>>>>> https://issues.apache.org/jira/browse/ARROW-11117>
> >>>>>>
> >>>>>> After starting to work on ORC tickets mostly filed by myself I began
> >> to
> >>>>>> worry that the type mappings in the ORC reader might already be used
> >> by
> >>>>>> users of Arrow. I wonder whether we should grandfather the issues or
> >>>>>> gradually switch to a new type mapping.
> >>>>>>
> >>>>>> Here are my proposed changes:
> >>>>>> 1. The ORC STRING type should be converted to the Arrow LARGE_STRING
> >> type
> >>>>>> instead of STRING type since it is large.
> >>>>>> 2. The ORC LIST type should be converted to the Arrow LARGE_LIST
> type
> >>>>>> instead of LIST type since it is large.
> >>>>>> 3. The ORC MAP type should be converted to the Arrow MAP type
> instead
> >> of
> >>>>>> list of structs with hardcoded field names as long as
> >>>>>> the offsets fit into int32. Otherwise we shouldn't return OK.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Ying
> >>>>
> >>>>
> >>>
> >>>
> >>>
> >>
> >>
> >
> > --
> > regards,
> > Deepak Majeti
>
>

Re: [C++] Shall we modify the ORC reader?

Posted by Ying Zhou <yz...@gmail.com>.

Hi,

Really thanks Deepak!

I really want to edit the ORC reader to read ORC MAPs as Arrow MAPs now and it’s not a serious hassle to do so. Is there anyone who needs the read-ORC-maps-as-lists-of-structs functionality? If not I will do it likely in my current PR.

Ying

> On Jan 19, 2021, at 8:45 PM, Deepak Majeti <ma...@gmail.com> wrote:
> 
> Hi Ying,
> 
> I can help review/merge any ORC C++ contributions.
> 
> 
> On Thu, Jan 14, 2021 at 6:57 PM Ying Zhou <yz...@gmail.com> wrote:
> 
>> Well, I haven’t found any. Thankfully ORC does work and I can figure out
>> how it works by testing using simple examples. However I have never managed
>> to contact the ORC community at all. They have never responded to any of my
>> emails to dev@orc.apache.org <ma...@orc.apache.org> I do want to add
>> write Snappy support (which was actually already done 2 years ago by
>> someone else but due to lack of unit testing it was never merged into
>> master. I can write the tests.) and maybe Decimal256 to ORC C++ if they are
>> wiling to review and merge them. If anyone has successfully contacted the
>> ORC community please let me know how.
>> 
>> Best,
>> Ying
>> 
>>> On Jan 14, 2021, at 8:39 AM, Antoine Pitrou <an...@python.org> wrote:
>>> 
>>> 
>>> Hi Ying,
>>> 
>>> Is there a semantic description of the ORC data types somewhere?
>>> I've read through https://orc.apache.org/docs/types.html and
>>> https://orc.apache.org/specification/ORCv1/ but those docs don't seem
>>> to explain the intent and constraints of each of the data types.
>>> 
>>> Regards
>>> 
>>> Antoine.
>>> 
>>> 
>>> 
>>> 
>>> On Mon, 11 Jan 2021 21:15:05 -0500
>>> Ying Zhou <yz...@gmail.com> wrote:
>>>> Thanks! What about 3?
>>>> Shall we convert ORC maps to Arrow maps as opposed to lists of structs
>> with fields of the structs named ‘key’ and ‘value’?
>>>> 
>>>> 
>>>> 
>>>>> On Jan 10, 2021, at 6:45 PM, Jacques Nadeau <ja...@apache.org>
>> wrote:
>>>>> 
>>>>> I don't think 1 & 2 make sense. I don't think there are a lot of users
>>>>> reading 2gb strings or lists with 2B objects in them. Saying we just
>> don't
>>>>> support that pattern seems fine for now. I also believe the string and
>> list
>>>>> types have better cross-language support than the large variants.
>>>>> 
>>>>> On Sun, Jan 10, 2021 at 8:49 AM Ying Zhou <yz...@gmail.com> wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> While finishing the ORC writer in C++ I found that the ORC reader
>> treats
>>>>>> certain types in rather awkward ways. Hence I filed this Jira ticket:
>>>>>> https://issues.apache.org/jira/browse/ARROW-11117 <
>>>>>> https://issues.apache.org/jira/browse/ARROW-11117>
>>>>>> 
>>>>>> After starting to work on ORC tickets mostly filed by myself I began
>> to
>>>>>> worry that the type mappings in the ORC reader might already be used
>> by
>>>>>> users of Arrow. I wonder whether we should grandfather the issues or
>>>>>> gradually switch to a new type mapping.
>>>>>> 
>>>>>> Here are my proposed changes:
>>>>>> 1. The ORC STRING type should be converted to the Arrow LARGE_STRING
>> type
>>>>>> instead of STRING type since it is large.
>>>>>> 2. The ORC LIST type should be converted to the Arrow LARGE_LIST type
>>>>>> instead of LIST type since it is large.
>>>>>> 3. The ORC MAP type should be converted to the Arrow MAP type instead
>> of
>>>>>> list of structs with hardcoded field names as long as
>>>>>> the offsets fit into int32. Otherwise we shouldn't return OK.
>>>>>> 
>>>>>> Thanks,
>>>>>> Ying
>>>> 
>>>> 
>>> 
>>> 
>>> 
>> 
>> 
> 
> -- 
> regards,
> Deepak Majeti

Re: [C++] Shall we modify the ORC reader?

Posted by Deepak Majeti <ma...@gmail.com>.

Hi Ying,

I can help review/merge any ORC C++ contributions.


On Thu, Jan 14, 2021 at 6:57 PM Ying Zhou <yz...@gmail.com> wrote:

> Well, I haven’t found any. Thankfully ORC does work and I can figure out
> how it works by testing using simple examples. However I have never managed
> to contact the ORC community at all. They have never responded to any of my
> emails to dev@orc.apache.org <ma...@orc.apache.org> I do want to add
> write Snappy support (which was actually already done 2 years ago by
> someone else but due to lack of unit testing it was never merged into
> master. I can write the tests.) and maybe Decimal256 to ORC C++ if they are
> wiling to review and merge them. If anyone has successfully contacted the
> ORC community please let me know how.
>
> Best,
> Ying
>
> > On Jan 14, 2021, at 8:39 AM, Antoine Pitrou <an...@python.org> wrote:
> >
> >
> > Hi Ying,
> >
> > Is there a semantic description of the ORC data types somewhere?
> > I've read through https://orc.apache.org/docs/types.html and
> > https://orc.apache.org/specification/ORCv1/ but those docs don't seem
> > to explain the intent and constraints of each of the data types.
> >
> > Regards
> >
> > Antoine.
> >
> >
> >
> >
> > On Mon, 11 Jan 2021 21:15:05 -0500
> > Ying Zhou <yz...@gmail.com> wrote:
> >> Thanks! What about 3?
> >> Shall we convert ORC maps to Arrow maps as opposed to lists of structs
> with fields of the structs named ‘key’ and ‘value’?
> >>
> >>
> >>
> >>> On Jan 10, 2021, at 6:45 PM, Jacques Nadeau <ja...@apache.org>
> wrote:
> >>>
> >>> I don't think 1 & 2 make sense. I don't think there are a lot of users
> >>> reading 2gb strings or lists with 2B objects in them. Saying we just
> don't
> >>> support that pattern seems fine for now. I also believe the string and
> list
> >>> types have better cross-language support than the large variants.
> >>>
> >>> On Sun, Jan 10, 2021 at 8:49 AM Ying Zhou <yz...@gmail.com> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> While finishing the ORC writer in C++ I found that the ORC reader
> treats
> >>>> certain types in rather awkward ways. Hence I filed this Jira ticket:
> >>>> https://issues.apache.org/jira/browse/ARROW-11117 <
> >>>> https://issues.apache.org/jira/browse/ARROW-11117>
> >>>>
> >>>> After starting to work on ORC tickets mostly filed by myself I began
> to
> >>>> worry that the type mappings in the ORC reader might already be used
> by
> >>>> users of Arrow. I wonder whether we should grandfather the issues or
> >>>> gradually switch to a new type mapping.
> >>>>
> >>>> Here are my proposed changes:
> >>>> 1. The ORC STRING type should be converted to the Arrow LARGE_STRING
> type
> >>>> instead of STRING type since it is large.
> >>>> 2. The ORC LIST type should be converted to the Arrow LARGE_LIST type
> >>>> instead of LIST type since it is large.
> >>>> 3. The ORC MAP type should be converted to the Arrow MAP type instead
> of
> >>>> list of structs with hardcoded field names as long as
> >>>> the offsets fit into int32. Otherwise we shouldn't return OK.
> >>>>
> >>>> Thanks,
> >>>> Ying
> >>
> >>
> >
> >
> >
>
>

-- 
regards,
Deepak Majeti

Re: [C++] Shall we modify the ORC reader?

Posted by Ying Zhou <yz...@gmail.com>.

Well, I haven’t found any. Thankfully ORC does work and I can figure out how it works by testing using simple examples. However I have never managed to contact the ORC community at all. They have never responded to any of my emails to dev@orc.apache.org <ma...@orc.apache.org> I do want to add write Snappy support (which was actually already done 2 years ago by someone else but due to lack of unit testing it was never merged into master. I can write the tests.) and maybe Decimal256 to ORC C++ if they are wiling to review and merge them. If anyone has successfully contacted the ORC community please let me know how.

Best,
Ying

> On Jan 14, 2021, at 8:39 AM, Antoine Pitrou <an...@python.org> wrote:
> 
> 
> Hi Ying,
> 
> Is there a semantic description of the ORC data types somewhere?
> I've read through https://orc.apache.org/docs/types.html and
> https://orc.apache.org/specification/ORCv1/ but those docs don't seem
> to explain the intent and constraints of each of the data types.
> 
> Regards
> 
> Antoine.
> 
> 
> 
> 
> On Mon, 11 Jan 2021 21:15:05 -0500
> Ying Zhou <yz...@gmail.com> wrote:
>> Thanks! What about 3? 
>> Shall we convert ORC maps to Arrow maps as opposed to lists of structs with fields of the structs named ‘key’ and ‘value’?
>> 
>> 
>> 
>>> On Jan 10, 2021, at 6:45 PM, Jacques Nadeau <ja...@apache.org> wrote:
>>> 
>>> I don't think 1 & 2 make sense. I don't think there are a lot of users
>>> reading 2gb strings or lists with 2B objects in them. Saying we just don't
>>> support that pattern seems fine for now. I also believe the string and list
>>> types have better cross-language support than the large variants.
>>> 
>>> On Sun, Jan 10, 2021 at 8:49 AM Ying Zhou <yz...@gmail.com> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> While finishing the ORC writer in C++ I found that the ORC reader treats
>>>> certain types in rather awkward ways. Hence I filed this Jira ticket:
>>>> https://issues.apache.org/jira/browse/ARROW-11117 <
>>>> https://issues.apache.org/jira/browse/ARROW-11117>
>>>> 
>>>> After starting to work on ORC tickets mostly filed by myself I began to
>>>> worry that the type mappings in the ORC reader might already be used by
>>>> users of Arrow. I wonder whether we should grandfather the issues or
>>>> gradually switch to a new type mapping.
>>>> 
>>>> Here are my proposed changes:
>>>> 1. The ORC STRING type should be converted to the Arrow LARGE_STRING type
>>>> instead of STRING type since it is large.
>>>> 2. The ORC LIST type should be converted to the Arrow LARGE_LIST type
>>>> instead of LIST type since it is large.
>>>> 3. The ORC MAP type should be converted to the Arrow MAP type instead of
>>>> list of structs with hardcoded field names as long as
>>>> the offsets fit into int32. Otherwise we shouldn't return OK.
>>>> 
>>>> Thanks,
>>>> Ying  
>> 
>> 
> 
> 
>

Re: [C++] Shall we modify the ORC reader?

Posted by Antoine Pitrou <an...@python.org>.

Hi Ying,

Is there a semantic description of the ORC data types somewhere?
I've read through https://orc.apache.org/docs/types.html and
https://orc.apache.org/specification/ORCv1/ but those docs don't seem
to explain the intent and constraints of each of the data types.

Regards

Antoine.




On Mon, 11 Jan 2021 21:15:05 -0500
Ying Zhou <yz...@gmail.com> wrote:
> Thanks! What about 3? 
> Shall we convert ORC maps to Arrow maps as opposed to lists of structs with fields of the structs named ‘key’ and ‘value’?
> 
> 
> 
> > On Jan 10, 2021, at 6:45 PM, Jacques Nadeau <ja...@apache.org> wrote:
> > 
> > I don't think 1 & 2 make sense. I don't think there are a lot of users
> > reading 2gb strings or lists with 2B objects in them. Saying we just don't
> > support that pattern seems fine for now. I also believe the string and list
> > types have better cross-language support than the large variants.
> > 
> > On Sun, Jan 10, 2021 at 8:49 AM Ying Zhou <yz...@gmail.com> wrote:
> >   
> >> Hi,
> >> 
> >> While finishing the ORC writer in C++ I found that the ORC reader treats
> >> certain types in rather awkward ways. Hence I filed this Jira ticket:
> >> https://issues.apache.org/jira/browse/ARROW-11117 <
> >> https://issues.apache.org/jira/browse/ARROW-11117>
> >> 
> >> After starting to work on ORC tickets mostly filed by myself I began to
> >> worry that the type mappings in the ORC reader might already be used by
> >> users of Arrow. I wonder whether we should grandfather the issues or
> >> gradually switch to a new type mapping.
> >> 
> >> Here are my proposed changes:
> >> 1. The ORC STRING type should be converted to the Arrow LARGE_STRING type
> >> instead of STRING type since it is large.
> >> 2. The ORC LIST type should be converted to the Arrow LARGE_LIST type
> >> instead of LIST type since it is large.
> >> 3. The ORC MAP type should be converted to the Arrow MAP type instead of
> >> list of structs with hardcoded field names as long as
> >> the offsets fit into int32. Otherwise we shouldn't return OK.
> >> 
> >> Thanks,
> >> Ying  
> 
>

Re: [C++] Shall we modify the ORC reader?

Posted by Ying Zhou <yz...@gmail.com>.

Thanks! What about 3? 
Shall we convert ORC maps to Arrow maps as opposed to lists of structs with fields of the structs named ‘key’ and ‘value’?



> On Jan 10, 2021, at 6:45 PM, Jacques Nadeau <ja...@apache.org> wrote:
> 
> I don't think 1 & 2 make sense. I don't think there are a lot of users
> reading 2gb strings or lists with 2B objects in them. Saying we just don't
> support that pattern seems fine for now. I also believe the string and list
> types have better cross-language support than the large variants.
> 
> On Sun, Jan 10, 2021 at 8:49 AM Ying Zhou <yz...@gmail.com> wrote:
> 
>> Hi,
>> 
>> While finishing the ORC writer in C++ I found that the ORC reader treats
>> certain types in rather awkward ways. Hence I filed this Jira ticket:
>> https://issues.apache.org/jira/browse/ARROW-11117 <
>> https://issues.apache.org/jira/browse/ARROW-11117>
>> 
>> After starting to work on ORC tickets mostly filed by myself I began to
>> worry that the type mappings in the ORC reader might already be used by
>> users of Arrow. I wonder whether we should grandfather the issues or
>> gradually switch to a new type mapping.
>> 
>> Here are my proposed changes:
>> 1. The ORC STRING type should be converted to the Arrow LARGE_STRING type
>> instead of STRING type since it is large.
>> 2. The ORC LIST type should be converted to the Arrow LARGE_LIST type
>> instead of LIST type since it is large.
>> 3. The ORC MAP type should be converted to the Arrow MAP type instead of
>> list of structs with hardcoded field names as long as
>> the offsets fit into int32. Otherwise we shouldn't return OK.
>> 
>> Thanks,
>> Ying

Re: [C++] Shall we modify the ORC reader?

Posted by Jacques Nadeau <ja...@apache.org>.

I don't think 1 & 2 make sense. I don't think there are a lot of users
reading 2gb strings or lists with 2B objects in them. Saying we just don't
support that pattern seems fine for now. I also believe the string and list
types have better cross-language support than the large variants.

On Sun, Jan 10, 2021 at 8:49 AM Ying Zhou <yz...@gmail.com> wrote:

> Hi,
>
> While finishing the ORC writer in C++ I found that the ORC reader treats
> certain types in rather awkward ways. Hence I filed this Jira ticket:
> https://issues.apache.org/jira/browse/ARROW-11117 <
> https://issues.apache.org/jira/browse/ARROW-11117>
>
> After starting to work on ORC tickets mostly filed by myself I began to
> worry that the type mappings in the ORC reader might already be used by
> users of Arrow. I wonder whether we should grandfather the issues or
> gradually switch to a new type mapping.
>
> Here are my proposed changes:
> 1. The ORC STRING type should be converted to the Arrow LARGE_STRING type
> instead of STRING type since it is large.
> 2. The ORC LIST type should be converted to the Arrow LARGE_LIST type
> instead of LIST type since it is large.
> 3. The ORC MAP type should be converted to the Arrow MAP type instead of
> list of structs with hardcoded field names as long as
> the offsets fit into int32. Otherwise we shouldn't return OK.
>
> Thanks,
> Ying