Posted to user@arrow.apache.org by Suresh V <su...@gmail.com> on 2022/04/20 19:59:59 UTC

Questions on Dictionary Array types.

Hi .. I created a pyarrow table from a dictionary array as shown
below. However, I cannot figure out an easy way to recover the mapping
used to create the dictionary array (vals) from the table. Can you
please let me know the easiest way, other than approaches that involve
pyarrow.compute or conversion to pandas, which are expensive
operations for large datasets?

import pyarrow as pa
import pyarrow.compute as pc
import numpy as np

vals = ['aa', 'ab', 'ac', 'ba', 'bb', 'bc']
int_vals = [3, 4, 3, 0, 2, 0, 1, 5, 0, 0]
x = pa.DictionaryArray.from_arrays(pa.array(int_vals), vals)
tab = pa.Table.from_arrays([x], names=['x'])

Also, since this is effectively a dictionary-encoded string array, is
there any way to use string compute kernels like starts_with on it?
Right now I am aware of two methods, and neither is straightforward.

approach 1:
Cast to string, then run the string kernel:

import pyarrow.dataset as ds  # needed for ds.Scanner
expr = pc.starts_with(pc.field("x").cast(pa.string()), "a")
ds.Scanner.from_batches(tab.to_batches(), schema=tab.schema,
                        columns={'x': pc.field('x')}, filter=expr).to_table()

approach 2:
Filter using the corresponding indices, assuming we have access to the dictionary:

filter_ = np.where(pc.starts_with(x.dictionary, "a"))[0]
pc.is_in(x.indices, filter_)

Approach 2 is better/faster, but I am not able to figure out how to
get the dictionary/indices when starting from a table read from
parquet/feather.

Thanks

Re: Questions on Dictionary Array types.

Posted by Aldrin <ak...@ucsc.edu>.
I'd like to add a question that I am not sure how to answer from looking at
the code:
- When writing a table in IPC format (in my case to a file), how does the
writer write a DictionaryArray if it spans multiple chunks?

My intuition is that if a DictionaryArray is split into multiple chunks,
then the dictionary would be split as well. I would guess it's
straightforward to split the indices; the dictionary ("symbol
dictionary"?) would then either have some duplicates across chunks (each
chunk is independent), or it would be shared across the chunks (better
overall storage/representation, but chunks are not independent). I think
this is relevant to delta dictionaries, but I haven't fully worked out
what a delta dictionary is yet.

When writing to storage, I assume that independent dictionaries per batch
is the most straightforward, but I can imagine various trade-offs for other
approaches.

Pointers to where to look for this would be appreciated too! I looked at
IpcPayload stuff before when looking at the feather readers/writers but the
hierarchy was initially hard to follow.

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz


On Thu, Apr 21, 2022 at 2:52 PM Weston Pace <we...@gmail.com> wrote:


Re: Questions on Dictionary Array types.

Posted by Weston Pace <we...@gmail.com>.
There is some overhead with dataset.to_table() but I would expect it
to be mostly amortized out if you have 50m rows (assuming this is 50
million).  It may also depend on how many row groups you have.  Feel
free to create a JIRA with a reproducible example.

On Thu, Apr 21, 2022 at 10:29 AM Suresh V <su...@gmail.com> wrote:

Re: Questions on Dictionary Array types.

Posted by Suresh V <su...@gmail.com>.
Thanks for the explanation.  But I am now worried that approach 2, using
a string kernel like starts_with, will become more complicated, as I
have to run the filter in a Python loop over the chunks, effectively
slowing it down.

If we had access to the indices directly, approach 2 was 100% faster
than approach 1 for an array of 50m entries.

Completely unrelated: I also noticed that reading a feather file via the
ipc/feather API is 10x faster (0.2s vs 2s) than dataset.to_table() for
50m rows. I wouldn't expect there to be such a big difference. Will
create a bug if it isn't a known issue.

Thanks
On Thu, Apr 21, 2022, 4:07 PM Aldrin <ak...@ucsc.edu> wrote:


Re: Questions on Dictionary Array types.

Posted by Aldrin <ak...@ucsc.edu>.
I could be wrong, but in my experience the dictionary is available at the
chunk level, because that is where you know it is a DictionaryArray (or at
least, an Array). At the column level, you only know it's a ChunkedArray,
which seems to be roughly an alias for a vector<Array> (list[Array]), at
least type-wise.

Also, I think each chunk references the same dictionary, so I think you can
access any chunk's dictionary and get the same one.

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz


On Wed, Apr 20, 2022 at 10:54 PM Suresh V <su...@gmail.com> wrote:


Re: Questions on Dictionary Array types.

Posted by Suresh V <su...@gmail.com>.
Thank you very much for the response. I was looking directly at tab['x']
and didn't realize that the dictionary is present at the chunk level.

On Thu, Apr 21, 2022, 1:17 AM Weston Pace <we...@gmail.com> wrote:


Re: Questions on Dictionary Array types.

Posted by Weston Pace <we...@gmail.com>.
> However I cannot figure out any easy way to get the mapping
> used to create the dictionary array (vals) easily from the table. Can
> you please let me know the easiest way?

A dictionary is going to be associated with an array and not a table.
So you first need to get the array from the table.  Tables are made of
columns and each column is made of chunks and each chunk is an array.
Each chunk could have a different mapping, so that is something you
may need to deal with at some point depending on your goal.

The table you are creating in your example has one column and that
column has one chunk so we can get to the mapping with:

    tab.column(0).chunks[0].dictionary

And we can get to the indices with:

    tab.column(0).chunks[0].indices

> Also since this is effectively a string array which is dictionary
> encoded, is there any way to use string compute kernels like
> starts_with etc. Right now I am aware of two methods and they are not
> straightforward.

Regrettably, I don't think we have kernels in place for string
functions on dictionary arrays.  At least, that is my reading of [1].
So the two workarounds you have may be the best there are at the
moment.

[1] https://issues.apache.org/jira/browse/ARROW-14068

On Wed, Apr 20, 2022 at 10:00 AM Suresh V <su...@gmail.com> wrote: