Posted to jira@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2022/08/11 23:01:00 UTC

[jira] [Updated] (ARROW-17391) [C#] arrow::read_feather() cannot read DictionaryArray written from C#

     [ https://issues.apache.org/jira/browse/ARROW-17391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neal Richardson updated ARROW-17391:
------------------------------------
    Summary: [C#] arrow::read_feather() cannot read DictionaryArray written from C#  (was: arrow::read_feather() cannot read DictionaryArray written from C#)

> [C#] arrow::read_feather() cannot read DictionaryArray written from C#
> ----------------------------------------------------------------------
>
>                 Key: ARROW-17391
>                 URL: https://issues.apache.org/jira/browse/ARROW-17391
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C#, R
>    Affects Versions: 9.0.1
>            Reporter: Todd West
>            Priority: Major
>             Fix For: 9.0.1
>
>
> This applies to Arrow 9.0.0, both the C# NuGet package and the R package, but for some reason 9.0.0 isn't in the issue dropdowns' list of released versions. The [implementation status page|https://arrow.apache.org/docs/status.html#ipc-format] also appears to be stale: the C# source contains [DictionaryArray|https://github.com/apache/arrow/blob/master/csharp/src/Apache.Arrow/Arrays/DictionaryArray.cs], and a look in the debugger confirms that, on the code paths which write a [dictionary batch|https://arrow.apache.org/docs/format/Columnar.html], the flags flip and the data structures update as expected for an [ArrowStreamWriter|https://github.com/apache/arrow/blob/master/csharp/src/Apache.Arrow/Ipc/ArrowStreamWriter.cs] that has correctly received both the dictionary index and value arrays it's given. However, on the R side, read_feather() fails with
> {{Error: Key error: Dictionary with id 1 not found}}
> So it appears most likely that either C# isn't properly emitting the dictionary batch, despite seeming to have all the code to do so, or something's going wrong on the reading side, in the C++ layers under R.
> Setup on the C# side is simple
> {{        public static DictionaryArray CreateStringTable(Memory<byte> indices, IList<string> values)}}
> {{        {}}
> {{            StringArray.Builder valueArray = new();}}
> {{            for (int valueIndex = 0; valueIndex < values.Count; ++valueIndex)}}
> {{            {}}
> {{                valueArray.Append(values[valueIndex]);}}
> {{            }}}
> {{            UInt8Array indexArray = new(ArrowArrayExtensions.WrapInArrayData(UInt8Type.Default, indices, indices.Length));}}
> {{            return new DictionaryArray(new(UInt8Type.Default, StringType.Default, false), indexArray, valueArray.Build());}}
> {{        }}}
> as is the R
> {{        library(arrow)}}
> {{        foo = read_feather("test.feather")}}
> If I drop the dictionary column, the two Arrow implementations interoperate without difficulty. The same holds if I write only the indices as a UInt8 column, so the issue here is clearly specific to the use of DictionaryArray. I've also tried other index sizes, so it doesn't appear specific to the use of UInt8.
> I'm therefore left with two questions:
> 1) Does DictionaryArray have working use cases in 9.0.0?
> 2) If what I'm doing isn't supposed to work yet, or I'm not getting the data structures set up correctly (there's no C# DictionaryArray example [on GitHub|https://github.com/apache/arrow/tree/master/csharp/examples]), is there an array-level workaround?
> There's only one string table in this schema and it's typically tiny (five values or less) so putting its values part in the schema metadata is a viable workaround, albeit an inelegant one.
> I'm not aware of a Feather file viewer but, if there is one, I'd be happy to use it to take a closer look. I can also link the sources after they've been committed and pushed, which should be by the end of the day tomorrow.
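
For reference, the schema-metadata workaround mentioned above could look roughly like the sketch below. It is only an illustration under stated assumptions: the column name "code", the metadata key "stringTable", and the newline separator are all hypothetical and not taken from the reporter's sources, and the sketch assumes no value contains a newline. The indices are written as a plain UInt8 column, which the report says already interoperates.

```csharp
using System.Collections.Generic;
using Apache.Arrow;
using Apache.Arrow.Types;

public static class StringTableWorkaround
{
    // Hypothetical metadata key; any name the writer and reader agree on would do.
    private const string StringTableKey = "stringTable";

    // Builds a schema whose only column holds the raw UInt8 indices and whose
    // schema-level metadata carries the string table values, newline separated.
    // Assumption: no value contains '\n'.
    public static Schema CreateSchema(IList<string> values)
    {
        return new Schema.Builder()
            .Field(field => field.Name("code").DataType(UInt8Type.Default).Nullable(false))
            .Metadata(StringTableKey, string.Join("\n", values))
            .Build();
    }
}
```

On the R side, the values should then be recoverable from the schema metadata of the table (for example by reading with as_data_frame = FALSE and splitting the "stringTable" entry on newlines), though that path has not been verified here.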


