Posted to issues@arrow.apache.org by "Todd West (Jira)" <ji...@apache.org> on 2022/08/11 21:20:00 UTC

[jira] [Created] (ARROW-17391) arrow::read_feather() cannot read DictionaryArray written from C#

Todd West created ARROW-17391:
---------------------------------

             Summary: arrow::read_feather() cannot read DictionaryArray written from C#
                 Key: ARROW-17391
                 URL: https://issues.apache.org/jira/browse/ARROW-17391
             Project: Apache Arrow
          Issue Type: Bug
          Components: C#, R
    Affects Versions: 9.0.1
            Reporter: Todd West
             Fix For: 9.0.1


This applies to Arrow 9.0.0, both the C# NuGet package and the R package, but for some reason 9.0.0 isn't in the list of released versions in the issue dropdowns. The [implementation status page|https://arrow.apache.org/docs/status.html#ipc-format] also appears to be stale: the C# source contains [DictionaryArray|https://github.com/apache/arrow/blob/master/csharp/src/Apache.Arrow/Arrays/DictionaryArray.cs], and a look in the debugger confirms that [ArrowStreamWriter|https://github.com/apache/arrow/blob/master/csharp/src/Apache.Arrow/Ipc/ArrowStreamWriter.cs] correctly receives both the dictionary index and value arrays it's given, with the flags flipping and the data structures updating on the code paths which write a [dictionary batch|https://arrow.apache.org/docs/format/Columnar.html]. However, on the R side, read_feather() fails with

{{Error: Key error: Dictionary with id 1 not found}}

So it appears most likely that either C# isn't properly emitting the dictionary batch, despite seeming to have all the code to do so, or something's going wrong on the reading side, in the C++ layers under R.

Setup on the C# side is simple

{{        public static DictionaryArray CreateStringTable(Memory<byte> indicies, IList<string> values)}}
{{        {}}
{{            StringArray.Builder valueArray = new();}}
{{            for (int valueIndex = 0; valueIndex < values.Count; ++valueIndex)}}
{{            {}}
{{                valueArray.Append(values[valueIndex]);}}
{{            }}}
{{            UInt8Array indexArray = new(ArrowArrayExtensions.WrapInArrayData(UInt8Type.Default, indicies, indicies.Length));}}
{{            return new DictionaryArray(new(UInt8Type.Default, StringType.Default, false), indexArray, valueArray.Build());}}
{{        }}}

as is the R

{{        library(arrow)}}
{{        foo = read_feather("test.feather")}}

If I drop the dictionary column, the two Arrow implementations interop without difficulty. The same holds if I write only the indices as a UInt8 column. So the issue here is clearly specific to the use of DictionaryArray. I've also tried other index sizes, so it doesn't appear specific to the use of UInt8.

I'm therefore left with two questions:

1) Does DictionaryArray have working use cases in 9.0.0?

2) If what I'm doing isn't supposed to work yet, or I'm not getting the data structures set up correctly (there's no C# DictionaryArray example [on GitHub|https://github.com/apache/arrow/tree/master/csharp/examples]), is there an array-level workaround?

There's only one string table in this schema and it's typically tiny (five values or fewer), so putting its values in the schema metadata is a viable workaround, albeit an inelegant one.
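The logic of that metadata workaround can be sketched in a language-neutral way. The snippet below is only an illustration, written in Python for brevity: it assumes the string table is serialised as JSON under a hypothetical metadata key named "string_table" (Arrow schema metadata is a set of string key/value pairs), while the indices travel as an ordinary UInt8 column. The function names are made up for this sketch and are not part of any Arrow API.

```python
import json

# Hypothetical metadata key; any name agreed between writer and reader would do.
STRING_TABLE_KEY = "string_table"


def pack_values(values):
    """Serialise the string table into a single metadata value (writer side)."""
    return json.dumps(list(values))


def decode_column(indices, metadata, key=STRING_TABLE_KEY):
    """Map plain integer indices back to strings using the metadata (reader side)."""
    values = json.loads(metadata[key])
    return [values[i] for i in indices]
```

On the writing side the packed string would be attached to the schema metadata, and on the reading side the UInt8 column would be passed through the equivalent of decode_column() after loading.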

I'm not seeing a Feather file viewer available but, if there is one, I'd be happy to take a closer look. I can also link the sources after they've been committed and pushed, which should be by the end of the day tomorrow.
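In the absence of a viewer, a rough check of whether a dictionary batch was written at all is possible by walking the message framing of the file. The sketch below is a minimal, standard-library-only Python script, not an official tool. It assumes the file is in the Arrow IPC file format (Feather V2), that the writer emitted the end-of-stream marker before the footer, and that the Message flatbuffer uses the field order from Arrow's Message.fbs (header_type as field 1, bodyLength as field 3). The function names are hypothetical.

```python
import struct

# MessageHeader union values as declared in Arrow's Message.fbs.
HEADER_NAMES = {0: "NONE", 1: "Schema", 2: "DictionaryBatch",
                3: "RecordBatch", 4: "Tensor", 5: "SparseTensor"}


def _message_fields(meta):
    """Return (header_type, body_length) from a flatbuffer-encoded Message."""
    table = struct.unpack_from("<I", meta, 0)[0]               # root table offset
    vtable = table - struct.unpack_from("<i", meta, table)[0]  # signed back-offset
    vtable_size = struct.unpack_from("<H", meta, vtable)[0]

    def field_offset(index):
        # Each field has a 2-byte slot after the 4-byte vtable header;
        # a missing slot or a zero offset means the field holds its default.
        slot = 4 + 2 * index
        if slot + 2 > vtable_size:
            return 0
        return struct.unpack_from("<H", meta, vtable + slot)[0]

    off = field_offset(1)  # header_type (ubyte)
    header_type = meta[table + off] if off else 0
    off = field_offset(3)  # bodyLength (long)
    body_length = struct.unpack_from("<q", meta, table + off)[0] if off else 0
    return header_type, body_length


def list_ipc_messages(data):
    """List the message type names found in an Arrow IPC file held in bytes."""
    if data[:6] != b"ARROW1":
        raise ValueError("not an Arrow IPC file (missing ARROW1 magic)")
    pos = 8  # magic plus two padding bytes
    names = []
    while pos + 4 <= len(data):
        size = struct.unpack_from("<i", data, pos)[0]
        pos += 4
        if size == -1:  # 0xFFFFFFFF continuation marker; the real length follows
            size = struct.unpack_from("<i", data, pos)[0]
            pos += 4
        if size <= 0 or pos + size > len(data):  # end-of-stream or malformed
            break
        header_type, body_length = _message_fields(data[pos:pos + size])
        names.append(HEADER_NAMES.get(header_type, f"unknown({header_type})"))
        pos += size + body_length  # skip metadata and message body
    return names
```

Calling list_ipc_messages(open("test.feather", "rb").read()) on the file from the C# side would show whether a "DictionaryBatch" entry precedes the first "RecordBatch". If it's absent, the writer is the more likely culprit; if it's present, the problem is more likely on the reading side or in how the dictionary id is recorded.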


