Posted to issues@arrow.apache.org by "Mark Hildreth (Jira)" <ji...@apache.org> on 2020/04/18 17:08:00 UTC

[jira] [Commented] (ARROW-8287) [Rust] Arrow examples should use utility to print results

    [ https://issues.apache.org/jira/browse/ARROW-8287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17086534#comment-17086534 ] 

Mark Hildreth commented on ARROW-8287:
--------------------------------------

I took this PR hoping it would be a simple intro, but there's actually a bit more here than meets the eye. Here are my notes.
 * If the utility methods were moved as-is to the arrow crate, then the public interface of the arrow crate would now include the *prettytable* crate's *Table* struct (as that is what *create_table* returns). The simplest fix is to make *create_table* private, and only expose *print_batches* for now, which is what I would recommend.
 * Second, the crate used to build the output strings (*prettytable*) depends on the *encode_unicode* crate, which does some funky stuff: it implements the trait *FromIterator* for *Vec<u8>*. This can cause issues for any code that uses the *arrow* crate and relies on there being only one way to collect an *Iterator<_>* into a *Vec<u8>*, and it actually [broke some code in a test in the parquet crate.|https://github.com/apache/arrow/blob/8648cd46fd990e5c2e76c265b6f927b84a194ffb/rust/parquet/src/encodings/rle.rs#L832-L833] This was a pretty complicated problem for someone of my Rust experience, so I wrote up more information about it in [this reddit thread|https://www.reddit.com/r/rust/comments/g3iqan/crates_implementing_fromiterator_for_std/].


{code:java}
error[E0282]: type annotations needed
   --> parquet/src/encodings/rle.rs:833:26
    |
833 | Standard.sample_iter(&mut rng).take(seed_len).collect();
    | ^^^^^^^^^^^ cannot infer type for `T`
 {code}
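To make the inference failure above concrete, here is a minimal sketch that reproduces the same kind of ambiguity using only the standard library. *Utf8Unit* and *generate* are hypothetical stand-ins (for an item type defined by a dependency, and for a generic source whose item type is normally inferred); neither is taken from *encode_unicode*, *rand*, or the parquet crate.

```rust
use std::iter::FromIterator;

/// Hypothetical stand-in for an item type defined by a dependency.
struct Utf8Unit(u8);

// A second way to collect into Vec<u8>. The orphan rules allow this impl
// because Utf8Unit is local to the crate that writes it.
impl FromIterator<Utf8Unit> for Vec<u8> {
    fn from_iter<I: IntoIterator<Item = Utf8Unit>>(iter: I) -> Self {
        // The item type here is concretely u8, so std's impl is selected.
        iter.into_iter().map(|unit| unit.0).collect()
    }
}

/// Hypothetical generic source: the caller (or type inference) chooses T.
fn generate<T: From<u8>>(len: u8) -> impl Iterator<Item = T> {
    (0..len).map(T::from)
}

fn make_seed(len: u8) -> Vec<u8> {
    // If std's impl were the only one, `generate(len).collect()` would infer
    // T = u8 from the return type. With the extra impl above, T is ambiguous
    // and the un-annotated call no longer compiles, so name the item type:
    generate::<u8>(len).collect()
}
```

Calling *make_seed(4)* collects the bytes 0 through 3. The point of the sketch is only that the annotation on *generate* becomes mandatory as soon as a second *FromIterator* impl for *Vec<u8>* exists anywhere in the dependency graph.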
 * Additionally, the interface for *print_batches* accepts a vector of multiple RecordBatches. Unfortunately, there is no static guarantee that the RecordBatches share the same schema. The C++/Python and JavaScript implementations have created a new logical type called "Table" which tries to provide this (although some of their APIs also don't seem to provide that guarantee). However, developing such a structure is way outside the scope of this project, so I would be happy to leave it out for now and perhaps add an issue to revisit it. As a short-term solution, *print_batches* could take a generic iterator of *RecordBatch* values, which probably wouldn't need to change if we did end up with a *Table* type later on.
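
As a rough sketch of the iterator-based interface suggested in the last note, the function below accepts anything that can be iterated as batch references and checks schema equality at runtime, since nothing enforces it statically. *DemoSchema*, *DemoBatch*, and *format_batches* are hypothetical, simplified stand-ins written for this illustration; they are not the arrow crate's types or API.

```rust
/// Hypothetical, simplified stand-in for a schema: just the column names.
#[derive(Debug, Clone, PartialEq)]
struct DemoSchema {
    fields: Vec<String>,
}

/// Hypothetical, simplified stand-in for a record batch, stored row-wise.
struct DemoBatch {
    schema: DemoSchema,
    rows: Vec<Vec<String>>,
}

/// Formats all batches as one table. Accepts any iterable of batch
/// references (a Vec, a slice, a chained iterator, ...).
fn format_batches<'a, I>(batches: I) -> Result<String, String>
where
    I: IntoIterator<Item = &'a DemoBatch>,
{
    let mut out = String::new();
    let mut first: Option<&'a DemoSchema> = None;
    for batch in batches {
        if let Some(expected) = first {
            // Runtime check: every batch must match the first schema seen.
            if expected != &batch.schema {
                return Err(format!(
                    "schema mismatch: expected {:?}, got {:?}",
                    expected.fields, batch.schema.fields
                ));
            }
        } else {
            // First batch: emit the header row and remember its schema.
            out.push_str(&batch.schema.fields.join(" | "));
            out.push('\n');
            first = Some(&batch.schema);
        }
        for row in &batch.rows {
            out.push_str(&row.join(" | "));
            out.push('\n');
        }
    }
    Ok(out)
}
```

A caller can pass *&batches* for a *Vec<DemoBatch>*, a slice, or any other iterator of references. If a *Table* type were added later, it could plausibly be passed the same way by implementing *IntoIterator*, which is why the signature probably wouldn't need to change.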

 
So, here are my blocking questions:
 * Stick with the original *prettytable* crate and just add the required type annotations in the Parquet test, or find another crate that doesn't have this side effect? I recommend finding a different one.
 * Keep *create_table* public, or make it private? I recommend making it private.
 * Come up with a better wrapper for a "Table" to enforce one schema across multiple record batches, or don't worry about this for now? My recommendation is not to worry about it for now, but to make *print_batches* accept an iterator and to add an issue to think more about creating a *Table* type like the other implementations have.
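
To illustrate the second question, here is a minimal sketch of what keeping *create_table* private buys: the table type stays an implementation detail, so it could be swapped for another crate without changing the public interface. The *pretty* module, its *Table* struct, *format_rows*, and *print_rows* are all hypothetical names for this illustration; *Table* here is a local stand-in, not the *prettytable* type.

```rust
mod pretty {
    /// Hypothetical stand-in for a third-party table type.
    struct Table {
        lines: Vec<String>,
    }

    // Private helper: `Table` never appears in the public interface.
    fn create_table(rows: &[Vec<String>]) -> Table {
        Table {
            lines: rows.iter().map(|row| row.join(" | ")).collect(),
        }
    }

    /// Public entry point: only std types cross the module boundary.
    pub fn format_rows(rows: &[Vec<String>]) -> String {
        create_table(rows).lines.join("\n")
    }

    /// Public convenience wrapper that prints the formatted rows.
    pub fn print_rows(rows: &[Vec<String>]) {
        println!("{}", format_rows(rows));
    }
}
```

Code outside the module can call *pretty::format_rows* or *pretty::print_rows*, but it cannot name *Table* or *create_table*, so replacing the table implementation would not be a breaking change.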

> [Rust] Arrow examples should use utility to print results
> ---------------------------------------------------------
>
>                 Key: ARROW-8287
>                 URL: https://issues.apache.org/jira/browse/ARROW-8287
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust
>            Reporter: Andy Grove
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.0.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/arrow/pull/6773] added a utility for printing record batches and the DataFusion examples were updated to use this. We should now do the same for the Arrow examples. This will require moving the utility method from the datafusion crate to the arrow crate.


