Posted to issues@arrow.apache.org by "Chao Sun (JIRA)" <ji...@apache.org> on 2019/04/14 21:33:00 UTC

[jira] [Commented] (ARROW-5129) [Rust] Column writer bug: check dictionary encoder when adding a new data page

    [ https://issues.apache.org/jira/browse/ARROW-5129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16817449#comment-16817449 ] 

Chao Sun commented on ARROW-5129:
---------------------------------

[~andygrove]: could you add [~sadikovi] as contributor and assign this JIRA to him?

> [Rust] Column writer bug: check dictionary encoder when adding a new data page
> ------------------------------------------------------------------------------
>
>                 Key: ARROW-5129
>                 URL: https://issues.apache.org/jira/browse/ARROW-5129
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Rust
>         Environment: N/A
>            Reporter: Ivan Sadikov
>            Priority: Major
>              Labels: parquet, pull-request-available
>             Fix For: 0.14.0
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> As part of my weekly routine, I glanced over the code in the Parquet column writer and found that the way we check when to add a new data page is buggy. The idea is to check the current encoder and decide whether we have written enough bytes to construct a page. The problem is that we only check the value encoder, regardless of whether the dictionary encoder is enabled.
> Here is how we do it now: the actual check (https://github.com/apache/arrow/blob/master/rust/parquet/src/column/writer.rs#L378) and the buggy function (https://github.com/apache/arrow/blob/master/rust/parquet/src/column/writer.rs#L423).
> In the case of a sparse column with a dictionary encoder, we would write only a single data page, even though we would have accumulated enough bytes in the encoder for more than one page (the value encoder will be empty, so its size will always be less than the constant limit).
> I forgot that parquet-cpp keeps `current_encoder` as either the value encoder or the dictionary encoder (https://github.com/apache/parquet-cpp/blob/master/src/parquet/column_writer.cc#L544), whereas in parquet-rs we keep them separate.
> So the fix could be something like this:
> {code}
> /// Returns true if there is enough data for a data page, false otherwise.
> #[inline]
> fn should_add_data_page(&self) -> bool {
>   match self.dict_encoder {
>     Some(ref encoder) => {
>       encoder.estimated_data_encoded_size() >= self.props.data_pagesize_limit()
>     },
>     None => {
>       self.encoder.estimated_data_encoded_size() >= self.props.data_pagesize_limit()
>     }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)