Posted to github@arrow.apache.org by "EMCP (via GitHub)" <gi...@apache.org> on 2023/06/03 06:54:03 UTC

[GitHub] [arrow-rs] EMCP opened a new issue, #4356: Have a parquet file not able to be deduped via arrow-rs, complains about Decimal precision?

EMCP opened a new issue, #4356:
URL: https://github.com/apache/arrow-rs/issues/4356

   **Describe the bug**
   
   I am trying out arrow-rs for the first time, with the eventual goal of migrating off the Python implementation. One of the newest files to come across my bench makes this dedupe routine panic, and I am unsure why.
   
   Here's the routine:
   
   ```
   // Imports assumed; the original post omitted them.
   use std::fs;
   use std::path::Path;

   use polars::prelude::*;

   // Read a whole Parquet file into a DataFrame.
   fn example_get_frame(some_file_path: &str) -> PolarsResult<DataFrame> {
       let r = fs::File::open(some_file_path).unwrap();
       let reader = ParquetReader::new(r);
       reader.finish()
   }

   // Drop duplicate rows from one Parquet file and write the result into `output_dir`.
   fn dedupe_parquet_file(entry: walkdir::DirEntry, output_dir: String) {
       println!("modifying !");
       let df = example_get_frame(entry.path().to_str().unwrap());

       // Keep the first occurrence of each duplicated row.
       let mut new_df = df.expect("").unique(None, UniqueKeepStrategy::First).expect("");

       //TODO: build and verify a proper path
       let new_output_filepath = Path::new(&output_dir).join(entry.file_name());
       println!("{}", new_output_filepath.to_str().unwrap());
       let mut file = fs::File::create(new_output_filepath).unwrap();
       ParquetWriter::new(&mut file).finish(&mut new_df).unwrap();

       println!();
   }
   ```
   The error:
   
   ```
   thread 'main' panicked at ': ArrowError(ExternalFormat("File out of specification: Invalid DECIMAL: scale (1) cannot be greater than or equal to precision (1)"))', src/main.rs:21:25
   stack backtrace:
      0: rust_begin_unwind
                at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:579:5
      1: core::panicking::panic_fmt
                at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/panicking.rs:64:14
      2: core::result::unwrap_failed
                at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/result.rs:1750:5
      3: core::result::Result<T,E>::expect
                at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/result.rs:1047:23
      4: parquet_dedupe_data::dedupe_parquet_file
                at ./src/main.rs:21:22
      5: parquet_dedupe_data::main
                at ./src/main.rs:53:13
      6: core::ops::function::FnOnce::call_once
                at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/ops/function.rs:250:5
   note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
   ```
   
   **To Reproduce**
   As you can see, I walk the input directory, find Parquet files, and attempt to dedupe them.
   
   **Expected behavior**
   
   I suspect that either there's an error in my data, or this decimal case is not well supported by arrow-rs.
   
   **Additional context**
   
   Here's the schema of the offending file:
   
   ```
   {
     "type" : "record",
     "name" : "schema",
     "fields" : [ {
       "name" : "category",
       "type" : [ "null", "string" ],
       "default" : null
     }, {
       "name" : "maturity",
       "type" : [ "null", "string" ],
       "default" : null
     }, {
       "name" : "liquid_hours",
       "type" : [ "null", "string" ],
       "default" : null
     }, {
       "name" : "long_name",
       "type" : [ "null", "string" ],
       "default" : null
     }, {
       "name" : "contract_month",
       "type" : [ "null", "string" ],
       "default" : null
     }, {
       "name" : "real_expiration_date",
       "type" : [ "null", "string" ],
       "default" : null
     }, {
       "name" : "under_sec_type",
       "type" : [ "null", "string" ],
       "default" : null
     }, {
       "name" : "trading_hours",
       "type" : [ "null", "string" ],
       "default" : null
     }, {
       "name" : "ev_rule",
       "type" : [ "null", "string" ],
       "default" : null
     }, {
       "name" : "time_zone_id",
       "type" : [ "null", "string" ],
       "default" : null
     }, {
       "name" : "next_option_partial",
       "type" : [ "null", "string" ],
       "default" : null
     }, {
       "name" : "next_option_date",
       "type" : [ "null", "string" ],
       "default" : null
     }, {
       "name" : "price_magnifier",
       "type" : [ "null", {
         "type" : "fixed",
         "name" : "price_magnifier",
         "size" : 2,
         "logicalType" : "decimal",
         "precision" : 4,
         "scale" : 1
       } ],
       "default" : null
     }, {
       "name" : "agg_group",
       "type" : [ "null", "long" ],
       "default" : null
     }, {
       "name" : "stock_type",
       "type" : [ "null", "string" ],
       "default" : null
     }, {
       "name" : "under_symbol",
       "type" : [ "null", "string" ],
       "default" : null
     }, {
       "name" : "market_rule_ids",
       "type" : [ "null", "string" ],
       "default" : null
     }, {
       "name" : "query_start_time",
       "type" : [ "null", "long" ],
       "default" : null
     }, {
       "name" : "last_trade_time",
       "type" : [ "null", "string" ],
       "default" : null
     }, {
       "name" : "convertible",
       "type" : [ "null", "boolean" ],
       "default" : null
     }, {
       "name" : "coupon",
       "type" : [ "null", {
         "type" : "fixed",
         "name" : "coupon",
         "size" : 1,
         "logicalType" : "decimal",
         "precision" : 1,
         "scale" : 1
       } ],
       "default" : null
     }, {
       "name" : "cusip_check_digit",
       "type" : [ "null", "long" ],
       "default" : null
     }, {
       "name" : "callable",
       "type" : [ "null", "boolean" ],
       "default" : null
     }, {
       "name" : "isin",
       "type" : [ "null", "string" ],
       "default" : null
     }, {
       "name" : "issue_date",
       "type" : [ "null", "string" ],
       "default" : null
     }, {
       "name" : "ratings",
       "type" : [ "null", "string" ],
       "default" : null
     }, {
       "name" : "putable",
       "type" : [ "null", "boolean" ],
       "default" : null
     }, {
       "name" : "min_tick",
       "type" : [ "null", {
         "type" : "fixed",
         "name" : "min_tick",
         "size" : 2,
         "logicalType" : "decimal",
         "precision" : 4,
         "scale" : 4
       } ],
       "default" : null
     }, {
       "name" : "market_name",
       "type" : [ "null", "string" ],
       "default" : null
     }, {
       "name" : "order_types",
       "type" : [ "null", "string" ],
       "default" : null
     }, {
       "name" : "next_option_type",
       "type" : [ "null", "string" ],
       "default" : null
     }, {
       "name" : "suggested_size_increment",
       "type" : [ "null", "long" ],
       "default" : null
     }, {
       "name" : "bond_type",
       "type" : [ "null", "string" ],
       "default" : null
     }, {
       "name" : "industry",
       "type" : [ "null", "string" ],
       "default" : null
     }, {
       "name" : "contract_id",
       "type" : [ "null", "long" ],
       "default" : null
     }, {
       "name" : "ev_multiplier",
       "type" : [ "null", {
         "type" : "fixed",
         "name" : "ev_multiplier",
         "size" : 1,
         "logicalType" : "decimal",
         "precision" : 1,
         "scale" : 1
       } ],
       "default" : null
     }, {
       "name" : "subcategory",
       "type" : [ "null", "string" ],
       "default" : null
     }, {
       "name" : "min_size",
       "type" : [ "null", "long" ],
       "default" : null
     }, {
       "name" : "under_contract_id",
       "type" : [ "null", "long" ],
       "default" : null
     }, {
       "name" : "cusip",
       "type" : [ "null", "string" ],
       "default" : null
     }, {
       "name" : "coupon_type",
       "type" : [ "null", "string" ],
       "default" : null
     }, {
       "name" : "desc_append",
       "type" : [ "null", "string" ],
       "default" : null
     }, {
       "name" : "size_increment",
       "type" : [ "null", "long" ],
       "default" : null
     }, {
       "name" : "notes",
       "type" : [ "null", "string" ],
       "default" : null
     } ]
   }
   ```
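
For reference, the schema above declares four decimal columns. The following is a minimal Rust sketch, using only the standard library, that lists which of them meet the `scale >= precision` condition quoted in the panic message. The tuples are transcribed by hand from the schema, and `DECIMAL_FIELDS` and `fields_with_scale_ge_precision` are hypothetical names chosen for illustration; they are not part of arrow-rs, arrow2, or polars.

```rust
/// (name, precision, scale) for the four decimal columns, copied by hand
/// from the schema in this report.
const DECIMAL_FIELDS: [(&str, u32, u32); 4] = [
    ("price_magnifier", 4, 1),
    ("coupon", 1, 1),
    ("min_tick", 4, 4),
    ("ev_multiplier", 1, 1),
];

/// Hypothetical helper: returns the names of fields whose scale is greater
/// than or equal to their precision, the condition named in the panic message.
fn fields_with_scale_ge_precision(fields: &[(&'static str, u32, u32)]) -> Vec<&'static str> {
    fields
        .iter()
        .filter(|(_, precision, scale)| scale >= precision)
        .map(|(name, _, _)| *name)
        .collect()
}

fn main() {
    for name in fields_with_scale_ge_precision(&DECIMAL_FIELDS) {
        println!("{name}: scale >= precision");
    }
}
```

Under that condition, `coupon`, `min_tick`, and `ev_multiplier` are flagged, while `price_magnifier` (precision 4, scale 1) is not. The values in the panic message, scale (1) and precision (1), match `coupon` and `ev_multiplier`.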


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] EMCP commented on issue #4356: Have a parquet file not able to be deduped via arrow-rs, complains about Decimal precision?

Posted by "EMCP (via GitHub)" <gi...@apache.org>.
EMCP commented on issue #4356:
URL: https://github.com/apache/arrow-rs/issues/4356#issuecomment-1574910148

   Aha, thank you! I was thrown off because it was working with the pyarrow implementation without warning. I will close this and look into the upstream data creation in target-parquet.




[GitHub] [arrow-rs] EMCP commented on issue #4356: Have a parquet file not able to be deduped via arrow-rs, complains about Decimal precision?

Posted by "EMCP (via GitHub)" <gi...@apache.org>.
EMCP commented on issue #4356:
URL: https://github.com/apache/arrow-rs/issues/4356#issuecomment-1579444843

   I personally pushed an update to the library https://github.com/estrategiahq/target-parquet; however, it still defaults to writing Parquet spec 1.x files for backwards compatibility.
   
   If I explicitly bump it to output Parquet 2.4+, will this perhaps get fixed?
   
   https://github.com/estrategiahq/target-parquet/blob/master/setup.py#L16 Here you can see it calls for pyarrow 10.x.




[GitHub] [arrow-rs] EMCP commented on issue #4356: Have a parquet file not able to be deduped via arrow-rs, complains about Decimal precision?

Posted by "EMCP (via GitHub)" <gi...@apache.org>.
EMCP commented on issue #4356:
URL: https://github.com/apache/arrow-rs/issues/4356#issuecomment-1579451853

   Ah, my bad.
   
   Checking my Cargo.lock, I am seeing:
   
   ```
   
   [[package]]
   name = "polars-arrow"
   version = "0.27.2"
   source = "registry+https://github.com/rust-lang/crates.io-index"
   checksum = "06e57a7b929edf6c73475dbc3f63d35152f14f4a9455476acc6127d770daa0f6"
   dependencies = [
    "arrow2",
    "hashbrown 0.13.2",
    "num",
    "thiserror",
   ]
   
   
   ```
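
The same check can be scripted. Below is a rough Rust sketch, standard library only, that scans Cargo.lock text for packages listing a given dependency. It assumes the simple line layout shown in the excerpt above and is not a full TOML parser; `packages_depending_on` and `SAMPLE` are hypothetical names for illustration.

```rust
/// Hypothetical helper: returns the names of `[[package]]` entries in
/// Cargo.lock text whose `dependencies` array lists `dep`.
/// Assumes one key or one dependency per line, as in the excerpt above.
fn packages_depending_on(lock_text: &str, dep: &str) -> Vec<String> {
    let mut found = Vec::new();
    let mut current: Option<String> = None;
    let mut in_deps = false;
    for line in lock_text.lines() {
        let line = line.trim();
        if line == "[[package]]" {
            // A new package entry starts: reset the per-package state.
            current = None;
            in_deps = false;
        } else if let Some(rest) = line.strip_prefix("name = ") {
            current = Some(rest.trim_matches('"').to_string());
        } else if line == "dependencies = [" {
            in_deps = true;
        } else if line == "]" {
            in_deps = false;
        } else if in_deps {
            // Entries look like `"arrow2",` or `"hashbrown 0.13.2",`.
            let entry = line.trim_end_matches(',').trim_matches('"');
            let name = entry.split_whitespace().next().unwrap_or("");
            if name == dep {
                if let Some(pkg) = &current {
                    found.push(pkg.clone());
                }
            }
        }
    }
    found
}

/// The lock-file excerpt from the comment above, embedded for the example.
const SAMPLE: &str = r#"
[[package]]
name = "polars-arrow"
version = "0.27.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "06e57a7b929edf6c73475dbc3f63d35152f14f4a9455476acc6127d770daa0f6"
dependencies = [
 "arrow2",
 "hashbrown 0.13.2",
 "num",
 "thiserror",
]
"#;

fn main() {
    for pkg in packages_depending_on(SAMPLE, "arrow2") {
        println!("{pkg} depends on arrow2");
    }
}
```

In a real project, `cargo tree -i arrow2` reports the same reverse dependencies without any custom code.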




[GitHub] [arrow-rs] EMCP closed issue #4356: Have a parquet file not able to be deduped via arrow-rs, complains about Decimal precision?

Posted by "EMCP (via GitHub)" <gi...@apache.org>.
EMCP closed issue #4356: Have a parquet file not able to be deduped via arrow-rs, complains about Decimal precision?
URL: https://github.com/apache/arrow-rs/issues/4356




[GitHub] [arrow-rs] tustvold commented on issue #4356: Have a parquet file not able to be deduped via arrow-rs, complains about Decimal precision?

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #4356:
URL: https://github.com/apache/arrow-rs/issues/4356#issuecomment-1579474743

   Ah, it would appear you are using https://github.com/jorgecarleitao/arrow2, not this repo. Arrow2 forked large portions of arrow-rs, and it appears to have copied across a bug that has since been fixed in arrow-rs.
   
   There have been discussions about polars migrating off arrow2, but they appear to have stalled, so I suspect you should file an issue on the polars and/or arrow2 repositories.




[GitHub] [arrow-rs] alamb commented on issue #4356: Have a parquet file not able to be deduped via arrow-rs, complains about Decimal precision?

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #4356:
URL: https://github.com/apache/arrow-rs/issues/4356#issuecomment-1579489588

   FYI @ritchie46 (this bug was reported against arrow-rs but is actually a bug in polars)




[GitHub] [arrow-rs] tustvold commented on issue #4356: Have a parquet file not able to be deduped via arrow-rs, complains about Decimal precision?

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #4356:
URL: https://github.com/apache/arrow-rs/issues/4356#issuecomment-1579446477

   I was referring to reading it with a very old arrow-rs version; recent versions of the Rust library shouldn't produce the linked panic.




[GitHub] [arrow-rs] tustvold commented on issue #4356: Have a parquet file not able to be deduped via arrow-rs, complains about Decimal precision?

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #4356:
URL: https://github.com/apache/arrow-rs/issues/4356#issuecomment-1578945595

   Actually, coming back to this, I may have misled you. The precision and the scale can be equal; it merely implies a value less than 1. However, this was fixed in https://github.com/apache/arrow-rs/pull/1607. Is it possible you are using a very old arrow version?
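
To illustrate the point about equal precision and scale, here is a small Rust sketch using only the standard library. `strict_check` is modelled on the wording of the panic message and `relaxed_check` on the rule described in the comment above; neither is copied from arrow-rs or arrow2, and all three function names are hypothetical.

```rust
/// Check modelled on the panic message: rejects scale >= precision.
fn strict_check(precision: u32, scale: u32) -> Result<(), String> {
    if scale >= precision {
        Err(format!(
            "Invalid DECIMAL: scale ({scale}) cannot be greater than or equal to precision ({precision})"
        ))
    } else {
        Ok(())
    }
}

/// Relaxed check: scale may equal precision, but may not exceed it.
fn relaxed_check(precision: u32, scale: u32) -> Result<(), String> {
    if scale > precision {
        Err(format!(
            "Invalid DECIMAL: scale ({scale}) cannot be greater than precision ({precision})"
        ))
    } else {
        Ok(())
    }
}

/// Render an unscaled integer as a decimal string with `scale` fractional digits.
fn format_decimal(unscaled: i64, scale: u32) -> String {
    if scale == 0 {
        return unscaled.to_string();
    }
    let sign = if unscaled < 0 { "-" } else { "" };
    let digits = unscaled.unsigned_abs().to_string();
    let scale = scale as usize;
    // Left-pad with zeros so that at least one integer digit remains.
    let padded = format!("{:0>width$}", digits, width = scale + 1);
    let (int_part, frac_part) = padded.split_at(padded.len() - scale);
    format!("{sign}{int_part}.{frac_part}")
}

fn main() {
    // Precision 1, scale 1: a single stored digit, all of it fractional.
    println!("strict:  {:?}", strict_check(1, 1));
    println!("relaxed: {:?}", relaxed_check(1, 1));
    println!("unscaled 5 at scale 1 reads as {}", format_decimal(5, 1));
}
```

With precision 1 and scale 1, the stored integer is a single digit, so the representable values run from -0.9 to 0.9, all with magnitude below 1.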




[GitHub] [arrow-rs] EMCP commented on issue #4356: Have a parquet file not able to be deduped via arrow-rs, complains about Decimal precision?

Posted by "EMCP (via GitHub)" <gi...@apache.org>.
EMCP commented on issue #4356:
URL: https://github.com/apache/arrow-rs/issues/4356#issuecomment-1574730045

   This seems perhaps related to https://github.com/apache/arrow-rs/issues/2852




[GitHub] [arrow-rs] tustvold commented on issue #4356: Have a parquet file not able to be deduped via arrow-rs, complains about Decimal precision?

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #4356:
URL: https://github.com/apache/arrow-rs/issues/4356#issuecomment-1574832167

   This is a bug in whatever produced your data: a scale of 1 implies that the data is stored multiplied by 10, but it only has a precision of a single digit. The Parquet data is invalid.

