You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/24 14:56:10 UTC
[GitHub] [arrow] manojkarthick opened a new pull request #9306: ARROW-10297: [Rust] Add flag to show json output in parquet-read binary
manojkarthick opened a new pull request #9306:
URL: https://github.com/apache/arrow/pull/9306
Add an option to print output in JSON format. in the parquet-read binary. Having json output allows for easy analysis using tools like [jq](https://stedolan.github.io/jq/). This PR builds on the changes implemented in https://github.com/apache/arrow/pull/8686 and incorporates the suggestions in that PR.
**Changelog**
* Update all three binaries `parquet-schema`, `parquet-rowcount` and `parquet-read` to use [clap](https://github.com/clap-rs/clap) for argument parsing
* Add `to_json_value()` method to get `serde_json::Value` from `Row` and `Field` structs (Thanks to @jhorstmann for these changes!)
* parquet-schema:
* Convert verbose argument into `-v/--verbose` flag
* parquet-read:
* Add a new flag `-j/--json` that prints the file contents in json lines format
* The feature is gated under the `json_output` cargo feature
* Update documentation and README with instructions for running
* The binaries now use version and author information as defined in Cargo.toml
Example output:
```
❯ parquet-read cities.parquet 3 --json
{"continent":"Europe","country":{"name":"France","city":["Paris","Nice","Marseilles","Cannes"]}}
{"continent":"Europe","country":{"name":"Greece","city":["Athens","Piraeus","Hania","Heraklion","Rethymnon","Fira"]}}
{"continent":"North America","country":{"name":"Canada","city":["Toronto","Vancouver","St. John's","Saint John","Montreal","Halifax","Winnipeg","Calgary","Saskatoon","Ottawa","Yellowknife"]}}
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] alamb commented on a change in pull request #9306: ARROW-10297: [Rust] Parameter for parquet-read to output data in json format
Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #9306:
URL: https://github.com/apache/arrow/pull/9306#discussion_r564465581
##########
File path: rust/parquet/src/bin/parquet-read.rs
##########
@@ -34,39 +34,98 @@
//! ```
//!
//! # Usage
-//!
//! ```
-//! parquet-read <file-path> [num-records]
+//! parquet-read <file-path> [num-records]
//! ```
-//! where `file-path` is the path to a Parquet file and `num-records` is the optional
-//! numeric option that allows to specify number of records to read from a file.
-//! When not provided, all records are read.
+//!
+//! ## Flags
+//! -h, --help Prints help information
+//! -j, --json Print parquet file in JSON lines Format
+//! -V, --version Prints version information
+//!
+//! ## Args
+//! <file-path> Path to a parquet file
+//! <num-records> Number of records to read. When not provided, all records are read.
//!
//! Note that `parquet-read` reads full file schema, no projection or filtering is
//! applied.
extern crate parquet;
-use std::{env, fs::File, path::Path, process};
+use std::{env, fs::File, path::Path};
+
+use clap::{crate_authors, crate_version, App, Arg};
use parquet::file::reader::{FileReader, SerializedFileReader};
+use parquet::record::Row;
fn main() {
- let args: Vec<String> = env::args().collect();
- if args.len() != 2 && args.len() != 3 {
- println!("Usage: parquet-read <file-path> [num-records]");
- process::exit(1);
- }
+ #[cfg(not(feature = "json_output"))]
+ let app = App::new("parquet-read")
+ .version(crate_version!())
+ .author(crate_authors!())
+ .about("Read data from parquet file")
+ .arg(
+ Arg::with_name("file_path")
+ .value_name("file-path")
+ .required(true)
+ .index(1)
+ .help("Path to a parquet file"),
+ )
+ .arg(
+ Arg::with_name("num_records")
+ .value_name("num-records")
+ .index(2)
+ .help(
+ "Number of records to read. When not provided, all records are read.",
+ ),
+ );
+
+ #[cfg(feature = "json_output")]
+ let app = App::new("parquet-read")
+ .version(crate_version!())
+ .author(crate_authors!())
+ .about("Read data from parquet file")
+ .arg(
+ Arg::with_name("file_path")
+ .value_name("file-path")
+ .required(true)
+ .index(1)
+ .help("Path to a parquet file"),
+ )
+ .arg(
+ Arg::with_name("num_records")
+ .value_name("num-records")
+ .index(2)
+ .help(
+ "Number of records to read. When not provided, all records are read.",
+ ),
+ )
+ .arg(
Review comment:
I wonder if it would make sense to try to reduce the redundancy. Maybe something like
```
let app = App::new("parquet-read")...;
#[cfg(feature = "json_output")]
let app = app.arg(Arg::with_name("json")...)
```
##########
File path: rust/parquet/Cargo.toml
##########
@@ -43,6 +43,8 @@ chrono = "0.4"
num-bigint = "0.3"
arrow = { path = "../arrow", version = "4.0.0-SNAPSHOT", optional = true }
base64 = { version = "0.12", optional = true }
+clap = "2.33.3"
Review comment:
would it be possible to make the `clap` dependency optional as well? Or maybe we can move the command line tools into their own feature? The hope for parquet is that it becomes a widely used library at the lower levels of a software stack so I think keeping dependencies minimal is a good goal
@nevi-me what do you think?
##########
File path: rust/parquet/src/record/api.rs
##########
@@ -635,6 +648,55 @@ impl Field {
_ => nyi!(descr, value),
}
}
+
+ #[cfg(feature = "json_output")]
+ pub fn to_json_value(&self) -> Value {
+ match &self {
+ Field::Null => Value::Null,
+ Field::Bool(b) => Value::Bool(*b),
+ Field::Byte(n) => Value::Number(serde_json::Number::from(*n)),
+ Field::Short(n) => Value::Number(serde_json::Number::from(*n)),
+ Field::Int(n) => Value::Number(serde_json::Number::from(*n)),
+ Field::Long(n) => Value::Number(serde_json::Number::from(*n)),
+ Field::UByte(n) => Value::Number(serde_json::Number::from(*n)),
+ Field::UShort(n) => Value::Number(serde_json::Number::from(*n)),
+ Field::UInt(n) => Value::Number(serde_json::Number::from(*n)),
+ Field::ULong(n) => Value::Number(serde_json::Number::from(*n)),
+ Field::Float(n) => serde_json::Number::from_f64(*n as f64)
+ .map(Value::Number)
+ .unwrap_or(Value::Null),
+ Field::Double(n) => serde_json::Number::from_f64(*n)
+ .map(Value::Number)
+ .unwrap_or(Value::Null),
+ Field::Decimal(n) => Value::String(convert_decimal_to_string(&n)),
+ Field::Str(s) => Value::String(s.to_owned()),
+ Field::Bytes(b) => Value::String(base64::encode(b.data())),
+ Field::Date(d) => Value::String(convert_date_to_string(*d)),
+ Field::TimestampMillis(ts) => {
+ Value::String(convert_timestamp_millis_to_string(*ts))
+ }
+ Field::TimestampMicros(ts) => {
+ Value::String(convert_timestamp_micros_to_string(*ts))
+ }
+ Field::Group(row) => row.to_json_value(),
+ Field::ListInternal(fields) => {
+ Value::Array(fields.elements.iter().map(|f| f.to_json_value()).collect())
+ }
+ Field::MapInternal(map) => Value::Object(
+ map.entries
+ .iter()
+ .map(|(key_field, value_field)| {
+ let key_val = key_field.to_json_value();
+ let key_str = key_val
+ .as_str()
+ .map(|s| s.to_owned())
+ .unwrap_or_else(|| key_val.to_string());
+ (key_str, value_field.to_json_value())
+ })
+ .collect(),
+ ),
+ }
+ }
}
Review comment:
I suggest adding some tests to show this code working as well as avoid regressions in the future (aka avoid someone breaking this code accidentally in the future but not realizing it)
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] codecov-io edited a comment on pull request #9306: ARROW-10297: [Rust] Parameter for parquet-read to output data in json format, add "cli" feature to parquet crate
Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9306:
URL: https://github.com/apache/arrow/pull/9306#issuecomment-766366191
# [Codecov](https://codecov.io/gh/apache/arrow/pull/9306?src=pr&el=h1) Report
> Merging [#9306](https://codecov.io/gh/apache/arrow/pull/9306?src=pr&el=desc) (d18cfd7) into [master](https://codecov.io/gh/apache/arrow/commit/437c8c944acb3479b76804f041f5f8cbce309fa7?el=desc) (437c8c9) will **increase** coverage by `0.09%`.
> The diff coverage is `8.33%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9306/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9306?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #9306 +/- ##
==========================================
+ Coverage 81.88% 81.98% +0.09%
==========================================
Files 215 227 +12
Lines 52988 53363 +375
==========================================
+ Hits 43391 43751 +360
- Misses 9597 9612 +15
```
| [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9306?src=pr&el=tree) | Coverage Δ | |
|---|---|---|
| [rust/parquet/src/record/api.rs](https://codecov.io/gh/apache/arrow/pull/9306/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9yZWNvcmQvYXBpLnJz) | `92.63% <8.33%> (-5.26%)` | :arrow_down: |
| [rust/datafusion/src/datasource/csv.rs](https://codecov.io/gh/apache/arrow/pull/9306/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL2Nzdi5ycw==) | `60.46% <0.00%> (-4.54%)` | :arrow_down: |
| [rust/datafusion/src/scalar.rs](https://codecov.io/gh/apache/arrow/pull/9306/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9zY2FsYXIucnM=) | `55.85% <0.00%> (-3.12%)` | :arrow_down: |
| [rust/arrow/src/array/equal/boolean.rs](https://codecov.io/gh/apache/arrow/pull/9306/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvYm9vbGVhbi5ycw==) | `97.43% <0.00%> (-2.57%)` | :arrow_down: |
| [rust/datafusion/src/physical\_plan/projection.rs](https://codecov.io/gh/apache/arrow/pull/9306/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3Byb2plY3Rpb24ucnM=) | `84.93% <0.00%> (-0.99%)` | :arrow_down: |
| [rust/datafusion/src/sql/parser.rs](https://codecov.io/gh/apache/arrow/pull/9306/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9zcWwvcGFyc2VyLnJz) | `81.45% <0.00%> (-0.90%)` | :arrow_down: |
| [rust/arrow/src/buffer.rs](https://codecov.io/gh/apache/arrow/pull/9306/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYnVmZmVyLnJz) | `95.41% <0.00%> (-0.80%)` | :arrow_down: |
| [...t/datafusion/src/physical\_plan/coalesce\_batches.rs](https://codecov.io/gh/apache/arrow/pull/9306/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2NvYWxlc2NlX2JhdGNoZXMucnM=) | `84.11% <0.00%> (-0.80%)` | :arrow_down: |
| [rust/arrow/src/compute/kernels/filter.rs](https://codecov.io/gh/apache/arrow/pull/9306/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL2ZpbHRlci5ycw==) | `97.71% <0.00%> (-0.77%)` | :arrow_down: |
| [...datafusion/src/physical\_plan/string\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9306/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3N0cmluZ19leHByZXNzaW9ucy5ycw==) | `86.95% <0.00%> (-0.55%)` | :arrow_down: |
| ... and [87 more](https://codecov.io/gh/apache/arrow/pull/9306/diff?src=pr&el=tree-more) | |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9306?src=pr&el=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9306?src=pr&el=footer). Last update [437c8c9...d18cfd7](https://codecov.io/gh/apache/arrow/pull/9306?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #9306: ARROW-10297: [Rust] Parameter for parquet-read to output data in json format, add "cli" feature to parquet crate
Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9306:
URL: https://github.com/apache/arrow/pull/9306#issuecomment-770443904
https://issues.apache.org/jira/browse/ARROW-10297
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] manojkarthick commented on pull request #9306: ARROW-10297: [Rust] Parameter for parquet-read to output data in json format, add "cli" feature to parquet crate
Posted by GitBox <gi...@apache.org>.
manojkarthick commented on pull request #9306:
URL: https://github.com/apache/arrow/pull/9306#issuecomment-770432060
Thanks for the reviews @alamb and @sunchao ! I've addressed all the comments on the PR.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] manojkarthick commented on a change in pull request #9306: ARROW-10297: [Rust] Parameter for parquet-read to output data in json format, add "cli" feature to parquet crate
Posted by GitBox <gi...@apache.org>.
manojkarthick commented on a change in pull request #9306:
URL: https://github.com/apache/arrow/pull/9306#discussion_r567471430
##########
File path: rust/parquet/src/bin/parquet-read.rs
##########
@@ -34,39 +34,72 @@
//! ```
//!
//! # Usage
-//!
//! ```
//! parquet-read <file-path> [num-records]
//! ```
-//! where `file-path` is the path to a Parquet file and `num-records` is the optional
-//! numeric option that allows to specify number of records to read from a file.
-//! When not provided, all records are read.
+//!
+//! ## Flags
+//! -h, --help Prints help information
+//! -j, --json Print parquet file in JSON lines Format
+//! -V, --version Prints version information
+//!
+//! ## Args
+//! <file-path> Path to a parquet file
Review comment:
My bad: Missed capitalizing in the doc comments! Thank you :)
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #9306: ARROW-10297: [Rust] Add flag to show json output in parquet-read binary
Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9306:
URL: https://github.com/apache/arrow/pull/9306#issuecomment-766360952
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
Thanks for opening a pull request!
Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW
Then could you also rename pull request title in the following format?
ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
See also:
* [Other pull requests](https://github.com/apache/arrow/pulls/)
* [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] manojkarthick commented on a change in pull request #9306: ARROW-10297: [Rust] Parameter for parquet-read to output data in json format
Posted by GitBox <gi...@apache.org>.
manojkarthick commented on a change in pull request #9306:
URL: https://github.com/apache/arrow/pull/9306#discussion_r567264445
##########
File path: rust/parquet/src/record/api.rs
##########
@@ -635,6 +648,55 @@ impl Field {
_ => nyi!(descr, value),
}
}
+
+ #[cfg(feature = "json_output")]
+ pub fn to_json_value(&self) -> Value {
+ match &self {
+ Field::Null => Value::Null,
+ Field::Bool(b) => Value::Bool(*b),
+ Field::Byte(n) => Value::Number(serde_json::Number::from(*n)),
+ Field::Short(n) => Value::Number(serde_json::Number::from(*n)),
+ Field::Int(n) => Value::Number(serde_json::Number::from(*n)),
+ Field::Long(n) => Value::Number(serde_json::Number::from(*n)),
+ Field::UByte(n) => Value::Number(serde_json::Number::from(*n)),
+ Field::UShort(n) => Value::Number(serde_json::Number::from(*n)),
+ Field::UInt(n) => Value::Number(serde_json::Number::from(*n)),
+ Field::ULong(n) => Value::Number(serde_json::Number::from(*n)),
+ Field::Float(n) => serde_json::Number::from_f64(*n as f64)
+ .map(Value::Number)
+ .unwrap_or(Value::Null),
+ Field::Double(n) => serde_json::Number::from_f64(*n)
+ .map(Value::Number)
+ .unwrap_or(Value::Null),
+ Field::Decimal(n) => Value::String(convert_decimal_to_string(&n)),
+ Field::Str(s) => Value::String(s.to_owned()),
+ Field::Bytes(b) => Value::String(base64::encode(b.data())),
+ Field::Date(d) => Value::String(convert_date_to_string(*d)),
+ Field::TimestampMillis(ts) => {
+ Value::String(convert_timestamp_millis_to_string(*ts))
+ }
+ Field::TimestampMicros(ts) => {
+ Value::String(convert_timestamp_micros_to_string(*ts))
+ }
+ Field::Group(row) => row.to_json_value(),
+ Field::ListInternal(fields) => {
+ Value::Array(fields.elements.iter().map(|f| f.to_json_value()).collect())
+ }
+ Field::MapInternal(map) => Value::Object(
+ map.entries
+ .iter()
+ .map(|(key_field, value_field)| {
+ let key_val = key_field.to_json_value();
+ let key_str = key_val
+ .as_str()
+ .map(|s| s.to_owned())
+ .unwrap_or_else(|| key_val.to_string());
+ (key_str, value_field.to_json_value())
+ })
+ .collect(),
+ ),
+ }
+ }
}
Review comment:
I've also updated the PR and added the test: [test_to_json_value](https://github.com/apache/arrow/pull/9306/files#diff-d6de2aa116b82bf7bfdf40fcbad13dc947139ce7a703990517662ae25eab0159R1676)
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] sunchao closed pull request #9306: ARROW-10297: [Rust] Parameter for parquet-read to output data in json format, add "cli" feature to parquet crate
Posted by GitBox <gi...@apache.org>.
sunchao closed pull request #9306:
URL: https://github.com/apache/arrow/pull/9306
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] codecov-io edited a comment on pull request #9306: ARROW-10297: [Rust] Parameter for parquet-read to output data in json format
Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9306:
URL: https://github.com/apache/arrow/pull/9306#issuecomment-766366191
# [Codecov](https://codecov.io/gh/apache/arrow/pull/9306?src=pr&el=h1) Report
> Merging [#9306](https://codecov.io/gh/apache/arrow/pull/9306?src=pr&el=desc) (0bd412e) into [master](https://codecov.io/gh/apache/arrow/commit/437c8c944acb3479b76804f041f5f8cbce309fa7?el=desc) (437c8c9) will **decrease** coverage by `0.06%`.
> The diff coverage is `0.00%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9306/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9306?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #9306 +/- ##
==========================================
- Coverage 81.88% 81.82% -0.07%
==========================================
Files 215 215
Lines 52988 53032 +44
==========================================
+ Hits 43391 43392 +1
- Misses 9597 9640 +43
```
| [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9306?src=pr&el=tree) | Coverage Δ | |
|---|---|---|
| [rust/parquet/src/bin/parquet-read.rs](https://codecov.io/gh/apache/arrow/pull/9306/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9iaW4vcGFycXVldC1yZWFkLnJz) | `0.00% <0.00%> (ø)` | |
| [rust/parquet/src/bin/parquet-rowcount.rs](https://codecov.io/gh/apache/arrow/pull/9306/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9iaW4vcGFycXVldC1yb3djb3VudC5ycw==) | `0.00% <0.00%> (ø)` | |
| [rust/parquet/src/bin/parquet-schema.rs](https://codecov.io/gh/apache/arrow/pull/9306/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9iaW4vcGFycXVldC1zY2hlbWEucnM=) | `0.00% <0.00%> (ø)` | |
| [rust/parquet/src/record/api.rs](https://codecov.io/gh/apache/arrow/pull/9306/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9yZWNvcmQvYXBpLnJz) | `92.50% <0.00%> (-5.38%)` | :arrow_down: |
| [rust/parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow/pull/9306/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9lbmNvZGluZ3MvZW5jb2RpbmcucnM=) | `95.43% <0.00%> (+0.19%)` | :arrow_up: |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9306?src=pr&el=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9306?src=pr&el=footer). Last update [437c8c9...0bd412e](https://codecov.io/gh/apache/arrow/pull/9306?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] sunchao commented on a change in pull request #9306: ARROW-10297: [Rust] Parameter for parquet-read to output data in json format, add "cli" feature to parquet crate
Posted by GitBox <gi...@apache.org>.
sunchao commented on a change in pull request #9306:
URL: https://github.com/apache/arrow/pull/9306#discussion_r567462432
##########
File path: rust/parquet/README.md
##########
@@ -79,23 +79,23 @@ enabled by adding `RUSTFLAGS="-C target-feature=+sse4.2"` before the
`cargo build` command.
## Test
-Run `cargo test` for unit tests.
+Run `cargo test` for unit tests. To also run tests related to the binaries, use `cargo test --features cli`.
## Binaries
-The following binaries are provided (use `cargo install` to install them):
+The following binaries are provided (use `cargo install --features cli` to install them):
- **parquet-schema** for printing Parquet file schema and metadata.
-`Usage: parquet-schema <file-path> [verbose]`, where `file-path` is the path to a Parquet file,
-and optional `verbose` is the boolean flag that allows to print full metadata or schema only
-(when not specified only schema will be printed).
+`Usage: parquet-schema <file-path>`, where `file-path` is the path to a Parquet file. Use `-v/--verbose` flag
+to print full metadata or schema only (when not specified only schema will be printed).
- **parquet-read** for reading records from a Parquet file.
`Usage: parquet-read <file-path> [num-records]`, where `file-path` is the path to a Parquet file,
and `num-records` is the number of records to read from a file (when not specified all records will
-be printed).
+be printed). Use `cargo install --features json_output` to enable `-j/--json` flag for printing output
Review comment:
Hmm is this correct? I don't see `json_output` in the Cargo file. The flag is associated with the command?
##########
File path: rust/parquet/src/bin/parquet-rowcount.rs
##########
@@ -34,30 +34,44 @@
//! ```
//!
//! # Usage
-//!
//! ```
-//! parquet-rowcount <file-path> ...
+//! parquet-rowcount <file-paths>...
//! ```
-//! where `file-path` is the path to a Parquet file and `...` is any additional number of
-//! parquet files to count the number of rows from.
+//!
+//! ## Flags
+//! -h, --help Prints help information
+//! -V, --version Prints version information
+//!
+//! ## Args
+//! <file-paths>... List of parquet files to read from
//!
//! Note that `parquet-rowcount` reads full file schema, no projection or filtering is
//! applied.
extern crate parquet;
-use std::{env, fs::File, path::Path, process};
+use std::{env, fs::File, path::Path};
+
+use clap::{crate_authors, crate_version, App, Arg};
use parquet::file::reader::{FileReader, SerializedFileReader};
fn main() {
- let args: Vec<String> = env::args().collect();
- if args.len() < 2 {
- println!("Usage: parquet-rowcount <file-path> ...");
- process::exit(1);
- }
+ let matches = App::new("parquet-rowcount")
+ .version(crate_version!())
+ .author(crate_authors!())
+ .about("Return number of rows in parquet file")
+ .arg(
+ Arg::with_name("file_paths")
+ .value_name("file-paths")
+ .required(true)
+ .multiple(true)
+ .help("List of parquet files to read from"),
Review comment:
maybe add "separated by space"
##########
File path: rust/parquet/src/bin/parquet-read.rs
##########
@@ -34,39 +34,72 @@
//! ```
//!
//! # Usage
-//!
//! ```
-//! parquet-read <file-path> [num-records]
+//! parquet-read <file-path> [num-records]
Review comment:
nit: remove space?
##########
File path: rust/parquet/src/bin/parquet-read.rs
##########
@@ -34,39 +34,72 @@
//! ```
//!
//! # Usage
-//!
//! ```
-//! parquet-read <file-path> [num-records]
+//! parquet-read <file-path> [num-records]
//! ```
-//! where `file-path` is the path to a Parquet file and `num-records` is the optional
-//! numeric option that allows to specify number of records to read from a file.
-//! When not provided, all records are read.
+//!
+//! ## Flags
+//! -h, --help Prints help information
+//! -j, --json Print parquet file in JSON lines Format
+//! -V, --version Prints version information
+//!
+//! ## Args
+//! <file-path> Path to a parquet file
+//! <num-records> Number of records to read. When not provided, all records are read.
//!
//! Note that `parquet-read` reads full file schema, no projection or filtering is
//! applied.
extern crate parquet;
-use std::{env, fs::File, path::Path, process};
+use std::{env, fs::File, path::Path};
+
+use clap::{crate_authors, crate_version, App, Arg};
use parquet::file::reader::{FileReader, SerializedFileReader};
+use parquet::record::Row;
fn main() {
- let args: Vec<String> = env::args().collect();
- if args.len() != 2 && args.len() != 3 {
- println!("Usage: parquet-read <file-path> [num-records]");
- process::exit(1);
- }
+ let app = App::new("parquet-read")
+ .version(crate_version!())
+ .author(crate_authors!())
+ .about("Read data from parquet file")
Review comment:
maybe say "Read data from a Parquet file and print output in console, in either built-in or JSON format"?
##########
File path: rust/parquet/src/bin/parquet-rowcount.rs
##########
@@ -34,30 +34,44 @@
//! ```
//!
//! # Usage
-//!
//! ```
-//! parquet-rowcount <file-path> ...
+//! parquet-rowcount <file-paths>...
//! ```
-//! where `file-path` is the path to a Parquet file and `...` is any additional number of
-//! parquet files to count the number of rows from.
+//!
+//! ## Flags
+//! -h, --help Prints help information
+//! -V, --version Prints version information
+//!
+//! ## Args
+//! <file-paths>... List of parquet files to read from
//!
//! Note that `parquet-rowcount` reads full file schema, no projection or filtering is
//! applied.
extern crate parquet;
-use std::{env, fs::File, path::Path, process};
+use std::{env, fs::File, path::Path};
+
+use clap::{crate_authors, crate_version, App, Arg};
use parquet::file::reader::{FileReader, SerializedFileReader};
fn main() {
- let args: Vec<String> = env::args().collect();
- if args.len() < 2 {
- println!("Usage: parquet-rowcount <file-path> ...");
- process::exit(1);
- }
+ let matches = App::new("parquet-rowcount")
+ .version(crate_version!())
+ .author(crate_authors!())
+ .about("Return number of rows in parquet file")
Review comment:
nit: let's use Parquet instead of parquet everywhere
##########
File path: rust/parquet/src/bin/parquet-read.rs
##########
@@ -34,39 +34,72 @@
//! ```
//!
//! # Usage
-//!
//! ```
-//! parquet-read <file-path> [num-records]
+//! parquet-read <file-path> [num-records]
//! ```
-//! where `file-path` is the path to a Parquet file and `num-records` is the optional
-//! numeric option that allows to specify number of records to read from a file.
-//! When not provided, all records are read.
+//!
+//! ## Flags
+//! -h, --help Prints help information
+//! -j, --json Print parquet file in JSON lines Format
+//! -V, --version Prints version information
+//!
+//! ## Args
+//! <file-path> Path to a parquet file
+//! <num-records> Number of records to read. When not provided, all records are read.
//!
//! Note that `parquet-read` reads full file schema, no projection or filtering is
//! applied.
extern crate parquet;
-use std::{env, fs::File, path::Path, process};
+use std::{env, fs::File, path::Path};
+
+use clap::{crate_authors, crate_version, App, Arg};
use parquet::file::reader::{FileReader, SerializedFileReader};
+use parquet::record::Row;
fn main() {
- let args: Vec<String> = env::args().collect();
- if args.len() != 2 && args.len() != 3 {
- println!("Usage: parquet-read <file-path> [num-records]");
- process::exit(1);
- }
+ let app = App::new("parquet-read")
+ .version(crate_version!())
+ .author(crate_authors!())
+ .about("Read data from parquet file")
+ .arg(
+ Arg::with_name("file_path")
+ .value_name("file-path")
+ .required(true)
+ .index(1)
+ .help("Path to a parquet file"),
+ )
+ .arg(
+ Arg::with_name("num_records")
+ .value_name("num-records")
+ .index(2)
+ .help(
+ "Number of records to read. When not provided, all records are read.",
+ ),
+ )
+ .arg(
+ Arg::with_name("json")
+ .short("j")
+ .long("json")
+ .takes_value(false)
+ .help("Print parquet file in JSON lines Format"),
Review comment:
nit: Format -> format
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] codecov-io edited a comment on pull request #9306: ARROW-10297: [Rust] Parameter for parquet-read to output data in json format, add "cli" feature to parquet crate
Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9306:
URL: https://github.com/apache/arrow/pull/9306#issuecomment-766366191
# [Codecov](https://codecov.io/gh/apache/arrow/pull/9306?src=pr&el=h1) Report
> Merging [#9306](https://codecov.io/gh/apache/arrow/pull/9306?src=pr&el=desc) (01c0333) into [master](https://codecov.io/gh/apache/arrow/commit/437c8c944acb3479b76804f041f5f8cbce309fa7?el=desc) (437c8c9) will **increase** coverage by `0.13%`.
> The diff coverage is `8.33%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9306/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9306?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #9306 +/- ##
==========================================
+ Coverage 81.88% 82.02% +0.13%
==========================================
Files 215 227 +12
Lines 52988 53476 +488
==========================================
+ Hits 43391 43864 +473
- Misses 9597 9612 +15
```
| [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9306?src=pr&el=tree) | Coverage Δ | |
|---|---|---|
| [rust/parquet/src/record/api.rs](https://codecov.io/gh/apache/arrow/pull/9306/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9yZWNvcmQvYXBpLnJz) | `92.63% <8.33%> (-5.26%)` | :arrow_down: |
| [rust/datafusion/src/datasource/csv.rs](https://codecov.io/gh/apache/arrow/pull/9306/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL2Nzdi5ycw==) | `60.46% <0.00%> (-4.54%)` | :arrow_down: |
| [rust/datafusion/src/scalar.rs](https://codecov.io/gh/apache/arrow/pull/9306/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9zY2FsYXIucnM=) | `55.85% <0.00%> (-3.12%)` | :arrow_down: |
| [rust/arrow/src/array/equal/boolean.rs](https://codecov.io/gh/apache/arrow/pull/9306/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvYm9vbGVhbi5ycw==) | `97.43% <0.00%> (-2.57%)` | :arrow_down: |
| [rust/datafusion/src/physical\_plan/projection.rs](https://codecov.io/gh/apache/arrow/pull/9306/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3Byb2plY3Rpb24ucnM=) | `84.93% <0.00%> (-0.99%)` | :arrow_down: |
| [rust/datafusion/src/sql/parser.rs](https://codecov.io/gh/apache/arrow/pull/9306/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9zcWwvcGFyc2VyLnJz) | `81.45% <0.00%> (-0.90%)` | :arrow_down: |
| [rust/arrow/src/buffer.rs](https://codecov.io/gh/apache/arrow/pull/9306/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYnVmZmVyLnJz) | `95.41% <0.00%> (-0.80%)` | :arrow_down: |
| [...t/datafusion/src/physical\_plan/coalesce\_batches.rs](https://codecov.io/gh/apache/arrow/pull/9306/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2NvYWxlc2NlX2JhdGNoZXMucnM=) | `84.11% <0.00%> (-0.80%)` | :arrow_down: |
| [rust/arrow/src/compute/kernels/filter.rs](https://codecov.io/gh/apache/arrow/pull/9306/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL2ZpbHRlci5ycw==) | `97.71% <0.00%> (-0.77%)` | :arrow_down: |
| [...datafusion/src/physical\_plan/string\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9306/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3N0cmluZ19leHByZXNzaW9ucy5ycw==) | `86.95% <0.00%> (-0.55%)` | :arrow_down: |
| ... and [90 more](https://codecov.io/gh/apache/arrow/pull/9306/diff?src=pr&el=tree-more) | |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9306?src=pr&el=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9306?src=pr&el=footer). Last update [437c8c9...01c0333](https://codecov.io/gh/apache/arrow/pull/9306?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] alamb commented on a change in pull request #9306: ARROW-10297: [Rust] Parameter for parquet-read to output data in json format
Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #9306:
URL: https://github.com/apache/arrow/pull/9306#discussion_r565660664
##########
File path: rust/parquet/Cargo.toml
##########
@@ -43,6 +43,8 @@ chrono = "0.4"
num-bigint = "0.3"
arrow = { path = "../arrow", version = "4.0.0-SNAPSHOT", optional = true }
base64 = { version = "0.12", optional = true }
+clap = "2.33.3"
Review comment:
That is what I would suggest -- do you have any thoughts @nevi-me or @sunchao ?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] sunchao commented on a change in pull request #9306: ARROW-10297: [Rust] Parameter for parquet-read to output data in json format, add "cli" feature to parquet crate
Posted by GitBox <gi...@apache.org>.
sunchao commented on a change in pull request #9306:
URL: https://github.com/apache/arrow/pull/9306#discussion_r567470915
##########
File path: rust/parquet/src/bin/parquet-rowcount.rs
##########
@@ -34,30 +34,44 @@
//! ```
//!
//! # Usage
-//!
//! ```
-//! parquet-rowcount <file-path> ...
+//! parquet-rowcount <file-paths>...
//! ```
-//! where `file-path` is the path to a Parquet file and `...` is any additional number of
-//! parquet files to count the number of rows from.
+//!
+//! ## Flags
+//! -h, --help Prints help information
+//! -V, --version Prints version information
+//!
+//! ## Args
+//! <file-paths>... List of parquet files to read from
Review comment:
ditto
##########
File path: rust/parquet/src/bin/parquet-schema.rs
##########
@@ -34,49 +34,65 @@
//! ```
//!
//! # Usage
-//!
//! ```
-//! parquet-schema <file-path> [verbose]
+//! parquet-schema [FLAGS] <file-path>
//! ```
-//! where `file-path` is the path to a Parquet file and `verbose` is the optional boolean
-//! flag that allows to print schema only, when set to `false` (default behaviour when
-//! not provided), or print full file metadata, when set to `true`.
+//!
+//! ## Flags
+//! -h, --help Prints help information
+//! -V, --version Prints version information
+//! -v, --verbose Enable printing full file metadata
+//!
+//! ## Args
+//! <file-path> Path to a parquet file
Review comment:
ditto
##########
File path: rust/parquet/src/bin/parquet-read.rs
##########
@@ -34,39 +34,72 @@
//! ```
//!
//! # Usage
-//!
//! ```
//! parquet-read <file-path> [num-records]
//! ```
-//! where `file-path` is the path to a Parquet file and `num-records` is the optional
-//! numeric option that allows to specify number of records to read from a file.
-//! When not provided, all records are read.
+//!
+//! ## Flags
+//! -h, --help Prints help information
+//! -j, --json Print parquet file in JSON lines Format
+//! -V, --version Prints version information
+//!
+//! ## Args
+//! <file-path> Path to a parquet file
Review comment:
nit: let's use Parquet instead of parquet :)
##########
File path: rust/parquet/src/bin/parquet-read.rs
##########
@@ -34,39 +34,72 @@
//! ```
//!
//! # Usage
-//!
//! ```
//! parquet-read <file-path> [num-records]
//! ```
-//! where `file-path` is the path to a Parquet file and `num-records` is the optional
-//! numeric option that allows to specify number of records to read from a file.
-//! When not provided, all records are read.
+//!
+//! ## Flags
+//! -h, --help Prints help information
+//! -j, --json Print parquet file in JSON lines Format
Review comment:
ditto
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] manojkarthick commented on a change in pull request #9306: ARROW-10297: [Rust] Parameter for parquet-read to output data in json format
Posted by GitBox <gi...@apache.org>.
manojkarthick commented on a change in pull request #9306:
URL: https://github.com/apache/arrow/pull/9306#discussion_r567264338
##########
File path: rust/parquet/Cargo.toml
##########
@@ -43,6 +43,8 @@ chrono = "0.4"
num-bigint = "0.3"
arrow = { path = "../arrow", version = "4.0.0-SNAPSHOT", optional = true }
base64 = { version = "0.12", optional = true }
+clap = "2.33.3"
Review comment:
Thanks @alamb - I've added a new feature called `cli` that encapsulates the binaries and has `serde_json` and `clap` as dependencies.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] codecov-io commented on pull request #9306: ARROW-10297: [Rust] Parameter for parquet-read to output data in json format
Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #9306:
URL: https://github.com/apache/arrow/pull/9306#issuecomment-766366191
# [Codecov](https://codecov.io/gh/apache/arrow/pull/9306?src=pr&el=h1) Report
> Merging [#9306](https://codecov.io/gh/apache/arrow/pull/9306?src=pr&el=desc) (1a064a3) into [master](https://codecov.io/gh/apache/arrow/commit/67d0c2e38011cd883059e3a9fd0ea08088661707?el=desc) (67d0c2e) will **decrease** coverage by `0.06%`.
> The diff coverage is `0.00%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9306/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9306?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #9306 +/- ##
==========================================
- Coverage 81.84% 81.77% -0.07%
==========================================
Files 215 215
Lines 52949 52993 +44
==========================================
Hits 43336 43336
- Misses 9613 9657 +44
```
| [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9306?src=pr&el=tree) | Coverage Δ | |
|---|---|---|
| [rust/parquet/src/bin/parquet-read.rs](https://codecov.io/gh/apache/arrow/pull/9306/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9iaW4vcGFycXVldC1yZWFkLnJz) | `0.00% <0.00%> (ø)` | |
| [rust/parquet/src/bin/parquet-rowcount.rs](https://codecov.io/gh/apache/arrow/pull/9306/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9iaW4vcGFycXVldC1yb3djb3VudC5ycw==) | `0.00% <0.00%> (ø)` | |
| [rust/parquet/src/bin/parquet-schema.rs](https://codecov.io/gh/apache/arrow/pull/9306/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9iaW4vcGFycXVldC1zY2hlbWEucnM=) | `0.00% <0.00%> (ø)` | |
| [rust/parquet/src/record/api.rs](https://codecov.io/gh/apache/arrow/pull/9306/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9yZWNvcmQvYXBpLnJz) | `92.50% <0.00%> (-5.38%)` | :arrow_down: |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9306?src=pr&el=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9306?src=pr&el=footer). Last update [67d0c2e...1a064a3](https://codecov.io/gh/apache/arrow/pull/9306?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] manojkarthick commented on pull request #9306: ARROW-10297: [Rust] Parameter for parquet-read to output data in json format, add "cli" feature to parquet crate
Posted by GitBox <gi...@apache.org>.
manojkarthick commented on pull request #9306:
URL: https://github.com/apache/arrow/pull/9306#issuecomment-771160489
Thank you @alamb and @sunchao !
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] manojkarthick commented on a change in pull request #9306: ARROW-10297: [Rust] Parameter for parquet-read to output data in json format
Posted by GitBox <gi...@apache.org>.
manojkarthick commented on a change in pull request #9306:
URL: https://github.com/apache/arrow/pull/9306#discussion_r564488177
##########
File path: rust/parquet/src/record/api.rs
##########
@@ -635,6 +648,55 @@ impl Field {
_ => nyi!(descr, value),
}
}
+
+ #[cfg(feature = "json_output")]
+ pub fn to_json_value(&self) -> Value {
+ match &self {
+ Field::Null => Value::Null,
+ Field::Bool(b) => Value::Bool(*b),
+ Field::Byte(n) => Value::Number(serde_json::Number::from(*n)),
+ Field::Short(n) => Value::Number(serde_json::Number::from(*n)),
+ Field::Int(n) => Value::Number(serde_json::Number::from(*n)),
+ Field::Long(n) => Value::Number(serde_json::Number::from(*n)),
+ Field::UByte(n) => Value::Number(serde_json::Number::from(*n)),
+ Field::UShort(n) => Value::Number(serde_json::Number::from(*n)),
+ Field::UInt(n) => Value::Number(serde_json::Number::from(*n)),
+ Field::ULong(n) => Value::Number(serde_json::Number::from(*n)),
+ Field::Float(n) => serde_json::Number::from_f64(*n as f64)
+ .map(Value::Number)
+ .unwrap_or(Value::Null),
+ Field::Double(n) => serde_json::Number::from_f64(*n)
+ .map(Value::Number)
+ .unwrap_or(Value::Null),
+ Field::Decimal(n) => Value::String(convert_decimal_to_string(&n)),
+ Field::Str(s) => Value::String(s.to_owned()),
+ Field::Bytes(b) => Value::String(base64::encode(b.data())),
+ Field::Date(d) => Value::String(convert_date_to_string(*d)),
+ Field::TimestampMillis(ts) => {
+ Value::String(convert_timestamp_millis_to_string(*ts))
+ }
+ Field::TimestampMicros(ts) => {
+ Value::String(convert_timestamp_micros_to_string(*ts))
+ }
+ Field::Group(row) => row.to_json_value(),
+ Field::ListInternal(fields) => {
+ Value::Array(fields.elements.iter().map(|f| f.to_json_value()).collect())
+ }
+ Field::MapInternal(map) => Value::Object(
+ map.entries
+ .iter()
+ .map(|(key_field, value_field)| {
+ let key_val = key_field.to_json_value();
+ let key_str = key_val
+ .as_str()
+ .map(|s| s.to_owned())
+ .unwrap_or_else(|| key_val.to_string());
+ (key_str, value_field.to_json_value())
+ })
+ .collect(),
+ ),
+ }
+ }
}
Review comment:
Thank you - I will work on that.
##########
File path: rust/parquet/src/bin/parquet-read.rs
##########
@@ -34,39 +34,98 @@
//! ```
//!
//! # Usage
-//!
//! ```
-//! parquet-read <file-path> [num-records]
+//! parquet-read <file-path> [num-records]
//! ```
-//! where `file-path` is the path to a Parquet file and `num-records` is the optional
-//! numeric option that allows to specify number of records to read from a file.
-//! When not provided, all records are read.
+//!
+//! ## Flags
+//! -h, --help Prints help information
+//! -j, --json Print parquet file in JSON lines Format
+//! -V, --version Prints version information
+//!
+//! ## Args
+//! <file-path> Path to a parquet file
+//! <num-records> Number of records to read. When not provided, all records are read.
//!
//! Note that `parquet-read` reads full file schema, no projection or filtering is
//! applied.
extern crate parquet;
-use std::{env, fs::File, path::Path, process};
+use std::{env, fs::File, path::Path};
+
+use clap::{crate_authors, crate_version, App, Arg};
use parquet::file::reader::{FileReader, SerializedFileReader};
+use parquet::record::Row;
fn main() {
- let args: Vec<String> = env::args().collect();
- if args.len() != 2 && args.len() != 3 {
- println!("Usage: parquet-read <file-path> [num-records]");
- process::exit(1);
- }
+ #[cfg(not(feature = "json_output"))]
+ let app = App::new("parquet-read")
+ .version(crate_version!())
+ .author(crate_authors!())
+ .about("Read data from parquet file")
+ .arg(
+ Arg::with_name("file_path")
+ .value_name("file-path")
+ .required(true)
+ .index(1)
+ .help("Path to a parquet file"),
+ )
+ .arg(
+ Arg::with_name("num_records")
+ .value_name("num-records")
+ .index(2)
+ .help(
+ "Number of records to read. When not provided, all records are read.",
+ ),
+ );
+
+ #[cfg(feature = "json_output")]
+ let app = App::new("parquet-read")
+ .version(crate_version!())
+ .author(crate_authors!())
+ .about("Read data from parquet file")
+ .arg(
+ Arg::with_name("file_path")
+ .value_name("file-path")
+ .required(true)
+ .index(1)
+ .help("Path to a parquet file"),
+ )
+ .arg(
+ Arg::with_name("num_records")
+ .value_name("num-records")
+ .index(2)
+ .help(
+ "Number of records to read. When not provided, all records are read.",
+ ),
+ )
+ .arg(
Review comment:
I tried but that didn't work since `clap::App` doesn't implement the `Copy` trait. OTOH, if we move this into a CLI tools feature, I don't think we need to have a separate feature for json output. Thoughts?
##########
File path: rust/parquet/Cargo.toml
##########
@@ -43,6 +43,8 @@ chrono = "0.4"
num-bigint = "0.3"
arrow = { path = "../arrow", version = "4.0.0-SNAPSHOT", optional = true }
base64 = { version = "0.12", optional = true }
+clap = "2.33.3"
Review comment:
Thanks @alamb ! Just to confirm: Having one feature for command line tools that has both `clap` + `serde_json` as dependencies and when used without that feature, no binaries are produced. Is my understanding correct?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] manojkarthick commented on a change in pull request #9306: ARROW-10297: [Rust] Parameter for parquet-read to output data in json format, add "cli" feature to parquet crate
Posted by GitBox <gi...@apache.org>.
manojkarthick commented on a change in pull request #9306:
URL: https://github.com/apache/arrow/pull/9306#discussion_r567465341
##########
File path: rust/parquet/README.md
##########
@@ -79,23 +79,23 @@ enabled by adding `RUSTFLAGS="-C target-feature=+sse4.2"` before the
`cargo build` command.
## Test
-Run `cargo test` for unit tests.
+Run `cargo test` for unit tests. To also run tests related to the binaries, use `cargo test --features cli`.
## Binaries
-The following binaries are provided (use `cargo install` to install them):
+The following binaries are provided (use `cargo install --features cli` to install them):
- **parquet-schema** for printing Parquet file schema and metadata.
-`Usage: parquet-schema <file-path> [verbose]`, where `file-path` is the path to a Parquet file,
-and optional `verbose` is the boolean flag that allows to print full metadata or schema only
-(when not specified only schema will be printed).
+`Usage: parquet-schema <file-path>`, where `file-path` is the path to a Parquet file. Use `-v/--verbose` flag
+to print full metadata or schema only (when not specified only schema will be printed).
- **parquet-read** for reading records from a Parquet file.
`Usage: parquet-read <file-path> [num-records]`, where `file-path` is the path to a Parquet file,
and `num-records` is the number of records to read from a file (when not specified all records will
-be printed).
+be printed). Use `cargo install --features json_output` to enable `-j/--json` flag for printing output
Review comment:
That was an outdated comment as I had removed that feature based on earlier feedback. I've updated the PR to reflect your other comments as well. Thanks @sunchao !
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] codecov-io edited a comment on pull request #9306: ARROW-10297: [Rust] Parameter for parquet-read to output data in json format
Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9306:
URL: https://github.com/apache/arrow/pull/9306#issuecomment-766366191
# [Codecov](https://codecov.io/gh/apache/arrow/pull/9306?src=pr&el=h1) Report
> Merging [#9306](https://codecov.io/gh/apache/arrow/pull/9306?src=pr&el=desc) (e9e6fe3) into [master](https://codecov.io/gh/apache/arrow/commit/67d0c2e38011cd883059e3a9fd0ea08088661707?el=desc) (67d0c2e) will **decrease** coverage by `0.06%`.
> The diff coverage is `0.00%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9306/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9306?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #9306 +/- ##
==========================================
- Coverage 81.84% 81.77% -0.07%
==========================================
Files 215 215
Lines 52949 52993 +44
==========================================
Hits 43336 43336
- Misses 9613 9657 +44
```
| [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9306?src=pr&el=tree) | Coverage Δ | |
|---|---|---|
| [rust/parquet/src/bin/parquet-read.rs](https://codecov.io/gh/apache/arrow/pull/9306/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9iaW4vcGFycXVldC1yZWFkLnJz) | `0.00% <0.00%> (ø)` | |
| [rust/parquet/src/bin/parquet-rowcount.rs](https://codecov.io/gh/apache/arrow/pull/9306/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9iaW4vcGFycXVldC1yb3djb3VudC5ycw==) | `0.00% <0.00%> (ø)` | |
| [rust/parquet/src/bin/parquet-schema.rs](https://codecov.io/gh/apache/arrow/pull/9306/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9iaW4vcGFycXVldC1zY2hlbWEucnM=) | `0.00% <0.00%> (ø)` | |
| [rust/parquet/src/record/api.rs](https://codecov.io/gh/apache/arrow/pull/9306/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9yZWNvcmQvYXBpLnJz) | `92.50% <0.00%> (-5.38%)` | :arrow_down: |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9306?src=pr&el=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9306?src=pr&el=footer). Last update [67d0c2e...e9e6fe3](https://codecov.io/gh/apache/arrow/pull/9306?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] codecov-io edited a comment on pull request #9306: ARROW-10297: [Rust] Parameter for parquet-read to output data in json format
Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9306:
URL: https://github.com/apache/arrow/pull/9306#issuecomment-766366191
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] codecov-io edited a comment on pull request #9306: ARROW-10297: [Rust] Parameter for parquet-read to output data in json format
Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9306:
URL: https://github.com/apache/arrow/pull/9306#issuecomment-766366191
# [Codecov](https://codecov.io/gh/apache/arrow/pull/9306?src=pr&el=h1) Report
> Merging [#9306](https://codecov.io/gh/apache/arrow/pull/9306?src=pr&el=desc) (5fd156a) into [master](https://codecov.io/gh/apache/arrow/commit/67d0c2e38011cd883059e3a9fd0ea08088661707?el=desc) (67d0c2e) will **decrease** coverage by `0.06%`.
> The diff coverage is `0.00%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9306/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9306?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #9306 +/- ##
==========================================
- Coverage 81.84% 81.77% -0.07%
==========================================
Files 215 215
Lines 52949 52993 +44
==========================================
Hits 43336 43336
- Misses 9613 9657 +44
```
| [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9306?src=pr&el=tree) | Coverage Δ | |
|---|---|---|
| [rust/parquet/src/bin/parquet-read.rs](https://codecov.io/gh/apache/arrow/pull/9306/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9iaW4vcGFycXVldC1yZWFkLnJz) | `0.00% <0.00%> (ø)` | |
| [rust/parquet/src/bin/parquet-rowcount.rs](https://codecov.io/gh/apache/arrow/pull/9306/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9iaW4vcGFycXVldC1yb3djb3VudC5ycw==) | `0.00% <0.00%> (ø)` | |
| [rust/parquet/src/bin/parquet-schema.rs](https://codecov.io/gh/apache/arrow/pull/9306/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9iaW4vcGFycXVldC1zY2hlbWEucnM=) | `0.00% <0.00%> (ø)` | |
| [rust/parquet/src/record/api.rs](https://codecov.io/gh/apache/arrow/pull/9306/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9yZWNvcmQvYXBpLnJz) | `92.50% <0.00%> (-5.38%)` | :arrow_down: |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9306?src=pr&el=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9306?src=pr&el=footer). Last update [67d0c2e...e9e6fe3](https://codecov.io/gh/apache/arrow/pull/9306?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] alamb commented on pull request #9306: ARROW-10297: [Rust] Parameter for parquet-read to output data in json format, add "cli" feature to parquet crate
Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9306:
URL: https://github.com/apache/arrow/pull/9306#issuecomment-771158985
Thanks again @manojkarthick 👍
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] alamb commented on a change in pull request #9306: ARROW-10297: [Rust] Parameter for parquet-read to output data in json format, add "cli" feature to parquet crate
Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #9306:
URL: https://github.com/apache/arrow/pull/9306#discussion_r567413050
##########
File path: rust/parquet/src/record/api.rs
##########
@@ -1608,6 +1670,117 @@ mod tests {
);
}
}
+
+ #[test]
+ #[cfg(feature = "cli")]
+ fn test_to_json_value() {
+ assert_eq!(Field::Null.to_json_value(), Value::Null);
+ assert_eq!(Field::Bool(true).to_json_value(), Value::Bool(true));
+ assert_eq!(Field::Bool(false).to_json_value(), Value::Bool(false));
+ assert_eq!(
+ Field::Byte(1).to_json_value(),
+ Value::Number(serde_json::Number::from(1))
+ );
+ assert_eq!(
+ Field::Short(2).to_json_value(),
+ Value::Number(serde_json::Number::from(2))
+ );
+ assert_eq!(
+ Field::Int(3).to_json_value(),
+ Value::Number(serde_json::Number::from(3))
+ );
+ assert_eq!(
+ Field::Long(4).to_json_value(),
+ Value::Number(serde_json::Number::from(4))
+ );
+ assert_eq!(
+ Field::UByte(1).to_json_value(),
+ Value::Number(serde_json::Number::from(1))
+ );
+ assert_eq!(
+ Field::UShort(2).to_json_value(),
+ Value::Number(serde_json::Number::from(2))
+ );
+ assert_eq!(
+ Field::UInt(3).to_json_value(),
+ Value::Number(serde_json::Number::from(3))
+ );
+ assert_eq!(
+ Field::ULong(4).to_json_value(),
+ Value::Number(serde_json::Number::from(4))
+ );
+ assert_eq!(
+ Field::Float(5.0).to_json_value(),
+ Value::Number(serde_json::Number::from_f64(f64::from(5.0 as f32)).unwrap())
+ );
+ assert_eq!(
+ Field::Float(5.1234).to_json_value(),
+ Value::Number(
+ serde_json::Number::from_f64(f64::from(5.1234 as f32)).unwrap()
+ )
+ );
+ assert_eq!(
+ Field::Double(6.0).to_json_value(),
+ Value::Number(serde_json::Number::from_f64(6.0 as f64).unwrap())
+ );
+ assert_eq!(
+ Field::Double(6.1234).to_json_value(),
+ Value::Number(serde_json::Number::from_f64(6.1234 as f64).unwrap())
+ );
+ assert_eq!(
+ Field::Str("abc".to_string()).to_json_value(),
+ Value::String(String::from("abc"))
+ );
+ assert_eq!(
+ Field::Decimal(Decimal::from_i32(4, 8, 2)).to_json_value(),
+ Value::String(String::from("0.04"))
+ );
+ assert_eq!(
+ Field::Bytes(ByteArray::from(vec![1, 2, 3])).to_json_value(),
+ Value::String(String::from("AQID"))
+ );
+ assert_eq!(
Review comment:
When I ran the new test, it failed for me (probably because my timezone is different). Maybe we should print out UTF timestamps for consistency?
```
cargo test --features=cli -- test_to_json_value
cargo test --features=cli -- test_to_json_value
Finished test [unoptimized + debuginfo] target(s) in 0.10s
Running /Users/alamb/Software/arrow2/rust/target/debug/deps/parquet-113eb074b13ea986
running 1 test
test record::api::tests::test_to_json_value ... FAILED
failures:
---- record::api::tests::test_to_json_value stdout ----
thread 'record::api::tests::test_to_json_value' panicked at 'assertion failed: `(left == right)`
left: `String("1969-12-31 19:00:12 -05:00")`,
right: `String("1969-12-31 16:00:12 -08:00")`', parquet/src/record/api.rs:1742:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
failures:
record::api::tests::test_to_json_value
test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 429 filtered out
error: test failed, to rerun pass '--lib'
alamb@ip-10-0-0-49 parquet %
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] manojkarthick commented on a change in pull request #9306: ARROW-10297: [Rust] Parameter for parquet-read to output data in json format, add "cli" feature to parquet crate
Posted by GitBox <gi...@apache.org>.
manojkarthick commented on a change in pull request #9306:
URL: https://github.com/apache/arrow/pull/9306#discussion_r567423337
##########
File path: rust/parquet/src/record/api.rs
##########
@@ -1608,6 +1670,117 @@ mod tests {
);
}
}
+
+ #[test]
+ #[cfg(feature = "cli")]
+ fn test_to_json_value() {
+ assert_eq!(Field::Null.to_json_value(), Value::Null);
+ assert_eq!(Field::Bool(true).to_json_value(), Value::Bool(true));
+ assert_eq!(Field::Bool(false).to_json_value(), Value::Bool(false));
+ assert_eq!(
+ Field::Byte(1).to_json_value(),
+ Value::Number(serde_json::Number::from(1))
+ );
+ assert_eq!(
+ Field::Short(2).to_json_value(),
+ Value::Number(serde_json::Number::from(2))
+ );
+ assert_eq!(
+ Field::Int(3).to_json_value(),
+ Value::Number(serde_json::Number::from(3))
+ );
+ assert_eq!(
+ Field::Long(4).to_json_value(),
+ Value::Number(serde_json::Number::from(4))
+ );
+ assert_eq!(
+ Field::UByte(1).to_json_value(),
+ Value::Number(serde_json::Number::from(1))
+ );
+ assert_eq!(
+ Field::UShort(2).to_json_value(),
+ Value::Number(serde_json::Number::from(2))
+ );
+ assert_eq!(
+ Field::UInt(3).to_json_value(),
+ Value::Number(serde_json::Number::from(3))
+ );
+ assert_eq!(
+ Field::ULong(4).to_json_value(),
+ Value::Number(serde_json::Number::from(4))
+ );
+ assert_eq!(
+ Field::Float(5.0).to_json_value(),
+ Value::Number(serde_json::Number::from_f64(f64::from(5.0 as f32)).unwrap())
+ );
+ assert_eq!(
+ Field::Float(5.1234).to_json_value(),
+ Value::Number(
+ serde_json::Number::from_f64(f64::from(5.1234 as f32)).unwrap()
+ )
+ );
+ assert_eq!(
+ Field::Double(6.0).to_json_value(),
+ Value::Number(serde_json::Number::from_f64(6.0 as f64).unwrap())
+ );
+ assert_eq!(
+ Field::Double(6.1234).to_json_value(),
+ Value::Number(serde_json::Number::from_f64(6.1234 as f64).unwrap())
+ );
+ assert_eq!(
+ Field::Str("abc".to_string()).to_json_value(),
+ Value::String(String::from("abc"))
+ );
+ assert_eq!(
+ Field::Decimal(Decimal::from_i32(4, 8, 2)).to_json_value(),
+ Value::String(String::from("0.04"))
+ );
+ assert_eq!(
+ Field::Bytes(ByteArray::from(vec![1, 2, 3])).to_json_value(),
+ Value::String(String::from("AQID"))
+ );
+ assert_eq!(
Review comment:
Thanks for the quick feedback @alamb ! I've updated all the tests to use UTC timestamps.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org