Posted to issues@hawq.apache.org by mgoddard-pivotal <gi...@git.apache.org> on 2018/04/27 02:04:38 UTC
[GitHub] incubator-hawq pull request #1357: Changes to enable reuse of PXF Parquet cl...
GitHub user mgoddard-pivotal opened a pull request:
https://github.com/apache/incubator-hawq/pull/1357
Changes to enable reuse of PXF Parquet classes, for data in S3
I am adding support for reading and writing Parquet-formatted data stored in S3 over PXF. I wanted to reuse the existing PXF Parquet classes, since they already provide most of what is needed, but I had to add a few methods because some of their members, and one method I needed to call, were private.
I'd like to submit this PR for just these changes. The PR for the overall S3 Parquet project is in a different module, and I will submit it separately.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/mgoddard-pivotal/incubator-hawq s3-parquet
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-hawq/pull/1357.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1357
----
commit 95471117c0f866c80890abcb5740595bf0614e25
Author: Michael Goddard <mg...@...>
Date: 2018-04-27T01:53:12Z
Changes so that these Parquet classes could be reused to support S3 Parquet reads
----
---
[GitHub] incubator-hawq pull request #1357: Changes to enable reuse of PXF Parquet cl...
Posted by mgoddard-pivotal <gi...@git.apache.org>.
Github user mgoddard-pivotal closed the pull request at:
https://github.com/apache/incubator-hawq/pull/1357
---
[GitHub] incubator-hawq issue #1357: Changes to enable reuse of PXF Parquet classes, ...
Posted by shivzone <gi...@git.apache.org>.
Github user shivzone commented on the issue:
https://github.com/apache/incubator-hawq/pull/1357
@mgoddard-pivotal please close this PR
---
[GitHub] incubator-hawq pull request #1357: Changes to enable reuse of PXF Parquet cl...
Posted by shivzone <gi...@git.apache.org>.
Github user shivzone commented on a diff in the pull request:
https://github.com/apache/incubator-hawq/pull/1357#discussion_r185127394
--- Diff: pxf/pxf-hdfs/src/main/java/org/apache/hawq/pxf/plugins/hdfs/ParquetResolver.java ---
@@ -55,6 +56,16 @@ public ParquetResolver(InputData metaData) {
super(metaData);
}
+ // This method facilitates passing in the MessageType instance, which is
--- End diff --
Javadoc won't work well with comments in this style. Please refer to the function above for the standard convention.
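To illustrate the comment-style point, here is a minimal sketch of the two styles side by side. The class and method names are hypothetical stand-ins, not the PR's actual code; only the comment syntax matters here.

```java
public class CommentStyles {

    // A line comment like this one is skipped by the javadoc tool,
    // so it does not appear in the generated API documentation.
    public static String describeLineCommented() {
        return "line";
    }

    /**
     * A Javadoc comment opens with a slash and two asterisks and sits
     * directly above the declaration it documents.
     *
     * @return a fixed label, used here only for illustration
     */
    public static String describeJavadocCommented() {
        return "javadoc";
    }

    public static void main(String[] args) {
        System.out.println(describeLineCommented());
        System.out.println(describeJavadocCommented());
    }
}
```

Both methods compile and behave the same way; the difference is only in what the javadoc tool picks up.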
---
[GitHub] incubator-hawq pull request #1357: Changes to enable reuse of PXF Parquet cl...
Posted by mgoddard-pivotal <gi...@git.apache.org>.
Github user mgoddard-pivotal commented on a diff in the pull request:
https://github.com/apache/incubator-hawq/pull/1357#discussion_r185272600
--- Diff: pxf/pxf-hdfs/src/main/java/org/apache/hawq/pxf/plugins/hdfs/ParquetResolver.java ---
@@ -55,6 +56,16 @@ public ParquetResolver(InputData metaData) {
super(metaData);
}
+ // This method facilitates passing in the MessageType instance, which is
--- End diff --
@shivzone, I've switched to a Javadoc-style comment. Does this work?
---
[GitHub] incubator-hawq issue #1357: Changes to enable reuse of PXF Parquet classes, ...
Posted by shivzone <gi...@git.apache.org>.
Github user shivzone commented on the issue:
https://github.com/apache/incubator-hawq/pull/1357
Merged to Master
---
[GitHub] incubator-hawq pull request #1357: Changes to enable reuse of PXF Parquet cl...
Posted by shivzone <gi...@git.apache.org>.
Github user shivzone commented on a diff in the pull request:
https://github.com/apache/incubator-hawq/pull/1357#discussion_r185127176
--- Diff: pxf/pxf-hdfs/src/main/java/org/apache/hawq/pxf/plugins/hdfs/ParquetFileAccessor.java ---
@@ -130,21 +130,40 @@ private Group readNextGroup() {
*/
public ParquetFileAccessor(InputData input) {
super(input);
- ParquetUserData parquetUserData = HdfsUtilities.parseParquetUserData(input);
- schema = parquetUserData.getSchema();
+ }
+
+ public MessageType getSchema() {
+ return schema;
+ }
+
+ public void setSchema(MessageType schema) {
+ this.schema = schema;
+ columnIO = new ColumnIOFactory().getColumnIO(schema);
+ }
+
+ // Enable sub-classes of ParquetFileAccessor to set up recordIterator
+ public void setRecordIterator() {
+ recordIterator = new RecordIterator(reader);
+ }
+
+ public void setReader (ParquetFileReader reader) {
+ this.reader = reader;
+ }
+
+ public boolean iteratorHasNext() {
+ return recordIterator.hasNext();
}
@Override
public boolean openForRead() throws Exception {
Configuration conf = new Configuration();
Path file = new Path(inputData.getDataSource());
FileSplit fileSplit = HdfsUtilities.parseFileSplit(inputData);
+ setSchema(HdfsUtilities.parseParquetUserData(inputData).getSchema());
--- End diff --
At some point we should look into moving some of the Avro/Parquet utility functions into a generic package (maybe pxf-api) to decouple Parquet-specific functionality from the hdfs package. For now, this is fine.
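The shape of the change in the diff above (state kept private, with public setters that a subclass can call from its own openForRead) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the PXF code: String stands in for MessageType, List<String> stands in for ParquetFileReader, and BaseAccessor, S3Accessor, readNext and countRows are hypothetical names introduced only for this example.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class AccessorReuseSketch {

    /** Hypothetical stand-in for ParquetFileAccessor: private state, public hooks. */
    public static class BaseAccessor {
        private String schema;                   // stand-in for MessageType
        private List<String> reader;             // stand-in for ParquetFileReader
        private Iterator<String> recordIterator; // stand-in for RecordIterator

        public String getSchema() { return schema; }
        public void setSchema(String schema) { this.schema = schema; }
        public void setReader(List<String> reader) { this.reader = reader; }
        public void setRecordIterator() { recordIterator = reader.iterator(); }
        public boolean iteratorHasNext() { return recordIterator.hasNext(); }

        // Hypothetical helper so the sketch can consume rows.
        public String readNext() { return recordIterator.next(); }

        /** Default path: the schema is resolved at open time, not in the constructor. */
        public boolean openForRead() {
            setSchema("schema-from-hdfs-metadata");
            setReader(Arrays.asList("hdfs-row-1"));
            setRecordIterator();
            return iteratorHasNext();
        }
    }

    /** Hypothetical subclass that supplies its own schema and reader. */
    public static class S3Accessor extends BaseAccessor {
        @Override
        public boolean openForRead() {
            setSchema("schema-from-s3-footer");
            setReader(Arrays.asList("s3-row-1", "s3-row-2"));
            setRecordIterator();
            return iteratorHasNext();
        }
    }

    /** Opens the accessor and counts the rows it yields. */
    public static int countRows(BaseAccessor accessor) {
        int rows = 0;
        if (accessor.openForRead()) {
            while (accessor.iteratorHasNext()) {
                accessor.readNext();
                rows++;
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(countRows(new BaseAccessor()));
        System.out.println(countRows(new S3Accessor()));
    }
}
```

Because the subclass only touches the base class through public setters, the fields themselves can stay private, which appears to be the trade-off the PR makes.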
---