You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@parquet.apache.org by GitBox <gi...@apache.org> on 2020/09/15 19:56:53 UTC

[GitHub] [parquet-mr] HunterL opened a new pull request #815: PARQUET-1776: NIO wrapper for Output/Input File

HunterL opened a new pull request #815:
URL: https://github.com/apache/parquet-mr/pull/815


   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason:
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
     1. Subject is separated from body by a blank line
     1. Subject is limited to 50 characters (not including Jira issue reference)
     1. Subject does not end with a period
     1. Subject uses the imperative mood ("add", not "adding")
     1. Body wraps at 72 characters
     1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes how to use it.
     - All the public functions and the classes in the PR contain Javadoc that explain what it does
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] belugabehr commented on pull request #815: PARQUET-1776: NIO wrapper for Output/Input File

Posted by GitBox <gi...@apache.org>.
belugabehr commented on pull request #815:
URL: https://github.com/apache/parquet-mr/pull/815#issuecomment-696929711


   Here are a couple of starters:
   
   ```
          private final ByteBuffer oneByteBuffer = ByteBuffer.allocate(1);
   
   	@Override
   	public int read() throws IOException {
   		int read = this.channel.read(this.oneByteBuffer.rewind());
   		if (read < 0) {
   			return read;
   		}
   		return (0xFF & this.oneByteBuffer.flip().get());
   	}
   
   	@Override
   	public void readFully(byte[] bytes) throws IOException {
   		ByteBuffer bb = ByteBuffer.allocate(bytes.length);
   					while (bb.hasRemaining()) {
   						int read = this.channel.read(bb);
   						if (read == -1) {
   							throw new EOFException();
   						}
   					}
   					bb.flip().get(bytes);
   	}
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] belugabehr edited a comment on pull request #815: PARQUET-1776: NIO wrapper for Output/Input File

Posted by GitBox <gi...@apache.org>.
belugabehr edited a comment on pull request #815:
URL: https://github.com/apache/parquet-mr/pull/815#issuecomment-696754782


   @HunterL 
   
   OK, so I've been recently diving into Java NIO a bit and let me add some thoughts that relate to this task:
   
   Please change the naming convention and drop the "Local" prefix.  I think Nio is probably best.  As an example of why this matters:
   
   ```
   // This exists in the current PR
   return new LocalSeekableInputStream(Files.newInputStream(path));
   ```
   
   Well that statement `Files.newInputStream(path)` will provide a new InputStream, but the InputStream is totally driven by whatever the path points at.  Things within the JVM could be setup to allow the provided path to point at an FTP location for example and that would be a _remote_ read and not a _local_ read.
   
   
   I also think you will have better luck storing, internal to the Input/Output streams a `SeekableByteChannel` instead of an `OutputStream` or an `InputStream`.  This will allow you to easily track the position aspect of the underlying stream without having to do it yourself.  For example, the current `OutputStream` implementation does not currently implement this feature:
   
   ```
     @Override
     public long getPos() throws IOException {
       return 0;
     }
   ```
   
   Thanks.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] fommil commented on a change in pull request #815: PARQUET-1776: NIO wrapper for Output/Input File

Posted by GitBox <gi...@apache.org>.
fommil commented on a change in pull request #815:
URL: https://github.com/apache/parquet-mr/pull/815#discussion_r543682706



##########
File path: parquet-common/src/main/java/org/apache/parquet/io/LocalSeekableInputStream.java
##########
@@ -0,0 +1,47 @@
+/*
+ *  Licensed to the Apache Software Foundation (ASF) under one
+ *  or more contributor license agreements.  See the NOTICE file
+ *  distributed with this work for additional information
+ *  regarding copyright ownership.  The ASF licenses this file
+ *  to you under the Apache License, Version 2.0 (the
+ *  "License"); you may not use this file except in compliance
+ *  with the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ *  Unless required by applicable law or agreed to in writing,
+ *  software distributed under the License is distributed on an
+ *  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ *  KIND, either express or implied.  See the License for the
+ *  specific language governing permissions and limitations
+ *  under the License.
+ */
+
+package org.apache.parquet.io;
+
+import java.io.IOException;
+import java.io.InputStream;
+
+public class LocalSeekableInputStream extends DelegatingSeekableInputStream {
+  private long pos;
+
+  public LocalSeekableInputStream(InputStream stream) {
+    super(stream);
+    this.pos = 0;
+  }
+
+  @Override
+  public long getPos() throws IOException {
+    return pos;
+  }
+
+  @Override
+  public void seek(long newPos) throws IOException {

Review comment:
       I don't see how this can work, the `pos` needs to be updated anytime the stream is progressed. From my reading of the DelegatingSeekableInputStream it is designed to work only for InputStreams that already track their position and can therefore be consulted. e.g. in
   
   ```scala
   final class LocalInputFile(file: File) extends InputFile {
     def getLength() = file.length()
     def newStream(): SeekableInputStream = {
       val input = new FileInputStream(file)
       new DelegatingSeekableInputStream(input) {
         def getPos(): Long = input.getChannel().position()
         def seek(bs: Long): Unit = {
           val _ = input.getChannel().position(bs)
         }
       }
     }
   }
   ```
   
   from https://issues.apache.org/jira/browse/PARQUET-1953 which is relying on the underlying `FileInputStream` to keep track of its position in the stream.
   
   Also note that `seek` is not the same as `skip`. The former is an absolute position from the beginning of the stream, whereas `skip` is about a relative movement from the current position.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] belugabehr commented on pull request #815: PARQUET-1776: NIO wrapper for Output/Input File

Posted by GitBox <gi...@apache.org>.
belugabehr commented on pull request #815:
URL: https://github.com/apache/parquet-mr/pull/815#issuecomment-696755550


   Also requires unit tests.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] fommil commented on a change in pull request #815: PARQUET-1776: NIO wrapper for Output/Input File

Posted by GitBox <gi...@apache.org>.
fommil commented on a change in pull request #815:
URL: https://github.com/apache/parquet-mr/pull/815#discussion_r543682706



##########
File path: parquet-common/src/main/java/org/apache/parquet/io/LocalSeekableInputStream.java
##########
@@ -0,0 +1,47 @@
+/*
+ *  Licensed to the Apache Software Foundation (ASF) under one
+ *  or more contributor license agreements.  See the NOTICE file
+ *  distributed with this work for additional information
+ *  regarding copyright ownership.  The ASF licenses this file
+ *  to you under the Apache License, Version 2.0 (the
+ *  "License"); you may not use this file except in compliance
+ *  with the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ *  Unless required by applicable law or agreed to in writing,
+ *  software distributed under the License is distributed on an
+ *  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ *  KIND, either express or implied.  See the License for the
+ *  specific language governing permissions and limitations
+ *  under the License.
+ */
+
+package org.apache.parquet.io;
+
+import java.io.IOException;
+import java.io.InputStream;
+
+public class LocalSeekableInputStream extends DelegatingSeekableInputStream {
+  private long pos;
+
+  public LocalSeekableInputStream(InputStream stream) {
+    super(stream);
+    this.pos = 0;
+  }
+
+  @Override
+  public long getPos() throws IOException {
+    return pos;
+  }
+
+  @Override
+  public void seek(long newPos) throws IOException {

Review comment:
       I don't see how this can work, the `pos` needs to be updated anytime the stream is progressed. From my reading of the DelegatingSeekableInputStream it is designed to work only for InputStreams that already track their position and can therefore be consulted. e.g. in
   
   ```scala
   final class LocalInputFile(file: File) extends InputFile {
     def getLength() = file.length()
     def newStream(): SeekableInputStream = {
       val input = new FileInputStream(file)
       new DelegatingSeekableInputStream(input) {
         def getPos(): Long = input.getChannel().position()
         def seek(bs: Long): Unit = {
           val _ = input.getChannel().position(bs)
         }
       }
     }
   }
   ```
   
   from https://issues.apache.org/jira/browse/PARQUET-1953 which is relying on the underlying `FileInputStream` to keep track of its position in the stream.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] belugabehr commented on pull request #815: PARQUET-1776: NIO wrapper for Output/Input File

Posted by GitBox <gi...@apache.org>.
belugabehr commented on pull request #815:
URL: https://github.com/apache/parquet-mr/pull/815#issuecomment-696754782






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] belugabehr commented on pull request #815: PARQUET-1776: NIO wrapper for Output/Input File

Posted by GitBox <gi...@apache.org>.
belugabehr commented on pull request #815:
URL: https://github.com/apache/parquet-mr/pull/815#issuecomment-696754782


   @HunterL 
   
   OK, so I've been recently diving into Java NIO a bit and let me add some thoughts that relate to this task:
   
   Please change the naming convention and drop the "Local" prefix.  I think Nio is probably best.  As an example of why this matters:
   
   ```
   // This exists in the current PR
   return new LocalSeekableInputStream(Files.newInputStream(path));
   ```
   
   Well that statement `Files.newInputStream(path)` will provide a new InputStream, but the InputStream is totally driven by whatever the path points at.  Things within the JVM could be setup to all the provided path to point at an FTP location for example and that would be a _remote_ read and not a _local_ read.
   
   
   I also think you will have better luck storing, internal to the Input/Output streams a `SeekableByteChannel` instead of an `OutputStream` or an `InputStream`.  This will allow you to easily track the position aspect of the underlying stream without having to do it yourself.  For example, the current `OutputStream` implementation does not currently implement this feature:
   
   ```
     @Override
     public long getPos() throws IOException {
       return 0;
     }
   ```
   
   Thanks.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] belugabehr edited a comment on pull request #815: PARQUET-1776: NIO wrapper for Output/Input File

Posted by GitBox <gi...@apache.org>.
belugabehr edited a comment on pull request #815:
URL: https://github.com/apache/parquet-mr/pull/815#issuecomment-696754782


   @HunterL 
   
   OK, so I've been recently diving into Java NIO a bit and let me add some thoughts that relate to this task:
   
   Please change the naming convention and drop the "Local" prefix.  I think Nio is probably best.  As an example of why this matters:
   
   ```
   // This exists in the current PR
   return new LocalSeekableInputStream(Files.newInputStream(path));
   ```
   
   Well that statement `Files.newInputStream(path)` will provide a new InputStream, but the InputStream is totally driven by whatever the path points at.  Things within the JVM could be setup to allow the provided path to point at an FTP location for example and that would be a _remote_ read and not a _local_ read.
   
   
   I also think you will have better luck storing, internal to the Input/Output streams a `SeekableByteChannel` instead of an `OutputStream` or an `InputStream`.  This will allow you to easily track the position aspect of the underlying stream without having to do it yourself.  For example, the current `OutputStream` implementation does not currently implement this feature:
   
   ```
     @Override
     public long getPos() throws IOException {
       return 0;
     }
   ```
   
   Thanks.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org