You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/09/13 18:32:34 UTC

[GitHub] [beam] steveniemitz opened a new pull request, #23214: use avro DataFileReader to read avro container files

steveniemitz opened a new pull request, #23214:
URL: https://github.com/apache/beam/pull/23214

   This fixes #23213
   
   This refactors AvroSource to use the standard DataFileReader from Avro.  I don't really have the context around why a custom file parser was written in beam for this, but the code is so old I assume that DataFileReader couldn't properly support reading from arbitrary positions in a container file when it was written.
   
   R: @iemejia  (you were the last person other than myself to substantially change this class)
   
   ------------------------
   
   Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
   
    - [x] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`).
    - [x] Mention the appropriate issue in your description (for example: `addresses #123`), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment `fixes #<ISSUE NUMBER>` instead.
    - [ ] Update `CHANGES.md` with noteworthy changes.
    - [x] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://beam.apache.org/contribute/get-started-contributing/#make-the-reviewers-job-easier).
   
   To check the build health, please visit [https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md](https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md)
   
   GitHub Actions Tests Status (on master branch)
   ------------------------------------------------------------------------------------------------
   [![Build python source distribution and wheels](https://github.com/apache/beam/workflows/Build%20python%20source%20distribution%20and%20wheels/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Build+python+source+distribution+and+wheels%22+branch%3Amaster+event%3Aschedule)
   [![Python tests](https://github.com/apache/beam/workflows/Python%20tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Python+Tests%22+branch%3Amaster+event%3Aschedule)
   [![Java tests](https://github.com/apache/beam/workflows/Java%20Tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Java+Tests%22+branch%3Amaster+event%3Aschedule)
   [![Go tests](https://github.com/apache/beam/workflows/Go%20tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Go+tests%22+branch%3Amaster+event%3Aschedule)
   
   See [CI.md](https://github.com/apache/beam/blob/master/CI.md) for more information about GitHub Actions CI.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] steveniemitz commented on pull request #23214: use avro DataFileReader to read avro container files

Posted by GitBox <gi...@apache.org>.
steveniemitz commented on PR #23214:
URL: https://github.com/apache/beam/pull/23214#issuecomment-1250976390

   ping @iemejia 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] github-actions[bot] commented on pull request #23214: use avro DataFileReader to read avro container files

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #23214:
URL: https://github.com/apache/beam/pull/23214#issuecomment-1245942411

   Assigning reviewers. If you would like to opt out of this review, comment `assign to next reviewer`:
   
   R: @lukecwik for label java.
   
   Available commands:
   - `stop reviewer notifications` - opt out of the automated review tooling
   - `remind me after tests pass` - tag the comment author after tests pass
   - `waiting on author` - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)
   
   The PR bot will only process comments in the main thread (not review comments).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] iemejia merged pull request #23214: use avro DataFileReader to read avro container files

Posted by GitBox <gi...@apache.org>.
iemejia merged PR #23214:
URL: https://github.com/apache/beam/pull/23214


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] steveniemitz commented on pull request #23214: use avro DataFileReader to read avro container files

Posted by GitBox <gi...@apache.org>.
steveniemitz commented on PR #23214:
URL: https://github.com/apache/beam/pull/23214#issuecomment-1245905839

   Run Kotlin_Examples PreCommit


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] iemejia commented on a diff in pull request #23214: use avro DataFileReader to read avro container files

Posted by GitBox <gi...@apache.org>.
iemejia commented on code in PR #23214:
URL: https://github.com/apache/beam/pull/23214#discussion_r977872542


##########
sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroSource.java:
##########
@@ -561,77 +553,24 @@ private Object readResolve() throws ObjectStreamException {
    */
   @Experimental(Kind.SOURCE_SINK)
   static class AvroBlock<T> extends Block<T> {
-    private final Mode<T> mode;
-
-    // The number of records in the block.
-    private final long numRecords;
 
     // The current record in the block. Initialized in readNextRecord.
     private @Nullable T currentRecord;
 
     // The index of the current record in the block.
     private long currentRecordIndex = 0;
 
-    // A DatumReader to read records from the block.
-    private final DatumReader<?> reader;
+    private final Iterator<?> iterator;
 
-    // A BinaryDecoder used by the reader to decode records.
-    private final BinaryDecoder decoder;
+    private final SerializableFunction<GenericRecord, T> parseFn;
 
-    /**
-     * Decodes a byte array as an InputStream. The byte array may be compressed using some codec.
-     * Reads from the returned stream will result in decompressed bytes.
-     *
-     * <p>This supports the same codecs as Avro's {@link CodecFactory}, namely those defined in
-     * {@link DataFileConstants}.
-     *
-     * <ul>
-     *   <li>"snappy" : Google's Snappy compression
-     *   <li>"deflate" : deflate compression
-     *   <li>"bzip2" : Bzip2 compression
-     *   <li>"xz" : xz compression
-     *   <li>"null" (the string, not the value): Uncompressed data
-     * </ul>
-     */
-    private static InputStream decodeAsInputStream(byte[] data, String codec) throws IOException {
-      ByteArrayInputStream byteStream = new ByteArrayInputStream(data);
-      switch (codec) {
-        case DataFileConstants.SNAPPY_CODEC:
-          return new SnappyCompressorInputStream(byteStream, 1 << 16 /* Avro uses 64KB blocks */);
-        case DataFileConstants.DEFLATE_CODEC:
-          // nowrap == true: Do not expect ZLIB header or checksum, as Avro does not write them.
-          Inflater inflater = new Inflater(true);
-          return new InflaterInputStream(byteStream, inflater);
-        case DataFileConstants.XZ_CODEC:
-          return new XZCompressorInputStream(byteStream);
-        case DataFileConstants.BZIP2_CODEC:
-          return new BZip2CompressorInputStream(byteStream);
-        case DataFileConstants.NULL_CODEC:
-          return byteStream;
-        default:
-          throw new IllegalArgumentException("Unsupported codec: " + codec);
-      }
-    }
+    private final long numRecordsInBlock;

Review Comment:
   Much better name 👍 



##########
sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroSource.java:
##########
@@ -717,71 +678,25 @@ public synchronized AvroSource<T> getCurrentSource() {
     //
     // Postcondition: same as above, but for the new current (formerly next) block.
     @Override
-    public boolean readNextBlock() throws IOException {
-      long startOfNextBlock;
-      synchronized (progressLock) {

Review Comment:
   👍 



##########
sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroSource.java:
##########
@@ -669,13 +608,49 @@ public double getFractionOfBlockConsumed() {
    */
   @Experimental(Kind.SOURCE_SINK)
   public static class AvroReader<T> extends BlockBasedReader<T> {
-    // Initialized in startReading.
-    private @Nullable AvroMetadata metadata;
+
+    private static class SeekableChannelInput implements SeekableInput {

Review Comment:
   A pity this one does not exist in Avro



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] steveniemitz commented on pull request #23214: use avro DataFileReader to read avro container files

Posted by GitBox <gi...@apache.org>.
steveniemitz commented on PR #23214:
URL: https://github.com/apache/beam/pull/23214#issuecomment-1245905672

   Run Java_PVR_Flink_Docker PreCommit


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] github-actions[bot] commented on pull request #23214: use avro DataFileReader to read avro container files

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #23214:
URL: https://github.com/apache/beam/pull/23214#issuecomment-1245930993

   Assigning reviewers. If you would like to opt out of this review, comment `assign to next reviewer`:
   
   R: @lukecwik for label java.
   
   Available commands:
   - `stop reviewer notifications` - opt out of the automated review tooling
   - `remind me after tests pass` - tag the comment author after tests pass
   - `waiting on author` - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)
   
   The PR bot will only process comments in the main thread (not review comments).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] github-actions[bot] commented on pull request #23214: use avro DataFileReader to read avro container files

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #23214:
URL: https://github.com/apache/beam/pull/23214#issuecomment-1245869953

   Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment `assign set of reviewers`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org