You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2020/12/09 19:15:19 UTC

[GitHub] [beam] Romster opened a new pull request #13513: [Beam-11002] Fixes BufferOverflowException in XMLReader

Romster opened a new pull request #13513:
URL: https://github.com/apache/beam/pull/13513


   Fixing java.nio.BufferOverflowException in getFirstOccurenceOfRecordElement method of XMLReader
   
   ------------------------
   
   Post-Commit Tests Status (on master branch)
   ------------------------------------------------------------------------------------------------
   
   Lang | SDK | Dataflow | Flink | Samza | Spark | Twister2
   --- | --- | --- | --- | --- | --- | ---
   Go | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/) | --- | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/) | --- | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/) | ---
   Java | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_VR_Dataflow_V2/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_VR_Dataflow_V2/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Java11/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Java11/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/badge/icon)](https://ci-beam
 .apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink_Java11/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink_Java11/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/) | [![Build Status](https://ci-beam.a
 pache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_SparkStructuredStreaming/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_SparkStructuredStreaming/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Twister2/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Twister2/lastCompletedBuild/)
   Python | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Python36/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Python36/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Python37/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Python37/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Python38/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Python38/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Py_VR_Dataflow/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Py_VR_Dataflow/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Py_VR_Dataflow_V2/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Py_VR_Dataflow_V2/lastCompletedBuild/)<br>[![Build Status](https://ci-beam
 .apache.org/job/beam_PostCommit_Py_ValCont/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Py_ValCont/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Python_PVR_Flink_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Python_PVR_Flink_Cron/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Python_VR_Flink/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Python_VR_Flink/lastCompletedBuild/) | --- | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Python_VR_Spark/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Python_VR_Spark/lastCompletedBuild/) | ---
   XLang | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Direct/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Direct/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Dataflow/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Dataflow/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Flink/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Flink/lastCompletedBuild/) | --- | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Spark/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Spark/lastCompletedBuild/) | ---
   
   Pre-Commit Tests Status (on master branch)
   ------------------------------------------------------------------------------------------------
   
   --- |Java | Python | Go | Website | Whitespace | Typescript
   --- | --- | --- | --- | --- | --- | ---
   Non-portable | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Java_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Java_Cron/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Python_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Python_Cron/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_PythonLint_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_PythonLint_Cron/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_PythonDocker_Cron/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_PythonDocker_Cron/lastCompletedBuild/) <br>[![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_PythonDocs_Cron/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_PythonDocs_Cron/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/be
 am_PreCommit_Go_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Go_Cron/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Website_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Website_Cron/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Whitespace_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Whitespace_Cron/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Typescript_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Typescript_Cron/lastCompletedBuild/)
   Portable | --- | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Portable_Python_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Portable_Python_Cron/lastCompletedBuild/) | --- | --- | --- | ---
   
   See [.test-infra/jenkins/README](https://github.com/apache/beam/blob/master/.test-infra/jenkins/README.md) for trigger phrase, status and link of all Jenkins jobs.
   
   
   GitHub Actions Tests Status (on master branch)
   ------------------------------------------------------------------------------------------------
   [![Build python source distribution and wheels](https://github.com/apache/beam/workflows/Build%20python%20source%20distribution%20and%20wheels/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Build+python+source+distribution+and+wheels%22+branch%3Amaster+event%3Aschedule)
   [![Python tests](https://github.com/apache/beam/workflows/Python%20tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Python+Tests%22+branch%3Amaster+event%3Aschedule)
   [![Java tests](https://github.com/apache/beam/workflows/Java%20Tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Java+Tests%22+branch%3Amaster+event%3Aschedule)
   
   See [CI.md](https://github.com/apache/beam/blob/master/CI.md) for more information about GitHub Actions CI.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] Romster commented on a change in pull request #13513: [Beam-11002] Fixes BufferOverflowException in XMLReader

Posted by GitBox <gi...@apache.org>.
Romster commented on a change in pull request #13513:
URL: https://github.com/apache/beam/pull/13513#discussion_r544121675



##########
File path: sdks/java/io/xml/src/test/java/org/apache/beam/sdk/io/xml/XmlSourceTest.java
##########
@@ -873,6 +881,46 @@ public void testSplitAtFractionExhaustiveSingleByte() throws Exception {
     assertSplitAtFractionExhaustive(source, options);
   }
 
+  @Test
+  public void testNoBufferOverflowThrown() throws IOException {
+    // The magicNumber was found imperatively and will be different for different xml content.
+    // Test with the current setup causes BufferOverflow in
+    // XMLReader#getFirstOccurenceOfRecordElement method,
+    // if the specific corner case is not handled
+    final int magicNumber = 183;
+    StringBuilder sb = new StringBuilder();

Review comment:
       Here is a piece of XML we deal with in our service:
   from the root `ONIXMessage` we read records `Product`
   ```
   <?xml version='1.0' encoding='ISO-8859-1'?>
   <ONIXMessage xmlns="http://ns.editeur.org/onix/3.0/reference" release="3.0">
     <Header>
       <Sender>
         <SenderIdentifier>
           <SenderIDType></SenderIDType>
           <IDValue></IDValue>
         </SenderIdentifier>
         <SenderName></SenderName>
         <EmailAddress></EmailAddress>
       </Sender>
       <Addressee>
         <AddresseeName></AddresseeName>
       </Addressee>
       <MessageNumber></MessageNumber>
       <SentDateTime></SentDateTime>
     </Header>
     <Product>
       <RecordReference></RecordReference>
       <NotificationType></NotificationType>
       <RecordSourceType></RecordSourceType>
       <ProductIdentifier>
         <ProductIDType></ProductIDType>
         <IDValue></IDValue>
       </ProductIdentifier>
       <ProductIdentifier>
         <ProductIDType></ProductIDType>
         <IDValue></IDValue>
       </ProductIdentifier>
   ```
   as you can see there are inner tags like `ProductIdentifier` and `ProductIDType`.
   But we also have structore like
   ```
   <root>
   <head>
   </head>
   <record>
   </record>
   ...
   </root>
   ```
   and _in most cases_, it's not a problem
   We run it in Google Dataflow, and when it fails with BufferOverflow restarting the job helps - so the issue can't be easily reproduced.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] aaltay commented on a change in pull request #13513: [Beam-11002] Fixes BufferOverflowException in XMLReader

Posted by GitBox <gi...@apache.org>.
aaltay commented on a change in pull request #13513:
URL: https://github.com/apache/beam/pull/13513#discussion_r543818581



##########
File path: sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlSource.java
##########
@@ -281,7 +283,15 @@ private long getFirstOccurenceOfRecordElement(
               break outer;
             } else {
               // Matching was unsuccessful. Reset the buffer to include bytes read for the char.
-              ByteBuffer newbuf = ByteBuffer.allocate(BUF_SIZE);
+              int bytesToWrite = buf.remaining() + charBytes.length;

Review comment:
       Why is `bytesToWrite` not equal to `charBytes.length`?
   
   I believe we are trying to check that:
   
   `if (charBytes.length <= buf.remaining()` instead of `if (bytesToWrite > BUF_SIZE)` ?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] Romster commented on pull request #13513: [Beam-11002] Fixes BufferOverflowException in XMLReader

Posted by GitBox <gi...@apache.org>.
Romster commented on pull request #13513:
URL: https://github.com/apache/beam/pull/13513#issuecomment-742004956


   R:@jbonofre
   R:@lukecwik
   R:@chamikaramj
   R:@timrobertson100


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] chamikaramj commented on a change in pull request #13513: [Beam-11002] Fixes BufferOverflowException in XMLReader

Posted by GitBox <gi...@apache.org>.
chamikaramj commented on a change in pull request #13513:
URL: https://github.com/apache/beam/pull/13513#discussion_r543921078



##########
File path: sdks/java/io/xml/src/test/java/org/apache/beam/sdk/io/xml/XmlSourceTest.java
##########
@@ -873,6 +881,46 @@ public void testSplitAtFractionExhaustiveSingleByte() throws Exception {
     assertSplitAtFractionExhaustive(source, options);
   }
 
+  @Test
+  public void testNoBufferOverflowThrown() throws IOException {
+    // The magicNumber was found imperatively and will be different for different xml content.
+    // Test with the current setup causes BufferOverflow in
+    // XMLReader#getFirstOccurenceOfRecordElement method,
+    // if the specific corner case is not handled
+    final int magicNumber = 183;
+    StringBuilder sb = new StringBuilder();

Review comment:
       I think your input here does not conform to the format required by the Xml source that is defined here: https://github.com/apache/beam/blob/master/sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlIO.java#L68
   
   Specifically it has to be of the following format.
   ```
   <root>
   <record> ... </record>
   <record> ... </record>
   <record> ... </record>
   ...
   <record> ... </record>
   </root>
   ```
   But you have additional element: `<trainTags><trainTag></trainTag> ... <trainTag></trainTag></trainTags>`
   
   Were you able to reproduce the issue when the input conforms to the required format ?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] chamikaramj commented on a change in pull request #13513: [Beam-11002] Fixes BufferOverflowException in XMLReader

Posted by GitBox <gi...@apache.org>.
chamikaramj commented on a change in pull request #13513:
URL: https://github.com/apache/beam/pull/13513#discussion_r544791259



##########
File path: sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlSource.java
##########
@@ -281,7 +283,15 @@ private long getFirstOccurenceOfRecordElement(
               break outer;
             } else {
               // Matching was unsuccessful. Reset the buffer to include bytes read for the char.
-              ByteBuffer newbuf = ByteBuffer.allocate(BUF_SIZE);
+              int bytesToWrite = buf.remaining() + charBytes.length;
+              ByteBuffer newbuf;
+              if (bytesToWrite > BUF_SIZE) {
+                // Avoiding buffer overflow

Review comment:
       Probably expand comment to following.
   
   "The number of bytes to push to the buffer might be larger than BUF_SIZE due to additional 'charBytes'."

##########
File path: sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlSource.java
##########
@@ -320,7 +330,12 @@ private long getFirstOccurenceOfRecordElement(
             recordStartBytesMatched = true;
           }
         }
-        buf.clear();
+        if (bufSizeChanged) {
+          buf = ByteBuffer.allocate(BUF_SIZE);

Review comment:
       Add a comment.
   "We have to reset the size of the buffer to 'BUF_SIZE' to prevent it from infinitely increasing."




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] Romster commented on a change in pull request #13513: [Beam-11002] Fixes BufferOverflowException in XMLReader

Posted by GitBox <gi...@apache.org>.
Romster commented on a change in pull request #13513:
URL: https://github.com/apache/beam/pull/13513#discussion_r544087349



##########
File path: sdks/java/io/xml/src/test/java/org/apache/beam/sdk/io/xml/XmlSourceTest.java
##########
@@ -873,6 +881,46 @@ public void testSplitAtFractionExhaustiveSingleByte() throws Exception {
     assertSplitAtFractionExhaustive(source, options);
   }
 
+  @Test
+  public void testNoBufferOverflowThrown() throws IOException {
+    // The magicNumber was found imperatively and will be different for different xml content.
+    // Test with the current setup causes BufferOverflow in
+    // XMLReader#getFirstOccurenceOfRecordElement method,
+    // if the specific corner case is not handled
+    final int magicNumber = 183;
+    StringBuilder sb = new StringBuilder();

Review comment:
       This is an artificial example, but the main condition is that you have some amout of `<recordBlahBlah>` tags
   The real case looks like
   ```
   <root>
    <record> 
     <recordSomething>
     </recordSomething>
    </record>
    <record> 
     <recordSomething>
     </recordSomething>
    </record>
   ...
    <record> 
     <recordSomething>
     </recordSomething>
    </record>
   </root>
   ```
   The behaviour seems to be environment-dependent, so I'm not sure if even my example will be reproduced in another environment (it depends also on how many bytes we are reading  from the channel)

##########
File path: sdks/java/io/xml/src/test/java/org/apache/beam/sdk/io/xml/XmlSourceTest.java
##########
@@ -873,6 +881,46 @@ public void testSplitAtFractionExhaustiveSingleByte() throws Exception {
     assertSplitAtFractionExhaustive(source, options);
   }
 
+  @Test
+  public void testNoBufferOverflowThrown() throws IOException {
+    // The magicNumber was found imperatively and will be different for different xml content.
+    // Test with the current setup causes BufferOverflow in
+    // XMLReader#getFirstOccurenceOfRecordElement method,
+    // if the specific corner case is not handled
+    final int magicNumber = 183;
+    StringBuilder sb = new StringBuilder();

Review comment:
       This is an artificial example, but the main condition is that you have some amount of `<recordBlahBlah>` tags
   The real case looks like
   ```
   <root>
    <record> 
     <recordSomething>
     </recordSomething>
    </record>
    <record> 
     <recordSomething>
     </recordSomething>
    </record>
   ...
    <record> 
     <recordSomething>
     </recordSomething>
    </record>
   </root>
   ```
   The behaviour seems to be environment-dependent, so I'm not sure if even my example will be reproduced in another environment (it depends also on how many bytes we are reading  from the channel)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] Romster edited a comment on pull request #13513: [Beam-11002] Fixes BufferOverflowException in XMLReader

Posted by GitBox <gi...@apache.org>.
Romster edited a comment on pull request #13513:
URL: https://github.com/apache/beam/pull/13513#issuecomment-742004956


   R: @jbonofre
   R: @lukecwik
   R: @chamikaramj
   R: @timrobertson100


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] Romster commented on a change in pull request #13513: [Beam-11002] Fixes BufferOverflowException in XMLReader

Posted by GitBox <gi...@apache.org>.
Romster commented on a change in pull request #13513:
URL: https://github.com/apache/beam/pull/13513#discussion_r544083679



##########
File path: sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlSource.java
##########
@@ -281,7 +283,15 @@ private long getFirstOccurenceOfRecordElement(
               break outer;
             } else {
               // Matching was unsuccessful. Reset the buffer to include bytes read for the char.
-              ByteBuffer newbuf = ByteBuffer.allocate(BUF_SIZE);
+              int bytesToWrite = buf.remaining() + charBytes.length;

Review comment:
       charBytes is an array of 4 bytes, we've read from the buffer.
   this code caused the buffer overflow because that charBytes can be the input from the previous reading from the channel.
   ```
                 newbuf.put(charBytes);
                 offsetInFileOfCurrentByte -= charBytes.length;
                 while (buf.hasRemaining()) {
                   newbuf.put(buf.get());
                 }
   ```
   so when we read `BUF_SIZE` (or BUF_SIZE-n where n <charBytes.length) bytes from the channel to `buf` and go here, we have more byte to write to the newFuf than it's capacity. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] chamikaramj merged pull request #13513: [Beam-11002] Fixes BufferOverflowException in XMLReader

Posted by GitBox <gi...@apache.org>.
chamikaramj merged pull request #13513:
URL: https://github.com/apache/beam/pull/13513


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] Romster commented on a change in pull request #13513: [Beam-11002] Fixes BufferOverflowException in XMLReader

Posted by GitBox <gi...@apache.org>.
Romster commented on a change in pull request #13513:
URL: https://github.com/apache/beam/pull/13513#discussion_r544307401



##########
File path: sdks/java/io/xml/src/test/java/org/apache/beam/sdk/io/xml/XmlSourceTest.java
##########
@@ -873,6 +881,46 @@ public void testSplitAtFractionExhaustiveSingleByte() throws Exception {
     assertSplitAtFractionExhaustive(source, options);
   }
 
+  @Test
+  public void testNoBufferOverflowThrown() throws IOException {
+    // The magicNumber was found imperatively and will be different for different xml content.
+    // Test with the current setup causes BufferOverflow in
+    // XMLReader#getFirstOccurenceOfRecordElement method,
+    // if the specific corner case is not handled
+    final int magicNumber = 183;
+    StringBuilder sb = new StringBuilder();

Review comment:
       I've updated the test - now it uses input like
   ```
   <trains>
       <train>
           <trainTags>
               <trainTag>0</trainTag>
               <trainTag>1</trainTag>
               <trainTag>2</trainTag>
               ...
    ```
    So the format issue is not the case.
    
   To reproduce the error I had to use TestPipeline - so the input was split into bundles
   > INFO: Splitting filepattern /var/folders/j5/2qx0r7453tvd56zpjbstv6fw4m_zm3/T/junit946409719265813413/trainXMLWithTags into bundles of size 126 took 1 ms and produced 1 files and 20 bundles
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] aaltay commented on a change in pull request #13513: [Beam-11002] Fixes BufferOverflowException in XMLReader

Posted by GitBox <gi...@apache.org>.
aaltay commented on a change in pull request #13513:
URL: https://github.com/apache/beam/pull/13513#discussion_r544531965



##########
File path: sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlSource.java
##########
@@ -281,7 +283,15 @@ private long getFirstOccurenceOfRecordElement(
               break outer;
             } else {
               // Matching was unsuccessful. Reset the buffer to include bytes read for the char.
-              ByteBuffer newbuf = ByteBuffer.allocate(BUF_SIZE);
+              int bytesToWrite = buf.remaining() + charBytes.length;

Review comment:
       Got it, makes sense. I will resolve this. We can merge it once the other comment is also resolved.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] Romster commented on a change in pull request #13513: [Beam-11002] Fixes BufferOverflowException in XMLReader

Posted by GitBox <gi...@apache.org>.
Romster commented on a change in pull request #13513:
URL: https://github.com/apache/beam/pull/13513#discussion_r544083679



##########
File path: sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlSource.java
##########
@@ -281,7 +283,15 @@ private long getFirstOccurenceOfRecordElement(
               break outer;
             } else {
               // Matching was unsuccessful. Reset the buffer to include bytes read for the char.
-              ByteBuffer newbuf = ByteBuffer.allocate(BUF_SIZE);
+              int bytesToWrite = buf.remaining() + charBytes.length;

Review comment:
       charBytes is an array of 4 bytes, we've read from the buffer.
   this code caused the buffer overflow because that charBytes can be the input from the previous reading from the channel.
   ```
                 newbuf.put(charBytes);
                 offsetInFileOfCurrentByte -= charBytes.length;
                 while (buf.hasRemaining()) {
                   newbuf.put(buf.get());
                 }
   ```
   so when we do the next reading of `BUF_SIZE` (or BUF_SIZE-n where n <charBytes.length) bytes from the channel to `buf` and go here, we have more byte to write to the newBuf than it's capacity. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] chamikaramj commented on a change in pull request #13513: [Beam-11002] Fixes BufferOverflowException in XMLReader

Posted by GitBox <gi...@apache.org>.
chamikaramj commented on a change in pull request #13513:
URL: https://github.com/apache/beam/pull/13513#discussion_r543921078



##########
File path: sdks/java/io/xml/src/test/java/org/apache/beam/sdk/io/xml/XmlSourceTest.java
##########
@@ -873,6 +881,46 @@ public void testSplitAtFractionExhaustiveSingleByte() throws Exception {
     assertSplitAtFractionExhaustive(source, options);
   }
 
+  @Test
+  public void testNoBufferOverflowThrown() throws IOException {
+    // The magicNumber was found imperatively and will be different for different xml content.
+    // Test with the current setup causes BufferOverflow in
+    // XMLReader#getFirstOccurenceOfRecordElement method,
+    // if the specific corner case is not handled
+    final int magicNumber = 183;
+    StringBuilder sb = new StringBuilder();

Review comment:
       I think your input here does not conform to the format required by the Xml source that is defined here: https://github.com/apache/beam/blob/master/sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlIO.java#L68
   
   Specifically it has to be of the following format.
      * <root>
      * <record> ... </record>
      * <record> ... </record>
      * <record> ... </record>
      * ...
      * <record> ... </record>
      * </root>
   But you have additional element: <trainTags><trainTag></trainTag> ... <trainTag></trainTag></trainTags>
   
   Were you able to reproduce the issue when the input conforms to the required format ?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] Romster commented on a change in pull request #13513: [Beam-11002] Fixes BufferOverflowException in XMLReader

Posted by GitBox <gi...@apache.org>.
Romster commented on a change in pull request #13513:
URL: https://github.com/apache/beam/pull/13513#discussion_r544315668



##########
File path: sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlSource.java
##########
@@ -281,7 +283,15 @@ private long getFirstOccurenceOfRecordElement(
               break outer;
             } else {
               // Matching was unsuccessful. Reset the buffer to include bytes read for the char.
-              ByteBuffer newbuf = ByteBuffer.allocate(BUF_SIZE);
+              int bytesToWrite = buf.remaining() + charBytes.length;

Review comment:
       It seems to be a pretty rare corner case, so people usually don't face it.
   But our service deals with lots of big XML files with different formats and we see it several times a week.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] Romster commented on a change in pull request #13513: [Beam-11002] Fixes BufferOverflowException in XMLReader

Posted by GitBox <gi...@apache.org>.
Romster commented on a change in pull request #13513:
URL: https://github.com/apache/beam/pull/13513#discussion_r544083679



##########
File path: sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlSource.java
##########
@@ -281,7 +283,15 @@ private long getFirstOccurenceOfRecordElement(
               break outer;
             } else {
               // Matching was unsuccessful. Reset the buffer to include bytes read for the char.
-              ByteBuffer newbuf = ByteBuffer.allocate(BUF_SIZE);
+              int bytesToWrite = buf.remaining() + charBytes.length;

Review comment:
       charBytes is an array of 4 bytes, we've read from the buffer.
   this code caused the buffer overflow because that charBytes can be the input from the previous reading from the channel.
   ```
                 newbuf.put(charBytes);
                 offsetInFileOfCurrentByte -= charBytes.length;
                 while (buf.hasRemaining()) {
                   newbuf.put(buf.get());
                 }
   ```
   so when we read `BUF_SIZE` (or BUF_SIZE-n where n <charBytes.length) bytes from the channel to `buf` and go here, we have more byte to write to the newBuf than it's capacity. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] Romster commented on a change in pull request #13513: [Beam-11002] Fixes BufferOverflowException in XMLReader

Posted by GitBox <gi...@apache.org>.
Romster commented on a change in pull request #13513:
URL: https://github.com/apache/beam/pull/13513#discussion_r544121675



##########
File path: sdks/java/io/xml/src/test/java/org/apache/beam/sdk/io/xml/XmlSourceTest.java
##########
@@ -873,6 +881,46 @@ public void testSplitAtFractionExhaustiveSingleByte() throws Exception {
     assertSplitAtFractionExhaustive(source, options);
   }
 
+  @Test
+  public void testNoBufferOverflowThrown() throws IOException {
+    // The magicNumber was found imperatively and will be different for different xml content.
+    // Test with the current setup causes BufferOverflow in
+    // XMLReader#getFirstOccurenceOfRecordElement method,
+    // if the specific corner case is not handled
+    final int magicNumber = 183;
+    StringBuilder sb = new StringBuilder();

Review comment:
       Here is a piece of XML we deal with in our service:
   from the root `ONIXMessage` we read records `Product`
   ```
   <?xml version='1.0' encoding='ISO-8859-1'?>
   <ONIXMessage xmlns="http://ns.editeur.org/onix/3.0/reference" release="3.0">
     <Header>
       <Sender>
         <SenderIdentifier>
           <SenderIDType></SenderIDType>
           <IDValue></IDValue>
         </SenderIdentifier>
         <SenderName></SenderName>
         <EmailAddress></EmailAddress>
       </Sender>
       <Addressee>
         <AddresseeName></AddresseeName>
       </Addressee>
       <MessageNumber></MessageNumber>
       <SentDateTime></SentDateTime>
     </Header>
     <Product>
       <RecordReference></RecordReference>
       <NotificationType></NotificationType>
       <RecordSourceType></RecordSourceType>
       <ProductIdentifier>
         <ProductIDType></ProductIDType>
         <IDValue></IDValue>
       </ProductIdentifier>
       <ProductIdentifier>
         <ProductIDType></ProductIDType>
         <IDValue></IDValue>
       </ProductIdentifier>
   ```
   as you can see there are inner tags like `ProductIdentifier` and `ProductIDType`.
   But we also have a structure like
   ```
   <root>
   <head>
   </head>
   <record>
   </record>
   ...
   </root>
   ```
   and _in most cases_, it's not a problem
   We run it in Google Dataflow, and when it fails with BufferOverflow restarting the job helps - so the issue can't be easily reproduced.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] Romster commented on a change in pull request #13513: [Beam-11002] Fixes BufferOverflowException in XMLReader

Posted by GitBox <gi...@apache.org>.
Romster commented on a change in pull request #13513:
URL: https://github.com/apache/beam/pull/13513#discussion_r544087349



##########
File path: sdks/java/io/xml/src/test/java/org/apache/beam/sdk/io/xml/XmlSourceTest.java
##########
@@ -873,6 +881,46 @@ public void testSplitAtFractionExhaustiveSingleByte() throws Exception {
     assertSplitAtFractionExhaustive(source, options);
   }
 
+  @Test
+  public void testNoBufferOverflowThrown() throws IOException {
+    // The magicNumber was found imperatively and will be different for different xml content.
+    // Test with the current setup causes BufferOverflow in
+    // XMLReader#getFirstOccurenceOfRecordElement method,
+    // if the specific corner case is not handled
+    final int magicNumber = 183;
+    StringBuilder sb = new StringBuilder();

Review comment:
       This is an artificial example, but the main condition is that you have inner `<recordBlahBlah>` tags inside the `record`
   The real case looks like
   ```
   <root>
    <record> 
     <recordSomething>
     </recordSomething>
    </record>
    <record> 
     <recordSomething>
     </recordSomething>
    </record>
   ...
    <record> 
     <recordSomething>
     </recordSomething>
    </record>
   </root>
   ```
   The behaviour seems to be environment-dependent, so I'm not sure if even my example will be reproduced in another environment (it depends also on how many bytes we are reading  from the channel)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] Romster commented on a change in pull request #13513: [Beam-11002] Fixes BufferOverflowException in XMLReader

Posted by GitBox <gi...@apache.org>.
Romster commented on a change in pull request #13513:
URL: https://github.com/apache/beam/pull/13513#discussion_r544887898



##########
File path: sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlSource.java
##########
@@ -281,7 +283,15 @@ private long getFirstOccurenceOfRecordElement(
               break outer;
             } else {
               // Matching was unsuccessful. Reset the buffer to include bytes read for the char.
-              ByteBuffer newbuf = ByteBuffer.allocate(BUF_SIZE);
+              int bytesToWrite = buf.remaining() + charBytes.length;
+              ByteBuffer newbuf;
+              if (bytesToWrite > BUF_SIZE) {
+                // Avoiding buffer overflow

Review comment:
       done

##########
File path: sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlSource.java
##########
@@ -320,7 +330,12 @@ private long getFirstOccurenceOfRecordElement(
             recordStartBytesMatched = true;
           }
         }
-        buf.clear();
+        if (bufSizeChanged) {
+          buf = ByteBuffer.allocate(BUF_SIZE);

Review comment:
       done




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org