Posted to codereview@trafodion.apache.org by selvaganesang <gi...@git.apache.org> on 2018/08/01 20:15:02 UTC

[GitHub] trafodion pull request #1674: [TRAFODION-3171] Refactor Hive sequence file r...

GitHub user selvaganesang opened a pull request:

    https://github.com/apache/trafodion/pull/1674

    [TRAFODION-3171] Refactor Hive sequence file reading to use the new implementation

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/selvaganesang/trafodion sequence_files

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/trafodion/pull/1674.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1674
    
----
commit 6165e6b1017521abed062c55931d4555b886387d
Author: selvaganesang <se...@...>
Date:   2018-08-01T20:08:46Z

    [TRAFODION-3171] Refactor Hive sequence file reading to use the new implementation

----


---

[GitHub] trafodion pull request #1674: [TRAFODION-3171] Refactor Hive sequence file r...

Posted by sureshsubbiah <gi...@git.apache.org>.
Github user sureshsubbiah commented on a diff in the pull request:

    https://github.com/apache/trafodion/pull/1674#discussion_r207440918
  
    --- Diff: core/sql/src/main/java/org/trafodion/sql/HDFSClient.java ---
    @@ -240,7 +256,48 @@ int compressedFileRead(int readLenRemain) throws IOException
              return totalReadLen; 
         } 
     
    -    native int copyToByteBuffer(ByteBuffer buf, int bufOffset, byte[] bufArray, int copyLen);
    +    /* Trafodion adds record delimiter '\n' while copying it
    +       to buffer backing up the ByteBuffer */
    +
    +    int sequenceFileRead(int readLenRemain) throws IOException 
    +    {
    +       boolean eof = false;
    +       byte[] byteArray;
    +       int readLen;
    +       int totalReadLen = 0;
    +       long tempPos;
    +       int lenRemain = readLenRemain;
    +
    +       while (!eof && lenRemain > 0) {
    +          tempPos = reader_.getPosition();
    +          try {
    +            eof = reader_.next(key_, value_);
    +          }
    +          catch (java.io.EOFException e)
    +          {
    +              eof = true;
    +              break;
    +          }
    +          byteArray = ((Text)value_).getBytes();
    +          readLen = ((Text)value_).getLength();
    +          if (readLen <= lenRemain) {
    +                            
    +              buf_.put(byteArray, 0, readLen);
    +              buf_.put(recDelimiter_);
    --- End diff --
    
    Should we be worried that the 1-byte delimiter will put us past the end of the buffer, as suggested in the comment at line 290? (when readLen == lenRemain)
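
    The concern can be reproduced with a standalone sketch (a hypothetical demo class, not part of the PR): when the record exactly fills the remaining space, the extra one-byte delimiter overflows the backing buffer.

```java
import java.nio.BufferOverflowException;
import java.nio.ByteBuffer;

public class DelimiterOverflowDemo {
    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(8);   // stand-in for buf_
        byte[] record = "12345678".getBytes();     // readLen == lenRemain == 8
        byte recDelimiter = '\n';

        buf.put(record, 0, record.length);         // fills the buffer exactly
        try {
            buf.put(recDelimiter);                 // the extra delimiter byte
            System.out.println("delimiter fit");
        } catch (BufferOverflowException e) {
            System.out.println("BufferOverflowException: delimiter did not fit");
        }
    }
}
```

    ByteBuffer.put() throws BufferOverflowException once position reaches limit, which is exactly the readLen == lenRemain case the comment describes.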


---

[GitHub] trafodion pull request #1674: [TRAFODION-3171] Refactor Hive sequence file r...

Posted by selvaganesang <gi...@git.apache.org>.
Github user selvaganesang commented on a diff in the pull request:

    https://github.com/apache/trafodion/pull/1674#discussion_r207459712
  
    --- Diff: core/sql/executor/ExHdfsScan.cpp ---
    @@ -571,6 +569,7 @@ ExWorkProcRetcode ExHdfsScanTcb::work()
                  hdfsScan_ = HdfsScan::newInstance((NAHeap *)getHeap(), hdfsScanBuf_, hdfsScanBufMaxSize_, 
                                 hdfsScanTdb().hdfsIoByteArraySizeInKB_, 
                                 &hdfsFileInfoListAsArray_, beginRangeNum_, numRanges_, hdfsScanTdb().rangeTailIOSize_, 
    +                            isSequenceFile(), hdfsScanTdb().recordDelimiter_, 
    --- End diff --
    
    There is no need to pass the record delimiter for the text format because it is a raw read of a text-formatted table; the data should already contain the record delimiter as specified in the Hive metadata.


---

[GitHub] trafodion pull request #1674: [TRAFODION-3171] Refactor Hive sequence file r...

Posted by sureshsubbiah <gi...@git.apache.org>.
Github user sureshsubbiah commented on a diff in the pull request:

    https://github.com/apache/trafodion/pull/1674#discussion_r207442360
  
    --- Diff: core/sql/executor/ExHdfsScan.cpp ---
    @@ -571,6 +569,7 @@ ExWorkProcRetcode ExHdfsScanTcb::work()
                  hdfsScan_ = HdfsScan::newInstance((NAHeap *)getHeap(), hdfsScanBuf_, hdfsScanBufMaxSize_, 
                                 hdfsScanTdb().hdfsIoByteArraySizeInKB_, 
                                 &hdfsFileInfoListAsArray_, beginRangeNum_, numRanges_, hdfsScanTdb().rangeTailIOSize_, 
    +                            isSequenceFile(), hdfsScanTdb().recordDelimiter_, 
    --- End diff --
    
    If we did not pass in recordDelimiter_ previously, do we know whether the previous code worked correctly for the text format when the record delimiter was something other than '\n'? This PR does not seem to change anything for text-format reads, so if there was an issue previously, it might still exist.
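
    To illustrate why a raw text-format read depends on the delimiter declared in the table metadata, here is a hypothetical standalone sketch (names are illustrative, not from the PR) that splits a byte range on an arbitrary record delimiter instead of a hard-coded '\n':

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class CustomDelimiterSplit {
    // Split a raw byte range into records on an arbitrary one-byte
    // delimiter, e.g. one taken from the Hive table metadata.
    static List<String> splitRecords(byte[] data, byte delimiter) {
        List<String> records = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == delimiter) {
                records.add(new String(data, start, i - start, StandardCharsets.UTF_8));
                start = i + 1;
            }
        }
        if (start < data.length)                   // trailing record without delimiter
            records.add(new String(data, start, data.length - start, StandardCharsets.UTF_8));
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "r1|r2|r3".getBytes(StandardCharsets.UTF_8);
        System.out.println(splitRecords(data, (byte) '|'));
    }
}
```

    A reader that assumes '\n' would return this whole range as a single record, which is the failure mode the comment is asking about.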


---

[GitHub] trafodion pull request #1674: [TRAFODION-3171] Refactor Hive sequence file r...

Posted by selvaganesang <gi...@git.apache.org>.
Github user selvaganesang commented on a diff in the pull request:

    https://github.com/apache/trafodion/pull/1674#discussion_r207459245
  
    --- Diff: core/sql/src/main/java/org/trafodion/sql/HDFSClient.java ---
    @@ -240,7 +256,48 @@ int compressedFileRead(int readLenRemain) throws IOException
              return totalReadLen; 
         } 
     
    -    native int copyToByteBuffer(ByteBuffer buf, int bufOffset, byte[] bufArray, int copyLen);
    +    /* Trafodion adds record delimiter '\n' while copying it
    +       to buffer backing up the ByteBuffer */
    +
    +    int sequenceFileRead(int readLenRemain) throws IOException 
    +    {
    +       boolean eof = false;
    +       byte[] byteArray;
    +       int readLen;
    +       int totalReadLen = 0;
    +       long tempPos;
    +       int lenRemain = readLenRemain;
    +
    +       while (!eof && lenRemain > 0) {
    +          tempPos = reader_.getPosition();
    +          try {
    +            eof = reader_.next(key_, value_);
    +          }
    +          catch (java.io.EOFException e)
    +          {
    +              eof = true;
    +              break;
    +          }
    +          byteArray = ((Text)value_).getBytes();
    +          readLen = ((Text)value_).getLength();
    +          if (readLen <= lenRemain) {
    +                            
    +              buf_.put(byteArray, 0, readLen);
    +              buf_.put(recDelimiter_);
    --- End diff --
    
    Yes, I need to consider the extra byte added for the delimiter. Good catch; will fix it.
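
    A minimal sketch of one possible fix (a hypothetical helper, not the actual patch): reserve one byte for the delimiter when testing the record length against the remaining space, so readLen == lenRemain no longer overflows.

```java
import java.nio.ByteBuffer;

public class SequenceFileReadFix {
    // Copy one record plus its delimiter only if both fit in the space
    // remaining; otherwise copy nothing so the caller can rewind the
    // reader (e.g. back to tempPos) and return what it has so far.
    static int copyRecord(ByteBuffer buf, byte[] record, int readLen,
                          int lenRemain, byte recDelimiter) {
        if (readLen + 1 <= lenRemain) {        // +1 accounts for the delimiter byte
            buf.put(record, 0, readLen);
            buf.put(recDelimiter);
            return readLen + 1;                // bytes consumed from lenRemain
        }
        return 0;                              // record + delimiter would not fit
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(8);
        byte[] record = "12345678".getBytes(); // readLen == lenRemain == 8
        int copied = copyRecord(buf, record, record.length, 8, (byte) '\n');
        System.out.println(copied);            // 0: the delimiter would overflow
    }
}
```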


---

[GitHub] trafodion pull request #1674: [TRAFODION-3171] Refactor Hive sequence file r...

Posted by sureshsubbiah <gi...@git.apache.org>.
Github user sureshsubbiah commented on a diff in the pull request:

    https://github.com/apache/trafodion/pull/1674#discussion_r207986948
  
    --- Diff: core/sql/executor/ExHdfsScan.cpp ---
    @@ -571,6 +569,7 @@ ExWorkProcRetcode ExHdfsScanTcb::work()
                  hdfsScan_ = HdfsScan::newInstance((NAHeap *)getHeap(), hdfsScanBuf_, hdfsScanBufMaxSize_, 
                                 hdfsScanTdb().hdfsIoByteArraySizeInKB_, 
                                 &hdfsFileInfoListAsArray_, beginRangeNum_, numRanges_, hdfsScanTdb().rangeTailIOSize_, 
    +                            isSequenceFile(), hdfsScanTdb().recordDelimiter_, 
    --- End diff --
    
    Thank you for explaining


---

[GitHub] trafodion pull request #1674: [TRAFODION-3171] Refactor Hive sequence file r...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/trafodion/pull/1674


---