You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@orc.apache.org by "Quanlong Huang (Jira)" <ji...@apache.org> on 2022/03/22 09:15:00 UTC
[jira] [Comment Edited] (ORC-1131) [C++] getMemoryUsage() is incorrect on string vector batches

    [ https://issues.apache.org/jira/browse/ORC-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510353#comment-17510353 ] 

Quanlong Huang edited comment on ORC-1131 at 3/22/22, 9:14 AM:
---------------------------------------------------------------

Direct encoded string columns are read by StringDirectColumnReader, while dictionary encoded string columns are read by StringDictionaryColumnReader. RowReader::createRowBatch() creates an EncodedStringVectorBatch for a string column if the reader options has set EnableLazyDecoding to true.

There are actually 4 cases depending on the column encoding and the batch type (i.e. encoded or not). 
 * direct encoding read with StringVectorBatch
 ** blob should be counted.
 * direct encoding read with EncodedStringVectorBatch
 ** EncodedStringVectorBatch is used as a StringVectorBatch (isEncoded=false). Same as above.
 * dictionary encoding read with StringVectorBatch
 ** Strings reference the StringDictionary of the reader. blob is empty. I think we don't need to count the StringDictionary.
 * dictionary encoding read with EncodedStringVectorBatch
 ** StringDictionary shared with the reader. I'm not sure if we can skip it as well. The ideal solution would be counting it if and only if the dictionary is only held by the batch (e.g. when the reader is destroyed and the client is still using the batch). But this complicates the implementation. [~wgtmac],  [~Yurui Zhou] What do you think?


was (Author: stiga-huang):
Direct encoded string columns are read by StringDirectColumnReader, while dictionary encoded string columns are read by StringDictionaryColumnReader. RowReader::createRowBatch() creates an EncodedStringVectorBatch for a string column if the reader options has set EnableLazyDecoding to true.

There are actually 4 cases depending on the column encoding and the batch type (i.e. encoded or not). 
 * direct encoding read with StringVectorBatch
 ** blob should be counted.
 * direct encoding read with EncodedStringVectorBatch
 ** EncodedStringVectorBatch is used as a StringVectorBatch (isEncoded=false). Same as above.
 * dictionary encoding read with StringVectorBatch
 ** Strings reference the StringDictionary of the reader. blob is empty. I think we don't need to count the StringDictionary.
 * dictionary encoding read with EncodedStringVectorBatch
 ** StringDictionary shared with the reader. I'm not sure if we can skip it as well. The ideal solution would be counting it if and only if the dictionary is only held by the batch (e.g. when the reader is destroyed and the client is still using the batch). But this complicates the implementation. [~wgtmac],  [~Yurui Zhou] What do you think?

 

> [C++] getMemoryUsage() is incorrect on string vector batches 
> -------------------------------------------------------------
>
>                 Key: ORC-1131
>                 URL: https://issues.apache.org/jira/browse/ORC-1131
>             Project: ORC
>          Issue Type: Bug
>    Affects Versions: 1.6.0
>            Reporter: Quanlong Huang
>            Assignee: Quanlong Huang
>            Priority: Major
>
> The C++ client produces two kinds of string vector batches, i.e. StringVectorBatch and EncodedStringVectorBatch. They both have incorrect results in getMemoryUsage() currently.
> After ORC-501, we move the blob from StringColumnReader to StringVectorBatch. However, StringVectorBatch::getMemoryUsage() was not updated to count for it.
> {code:cpp}
> uint64_t StringVectorBatch::getMemoryUsage() {
>   return ColumnVectorBatch::getMemoryUsage()
>         + static_cast<uint64_t>(data.capacity() * sizeof(char*)
>         + length.capacity() * sizeof(int64_t));
> } {code}
> For EncodedStringVectorBatch, it inherits StringVectorBatch but doesn't override the getMemoryUsage() method. Thus counting for wrong results.
> {code:cpp}
> struct EncodedStringVectorBatch : public StringVectorBatch { 
>   EncodedStringVectorBatch(uint64_t capacity, MemoryPool& pool);
>   virtual ~EncodedStringVectorBatch();
>   std::string toString() const;
>   void resize(uint64_t capacity);
>   std::shared_ptr<StringDictionary> dictionary;
>   // index for dictionary entry
>   DataBuffer<int64_t> index;
> };{code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)