Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/07/22 09:57:51 UTC

[GitHub] [arrow] lz19970205 opened a new issue #10776: Capacity error: array cannot contain more than 2147483646 bytes, have 2147489180

lz19970205 opened a new issue #10776:
URL: https://github.com/apache/arrow/issues/10776


   When I tried to import data into ClickHouse, I encountered this error. My data is stored in an ORC file.
   I read the ClickHouse source code and noticed that it uses this library to read ORC files.
   I also read the Apache Arrow source code and suspected that the cause was a very large string in the data, so I removed some columns and tried the import again. The same error occurred.
   
   Because the data contains users' private information, I can't provide it.
   Does anyone know the possible reason? Or should I post this issue under the ClickHouse project?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lz19970205 closed issue #10776: Capacity error: array cannot contain more than 2147483646 bytes, have 2147489180

Posted by GitBox <gi...@apache.org>.
lz19970205 closed issue #10776:
URL: https://github.com/apache/arrow/issues/10776


   





[GitHub] [arrow] westonpace commented on issue #10776: Capacity error: array cannot contain more than 2147483646 bytes, have 2147489180

Posted by GitBox <gi...@apache.org>.
westonpace commented on issue #10776:
URL: https://github.com/apache/arrow/issues/10776#issuecomment-885365761


   > So the library will convert the data of one column to an array and process it?
   In Arrow a column is always an array.
   
   >  In this case, this error may occur on any column and not just the string column, right?
   For this particular error the column's data type would have to be string, variable sized list, or fixed size list I believe.





[GitHub] [arrow] westonpace commented on issue #10776: Capacity error: array cannot contain more than 2147483646 bytes, have 2147489180

Posted by GitBox <gi...@apache.org>.
westonpace commented on issue #10776:
URL: https://github.com/apache/arrow/issues/10776#issuecomment-885355493


   The length we are talking about here is the total length (in bytes) of a column.  So, for example, if you have a column with 1 million strings and each string has 36,000 characters, that would be a total length of 36 billion bytes, which would be too large.
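
As a rough check of the arithmetic above, the following Python sketch compares a column's total byte count against the limit printed in the error message of this issue (2147483646). The helper names are hypothetical and not part of Arrow, and the sketch assumes one byte per character.

```python
# Limit as printed in the error message of this issue. Treating it as the
# per-array byte limit for string data is an assumption of this sketch.
LIMIT_BYTES = 2147483646


def total_column_bytes(num_strings, bytes_per_string):
    """Total bytes of string data if every string has the same byte length."""
    return num_strings * bytes_per_string


def min_arrays_needed(total_bytes, limit=LIMIT_BYTES):
    """Lower bound on the number of arrays needed so that none exceeds `limit`."""
    return -(-total_bytes // limit)  # ceiling division


# The example from the comment: 1 million strings of 36,000 bytes each.
example_total = total_column_bytes(1_000_000, 36_000)
print(example_total)                     # total bytes in the column
print(example_total > LIMIT_BYTES)       # whether one array would be too small
print(min_arrays_needed(example_total))  # lower bound on arrays required

# The size reported in the error message itself.
print(min_arrays_needed(2147489180))
```

The point is that no single string needs to be large: many moderately sized strings in one column are enough to exceed the limit.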





[GitHub] [arrow] westonpace edited a comment on issue #10776: Capacity error: array cannot contain more than 2147483646 bytes, have 2147489180

Posted by GitBox <gi...@apache.org>.
westonpace edited a comment on issue #10776:
URL: https://github.com/apache/arrow/issues/10776#issuecomment-885365761


   > So the library will convert the data of one column to an array and process it?
   
   In Arrow a column always contains an array (or a chunked array).
   
   >  In this case, this error may occur on any column and not just the string column, right?
   
   For this particular error the column's data type would have to be string, variable sized list, or fixed size list I believe.





[GitHub] [arrow] lz19970205 commented on issue #10776: Capacity error: array cannot contain more than 2147483646 bytes, have 2147489180

Posted by GitBox <gi...@apache.org>.
lz19970205 commented on issue #10776:
URL: https://github.com/apache/arrow/issues/10776#issuecomment-885357489


   In this case, this error may occur on any column and not just the string column, right?





[GitHub] [arrow] westonpace commented on issue #10776: Capacity error: array cannot contain more than 2147483646 bytes, have 2147489180

Posted by GitBox <gi...@apache.org>.
westonpace commented on issue #10776:
URL: https://github.com/apache/arrow/issues/10776#issuecomment-885372377


   Correct.  A single array (regardless of type) may not contain more than 2^31 - 1 elements.  String & List arrays cannot contain more than 2^31 - 1 bytes (these types are more restrictive).
   
   In Arrow, this limitation is usually fixed by using multiple arrays in a single chunked array or multiple record batches in a single table.
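
The chunking idea can be sketched in plain Python, without Arrow itself: group the strings so that each group stays under a byte budget. The function name and the greedy strategy are assumptions chosen for illustration, not the way Arrow or ClickHouse implement it.

```python
def split_into_chunks(strings, max_bytes):
    """Greedily group strings so each group's UTF-8 size is at most max_bytes."""
    chunks = []
    current = []
    current_bytes = 0
    for s in strings:
        size = len(s.encode("utf-8"))
        if size > max_bytes:
            # A single value larger than the budget cannot fit in any chunk.
            raise ValueError("one string alone exceeds the byte budget")
        if current and current_bytes + size > max_bytes:
            # Close the current chunk and start a new one.
            chunks.append(current)
            current = []
            current_bytes = 0
        current.append(s)
        current_bytes += size
    if current:
        chunks.append(current)
    return chunks


# Tiny budget so the behaviour is visible; a real budget would be near 2**31 - 1.
demo_chunks = split_into_chunks(["aaaa", "bbbb", "cc", "dddd"], max_bytes=8)
print(demo_chunks)
```

With pyarrow, each group could then become one array inside a chunked array (for example via `pyarrow.chunked_array`), so that no single array exceeds the limit.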





[GitHub] [arrow] westonpace edited a comment on issue #10776: Capacity error: array cannot contain more than 2147483646 bytes, have 2147489180

Posted by GitBox <gi...@apache.org>.
westonpace edited a comment on issue #10776:
URL: https://github.com/apache/arrow/issues/10776#issuecomment-885355493


   The length we are talking about here is the total length (in bytes) of a column.  So, for example, if you have a column with 1 million strings and each string has 36,000 characters, that would be a total length of 36 billion bytes, which would be too large.





[GitHub] [arrow] lz19970205 edited a comment on issue #10776: Capacity error: array cannot contain more than 2147483646 bytes, have 2147489180

Posted by GitBox <gi...@apache.org>.
lz19970205 edited a comment on issue #10776:
URL: https://github.com/apache/arrow/issues/10776#issuecomment-885368714


   One more question.
   What are the restrictions if the column type is int or another basic type (float, double...)?
   If a column always contains an array to indicate the offset, I think it probably cannot contain more than 2^31 - 1 elements?
   In other words, the data cannot contain more than 2^31 - 1 rows?





[GitHub] [arrow] lz19970205 commented on issue #10776: Capacity error: array cannot contain more than 2147483646 bytes, have 2147489180

Posted by GitBox <gi...@apache.org>.
lz19970205 commented on issue #10776:
URL: https://github.com/apache/arrow/issues/10776#issuecomment-885354992


   > List arrays and string arrays cannot have more than 2GB. This is because they are represented as two arrays. A values array and an offsets array.
   > 
   > ```
   >         0  1  2  3  4  5  6  7  8  9  10 11 12 13       
   > Values: s  t  r  i  n  g  1  s  t  r  i  n  g  2
   > Offsets: 0, 7, 14
   > ```
   > 
   > The offsets point to the beginning (and end) of each string. Since the offsets array is int32 the maximum offset is 2GB and so the values array cannot have more than 2GB bytes of values.
   > 
   > Normally, when this limit is hit, a good workaround is to split your data into smaller record batches (you can still represent it as a single table) but it will depend on what you are trying to do.
   
   I see. So you mean there is a huge array in my data?
   But I have already converted all array types to string types and removed what I thought was the big string.
   I checked the maximum length of all string columns in my data and found that it is only 36000. This is far from reaching the 2GB limit.
   
   I will split the data into smaller pieces and try again.





[GitHub] [arrow] lz19970205 removed a comment on issue #10776: Capacity error: array cannot contain more than 2147483646 bytes, have 2147489180

Posted by GitBox <gi...@apache.org>.
lz19970205 removed a comment on issue #10776:
URL: https://github.com/apache/arrow/issues/10776#issuecomment-885349007


   > List arrays and string arrays cannot have more than 2GB. This is because they are represented as two arrays. A values array and an offsets array.
   > 
   > ```
   >         0  1  2  3  4  5  6  7  8  9  10 11 12 13       
   > Values: s  t  r  i  n  g  1  s  t  r  i  n  g  2
   > Offsets: 0, 7, 14
   > ```
   > 
   > The offsets point to the beginning (and end) of each string. Since the offsets array is int32 the maximum offset is 2GB and so the values array cannot have more than 2GB bytes of values.
   > 
   > Normally, when this limit is hit, a good workaround is to split your data into smaller record batches (you can still represent it as a single table) but it will depend on what you are trying to do.
   
   Thanks for the reply!
   So you mean there is a big array in my data?





[GitHub] [arrow] westonpace edited a comment on issue #10776: Capacity error: array cannot contain more than 2147483646 bytes, have 2147489180

Posted by GitBox <gi...@apache.org>.
westonpace edited a comment on issue #10776:
URL: https://github.com/apache/arrow/issues/10776#issuecomment-885365761


   > So the library will convert the data of one column to an array and process it?
   
   In Arrow a column is always an array.
   
   >  In this case, this error may occur on any column and not just the string column, right?
   
   For this particular error the column's data type would have to be string, variable sized list, or fixed size list I believe.





[GitHub] [arrow] westonpace edited a comment on issue #10776: Capacity error: array cannot contain more than 2147483646 bytes, have 2147489180

Posted by GitBox <gi...@apache.org>.
westonpace edited a comment on issue #10776:
URL: https://github.com/apache/arrow/issues/10776#issuecomment-885123242


   List arrays and string arrays cannot hold more than 2GB of data.  This is because they are represented as two arrays: a values array and an offsets array.
   
   ```
           0  1  2  3  4  5  6  7  8  9  10 11 12 13       
   Values: s  t  r  i  n  g  1  s  t  r  i  n  g  2
   Offsets: 0, 7, 14
   ```
   
   The offsets point to the beginning (and end) of each string.  Since the offsets array is int32, the maximum offset is 2^31 - 1 (about 2GB), and so the values array cannot hold more than about 2GB of values.
   
   Normally, when this limit is hit, a good workaround is to split your data into smaller record batches (you can still represent it as a single table) but it will depend on what you are trying to do.
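
The values/offsets layout described above can be imitated in plain Python. This is only a sketch of the idea, not Arrow's actual implementation; `encode_strings` and `get_string` are hypothetical helpers, and UTF-8 encoding is assumed.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit offset can hold


def encode_strings(strings):
    """Pack strings into one values buffer plus a list of offsets."""
    values = bytearray()
    offsets = [0]
    for s in strings:
        values.extend(s.encode("utf-8"))
        # Each new offset marks the end of this string and the start of the next.
        offsets.append(len(values))
    return bytes(values), offsets


def get_string(values, offsets, i):
    """Recover string i from the slice between two consecutive offsets."""
    return values[offsets[i]:offsets[i + 1]].decode("utf-8")


values, offsets = encode_strings(["string1", "string2"])
print(values)                          # the concatenated bytes
print(offsets)                         # one more entry than there are strings
print(get_string(values, offsets, 1))  # second string

# The last offset equals the total byte count, so it is the one that must
# stay representable as int32.
print(offsets[-1] <= INT32_MAX)
```

Because the final offset is the total size of the values buffer, the 32-bit offset type caps the total bytes of the whole array rather than the size of any single string.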





[GitHub] [arrow] lz19970205 commented on issue #10776: Capacity error: array cannot contain more than 2147483646 bytes, have 2147489180

Posted by GitBox <gi...@apache.org>.
lz19970205 commented on issue #10776:
URL: https://github.com/apache/arrow/issues/10776#issuecomment-885372960


   Got it!
   Thanks a lot!





[GitHub] [arrow] westonpace commented on issue #10776: Capacity error: array cannot contain more than 2147483646 bytes, have 2147489180

Posted by GitBox <gi...@apache.org>.
westonpace commented on issue #10776:
URL: https://github.com/apache/arrow/issues/10776#issuecomment-885123242


   List arrays and string arrays cannot hold more than 2GB of data.  This is because they are represented as two arrays: a values array and an offsets array.
   
   ```
           0  1  2  3  4  5  6  7  8  9  10 11 12 13       
   Values: s  t  r  i  n  g  1  s  t  r  i  n  g  2
   Offsets: 0, 7, 14
   ```
   
   The offsets point to the beginning (and end) of each string.  Since the offsets array is int32, the maximum offset is 2^31 - 1 (about 2GB), and so the values array cannot hold more than about 2GB of values.
   
   Normally, when this limit is hit, a good workaround is to split your data into smaller record batches (you can still represent it as a single table) but it will depend on what you are trying to do.





[GitHub] [arrow] lz19970205 edited a comment on issue #10776: Capacity error: array cannot contain more than 2147483646 bytes, have 2147489180

Posted by GitBox <gi...@apache.org>.
lz19970205 edited a comment on issue #10776:
URL: https://github.com/apache/arrow/issues/10776#issuecomment-885357489


   So the library will convert the data of one column to an array and process it?
   In this case, this error may occur on any column and not just the string column, right?





[GitHub] [arrow] westonpace edited a comment on issue #10776: Capacity error: array cannot contain more than 2147483646 bytes, have 2147489180

Posted by GitBox <gi...@apache.org>.
westonpace edited a comment on issue #10776:
URL: https://github.com/apache/arrow/issues/10776#issuecomment-885372377


   Correct.  A single array (regardless of type) may not contain more than 2^31 - 1 elements.  String arrays cannot contain more than 2^31 - 1 bytes (this type is more restrictive).
   
   In Arrow, this limitation is usually fixed by using multiple arrays in a single chunked array or multiple record batches in a single table.





[GitHub] [arrow] lz19970205 commented on issue #10776: Capacity error: array cannot contain more than 2147483646 bytes, have 2147489180

Posted by GitBox <gi...@apache.org>.
lz19970205 commented on issue #10776:
URL: https://github.com/apache/arrow/issues/10776#issuecomment-885349007


   > List arrays and string arrays cannot have more than 2GB. This is because they are represented as two arrays. A values array and an offsets array.
   > 
   > ```
   >         0  1  2  3  4  5  6  7  8  9  10 11 12 13       
   > Values: s  t  r  i  n  g  1  s  t  r  i  n  g  2
   > Offsets: 0, 7, 14
   > ```
   > 
   > The offsets point to the beginning (and end) of each string. Since the offsets array is int32 the maximum offset is 2GB and so the values array cannot have more than 2GB bytes of values.
   > 
   > Normally, when this limit is hit, a good workaround is to split your data into smaller record batches (you can still represent it as a single table) but it will depend on what you are trying to do.
   
   Thanks for the reply!
   So you mean there is a big array in my data?





[GitHub] [arrow] lz19970205 commented on issue #10776: Capacity error: array cannot contain more than 2147483646 bytes, have 2147489180

Posted by GitBox <gi...@apache.org>.
lz19970205 commented on issue #10776:
URL: https://github.com/apache/arrow/issues/10776#issuecomment-885368714


   So what are the restrictions if the column type is int or another basic type (float, double...)?
   If a column always contains an array to indicate the offset, I think it probably cannot contain more than 2^31 - 1 elements?
   In other words, the data cannot contain more than 2^31 - 1 rows?

