Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/07/01 18:18:15 UTC

[GitHub] [arrow] nealrichardson opened a new pull request #7611: ARROW-3308: [R] Convert R character vector with data exceeding 2GB to Large type

nealrichardson opened a new pull request #7611:
URL: https://github.com/apache/arrow/pull/7611


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm closed pull request #7611: ARROW-3308: [R] Convert R character vector with data exceeding 2GB to Large type

Posted by GitBox <gi...@apache.org>.
wesm closed pull request #7611:
URL: https://github.com/apache/arrow/pull/7611


   





[GitHub] [arrow] wesm commented on a change in pull request #7611: ARROW-3308: [R] Convert R character vector with data exceeding 2GB to Large type

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #7611:
URL: https://github.com/apache/arrow/pull/7611#discussion_r452931573



##########
File path: r/src/array_from_vector.cpp
##########
@@ -1155,6 +1155,25 @@ std::shared_ptr<arrow::DataType> InferArrowTypeFromVector<REALSXP>(SEXP x) {
   return float64();
 }
 
+template <>
+std::shared_ptr<arrow::DataType> InferArrowTypeFromVector<STRSXP>(SEXP x) {
+  // See how big the character vector is
+  R_xlen_t n = XLENGTH(x);
+  int64_t size = 0;
+  for (R_xlen_t i = 0; i < n; i++) {
+    SEXP string_i = STRING_ELT(x, i);
+    if (string_i != NA_STRING) {
+      size += XLENGTH(Rf_mkCharCE(Rf_translateCharUTF8(string_i), CE_UTF8));
+    }
+    if (size > 2147483646) {

Review comment:
       We should use `arrow::kBinaryMemoryLimit` here, will update
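
A minimal sketch of what that substitution might look like, assuming the Arrow constant has the value `std::numeric_limits<int32_t>::max() - 1` (the same 2147483646 as the literal in the diff). The local `kBinaryMemoryLimit` below only mirrors that assumed value so the snippet builds without the Arrow headers, and `ExceedsBinaryLimit` is a hypothetical helper, not something in the PR.

```cpp
#include <cstdint>
#include <limits>

// Stand-in for arrow::kBinaryMemoryLimit (assumed value: INT32_MAX - 1),
// mirrored locally so the sketch compiles without Arrow.
constexpr int64_t kBinaryMemoryLimit = std::numeric_limits<int32_t>::max() - 1;

// Hypothetical helper: true once the accumulated UTF-8 byte count no longer
// fits in an array that uses 32-bit offsets.
constexpr bool ExceedsBinaryLimit(int64_t total_bytes) {
  return total_bytes > kBinaryMemoryLimit;
}

// The named constant and the magic number in the diff agree.
static_assert(kBinaryMemoryLimit == 2147483646, "same threshold as the literal");
```

In the loop from the diff, the comparison `size > 2147483646` would then read `size > arrow::kBinaryMemoryLimit`.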







[GitHub] [arrow] pitrou commented on a change in pull request #7611: ARROW-3308: [R] Convert R character vector with data exceeding 2GB to Large type

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7611:
URL: https://github.com/apache/arrow/pull/7611#discussion_r449081453



##########
File path: r/src/array_from_vector.cpp
##########
@@ -1150,6 +1150,25 @@ std::shared_ptr<arrow::DataType> InferArrowTypeFromVector<REALSXP>(SEXP x) {
   return float64();
 }
 
+template <>
+std::shared_ptr<arrow::DataType> InferArrowTypeFromVector<STRSXP>(SEXP x) {
+  // See how big the character vector is
+  R_xlen_t n = XLENGTH(x);
+  int64_t size = 0;
+  for (R_xlen_t i = 0; i < n; i++) {
+    SEXP string_i = STRING_ELT(x, i);
+    if (string_i != NA_STRING) {
+      size += XLENGTH(Rf_mkCharCE(Rf_translateCharUTF8(string_i), CE_UTF8));
+    }
+    if (size > 2147483646) {
+      // Exceeds 2GB capacity of utf8 type, so use large
+      return large_utf8();
+    }
+  }
+
+  return utf8();
+}

Review comment:
       Yes, ideally we'd have an `AdaptiveStringBuilder`-like class.








[GitHub] [arrow] jorisvandenbossche commented on pull request #7611: ARROW-3308: [R] Convert R character vector with data exceeding 2GB to Large type

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #7611:
URL: https://github.com/apache/arrow/pull/7611#issuecomment-652828462


   In the Python test suite, we have a "large_memory" mark, and tests marked as such are not run on CI. (C++ seems to do something similar.)





[GitHub] [arrow] wesm commented on a change in pull request #7611: ARROW-3308: [R] Convert R character vector with data exceeding 2GB to Large type

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #7611:
URL: https://github.com/apache/arrow/pull/7611#discussion_r449088163



##########
File path: r/src/array_from_vector.cpp
##########
@@ -1150,6 +1150,25 @@ std::shared_ptr<arrow::DataType> InferArrowTypeFromVector<REALSXP>(SEXP x) {
   return float64();
 }
 
+template <>
+std::shared_ptr<arrow::DataType> InferArrowTypeFromVector<STRSXP>(SEXP x) {
+  // See how big the character vector is
+  R_xlen_t n = XLENGTH(x);
+  int64_t size = 0;
+  for (R_xlen_t i = 0; i < n; i++) {
+    SEXP string_i = STRING_ELT(x, i);
+    if (string_i != NA_STRING) {
+      size += XLENGTH(Rf_mkCharCE(Rf_translateCharUTF8(string_i), CE_UTF8));
+    }
+    if (size > 2147483646) {
+      // Exceeds 2GB capacity of utf8 type, so use large
+      return large_utf8();
+    }
+  }
+
+  return utf8();
+}

Review comment:
       Got it, thanks. Let me hack on this a little bit and see what I can do







[GitHub] [arrow] wesm commented on pull request #7611: ARROW-3308: [R] Convert R character vector with data exceeding 2GB to Large type

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7611:
URL: https://github.com/apache/arrow/pull/7611#issuecomment-653024972


   I've been asking breathlessly for nightly large memory tests to be set up (18 months and counting), so I'm wishing out loud for it again here
   
   https://issues.apache.org/jira/browse/ARROW-4046





[GitHub] [arrow] nealrichardson commented on a change in pull request #7611: ARROW-3308: [R] Convert R character vector with data exceeding 2GB to Large type

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #7611:
URL: https://github.com/apache/arrow/pull/7611#discussion_r449083406



##########
File path: r/src/array_from_vector.cpp
##########
@@ -1150,6 +1150,25 @@ std::shared_ptr<arrow::DataType> InferArrowTypeFromVector<REALSXP>(SEXP x) {
   return float64();
 }
 
+template <>
+std::shared_ptr<arrow::DataType> InferArrowTypeFromVector<STRSXP>(SEXP x) {
+  // See how big the character vector is
+  R_xlen_t n = XLENGTH(x);
+  int64_t size = 0;
+  for (R_xlen_t i = 0; i < n; i++) {
+    SEXP string_i = STRING_ELT(x, i);
+    if (string_i != NA_STRING) {
+      size += XLENGTH(Rf_mkCharCE(Rf_translateCharUTF8(string_i), CE_UTF8));
+    }
+    if (size > 2147483646) {
+      // Exceeds 2GB capacity of utf8 type, so use large
+      return large_utf8();
+    }
+  }
+
+  return utf8();
+}

Review comment:
       Here's the definition of `translateCharUTF8`: https://github.com/wch/r-source/blob/122dcf452ec5eacdd66e165457985bded1af4fee/src/main/sysutils.c#L1085
   
   If the string is already UTF-8 or ASCII, it does nothing. So FWIW the performance hit would be exceedingly rare except on Windows. That said, it would be better if we could ingest a character vector without iterating through it and calling `iconv` 3 times.







[GitHub] [arrow] nealrichardson commented on pull request #7611: ARROW-3308: [R] Convert R character vector with data exceeding 2GB to Large type

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #7611:
URL: https://github.com/apache/arrow/pull/7611#issuecomment-652596005


   The failed build is an OOM. Any recommendations for testing this? I could disable the test on CI and maybe that's fine since this code shouldn't be changing much, but skipping tests is a slippery slope. Or is there a preferred way to parametrize https://github.com/apache/arrow/pull/7611/files#diff-3577cb758e28aed76f3336d869911598R1163 so that we could set a lower threshold for testing (and thereby trigger this without requiring GBs of memory)? 
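
One way the parametrization asked about here could look, as a rough sketch only: the threshold becomes an argument with a default, so a test can pass a tiny limit instead of allocating gigabytes. Everything named below (`StringType`, `kDefaultLimit`, `InferStringType`) is hypothetical and stands in for the R and Arrow types in the real function; `std::nullopt` plays the role of `NA_STRING`.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for the Arrow types the real function returns.
enum class StringType { kUtf8, kLargeUtf8 };

// Assumed default: the same 2147483646 used in the diff.
constexpr int64_t kDefaultLimit = std::numeric_limits<int32_t>::max() - 1;

// Sums the byte lengths of the non-missing strings and picks the large type
// as soon as the running total passes `limit`.
StringType InferStringType(const std::vector<std::optional<std::string>>& values,
                           int64_t limit = kDefaultLimit) {
  assert(limit >= 0);
  int64_t size = 0;
  for (const auto& value : values) {
    if (value.has_value()) {
      size += static_cast<int64_t>(value->size());
    }
    if (size > limit) {
      // Exceeds the capacity allowed for the 32-bit-offset type.
      return StringType::kLargeUtf8;
    }
  }
  return StringType::kUtf8;
}
```

A test could then call `InferStringType(values, 4)` with a handful of short strings to exercise the large branch without a large allocation. Whether the R bindings should expose such a parameter is a separate design question.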





[GitHub] [arrow] wesm commented on pull request #7611: ARROW-3308: [R] Convert R character vector with data exceeding 2GB to Large type

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7611:
URL: https://github.com/apache/arrow/pull/7611#issuecomment-656764101


   thanks @nealrichardson!





[GitHub] [arrow] wesm commented on a change in pull request #7611: ARROW-3308: [R] Convert R character vector with data exceeding 2GB to Large type

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #7611:
URL: https://github.com/apache/arrow/pull/7611#discussion_r449032259



##########
File path: r/src/array_from_vector.cpp
##########
@@ -1150,6 +1150,25 @@ std::shared_ptr<arrow::DataType> InferArrowTypeFromVector<REALSXP>(SEXP x) {
   return float64();
 }
 
+template <>
+std::shared_ptr<arrow::DataType> InferArrowTypeFromVector<STRSXP>(SEXP x) {
+  // See how big the character vector is
+  R_xlen_t n = XLENGTH(x);
+  int64_t size = 0;
+  for (R_xlen_t i = 0; i < n; i++) {
+    SEXP string_i = STRING_ELT(x, i);
+    if (string_i != NA_STRING) {
+      size += XLENGTH(Rf_mkCharCE(Rf_translateCharUTF8(string_i), CE_UTF8));
+    }
+    if (size > 2147483646) {
+      // Exceeds 2GB capacity of utf8 type, so use large
+      return large_utf8();
+    }
+  }
+
+  return utf8();
+}

Review comment:
       This is very concerning from a perf perspective -- particularly to have to use the UTF-8 functions more than once.
   
   I can spend some time working on this issue. I can do a few things:
   
   * Only use these (I'm guessing) expensive functions like `Rf_translateCharUTF8` if the string data is not UTF-8 (we have a fast ValidateUTF8 function that we can use). I'll need to check the perf of converting ASCII data with various approaches.
   * Use a single conversion path for utf8/large_utf8 and only switch from the 32-bit to the 64-bit path (i.e. from `TypedBufferBuilder<int32_t>` to `TypedBufferBuilder<int64_t>`) when hitting the memory limit.
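
The second bullet could be sketched roughly as follows. This is not Arrow code: `AdaptiveOffsets` is a hypothetical class that uses plain `std::vector` and `std::string` in place of `TypedBufferBuilder`, and the limit is a constructor argument purely so the switch can be shown with small inputs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>
#include <vector>

// Hypothetical sketch: collects string data with 32-bit offsets and widens
// them to 64-bit the first time the data grows past `limit` bytes.
class AdaptiveOffsets {
 public:
  explicit AdaptiveOffsets(int64_t limit) : limit_(limit) {
    assert(limit >= 0 && limit <= std::numeric_limits<int32_t>::max());
    offsets32_.push_back(0);
  }

  void Append(const std::string& value) {
    data_ += value;
    const int64_t end = static_cast<int64_t>(data_.size());
    if (!is_large_ && end > limit_) {
      // Promote once: copy the existing offsets into the 64-bit vector.
      offsets64_.assign(offsets32_.begin(), offsets32_.end());
      offsets32_.clear();
      is_large_ = true;
    }
    if (is_large_) {
      offsets64_.push_back(end);
    } else {
      offsets32_.push_back(static_cast<int32_t>(end));
    }
  }

  bool is_large() const { return is_large_; }

  // Number of strings appended so far.
  int64_t length() const {
    const std::size_t n = is_large_ ? offsets64_.size() : offsets32_.size();
    return static_cast<int64_t>(n) - 1;
  }

  int64_t offset(int64_t i) const {
    const std::size_t idx = static_cast<std::size_t>(i);
    return is_large_ ? offsets64_[idx] : static_cast<int64_t>(offsets32_[idx]);
  }

 private:
  int64_t limit_;
  bool is_large_ = false;
  std::string data_;
  std::vector<int32_t> offsets32_;
  std::vector<int64_t> offsets64_;
};
```

With this shape the input is walked once, and the cost of widening is paid only by inputs that actually cross the limit. The real conversion would also need to handle missing values and encoding, which the sketch leaves out.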







[GitHub] [arrow] github-actions[bot] commented on pull request #7611: ARROW-3308: [R] Convert R character vector with data exceeding 2GB to Large type

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7611:
URL: https://github.com/apache/arrow/pull/7611#issuecomment-652580102


   https://issues.apache.org/jira/browse/ARROW-3308

