You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/07/06 15:35:23 UTC

[GitHub] [arrow] romainfrancois opened a new pull request #7645: ARROW-8374 [R]: Table to vector of DictonaryType will error when Arrays don't have the same Dictionary per array

romainfrancois opened a new pull request #7645:
URL: https://github.com/apache/arrow/pull/7645


   This needs some testing: 
   
   ``` r
   library(arrow, warn.conflicts = FALSE)
   
   f1 <- factor(c("a"), levels = c("a", "b"))
   f2 <- factor(c("c"), levels = c("c", "d"))
   
   ca <- ChunkedArray$create(f1, f2)
   ca$as_vector()
   #> [1] a c
   #> Levels: a b c d
   ca$type
   #> DictionaryType
   #> dictionary<values=string, indices=int8>
   
   tab <- Table$create(
     record_batch(f = f1), 
     record_batch(f = f2)
   )
   tab
   #> Table
   #> 2 rows x 1 columns
   #> $f <dictionary<values=string, indices=int8>>
   df <- as.data.frame(tab)
   df
   #> # A tibble: 2 x 1
   #>   f    
   #>   <fct>
   #> 1 a    
   #> 2 c
   df$f
   #> [1] a c
   #> Levels: a b c d
   ```
   
   <sup>Created on 2020-07-06 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0.9001)</sup>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] romainfrancois commented on a change in pull request #7645: ARROW-8374 [R]: Table to vector of DictonaryType will error when Arrays don't have the same Dictionary per array

Posted by GitBox <gi...@apache.org>.
romainfrancois commented on a change in pull request #7645:
URL: https://github.com/apache/arrow/pull/7645#discussion_r452727559



##########
File path: r/src/array_to_vector.cpp
##########
@@ -56,15 +56,16 @@ class Converter {
 
   // ingest the values from the array into data[ start : (start + n)]
   virtual Status Ingest_some_nulls(SEXP data, const std::shared_ptr<arrow::Array>& array,
-                                   R_xlen_t start, R_xlen_t n) const = 0;
+                                   R_xlen_t start, R_xlen_t n,
+                                   size_t array_index) const = 0;

Review comment:
       Thanks




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] romainfrancois commented on a change in pull request #7645: ARROW-8374 [R]: Table to vector of DictonaryType will error when Arrays don't have the same Dictionary per array

Posted by GitBox <gi...@apache.org>.
romainfrancois commented on a change in pull request #7645:
URL: https://github.com/apache/arrow/pull/7645#discussion_r452155774



##########
File path: r/src/array_to_vector.cpp
##########
@@ -180,7 +183,7 @@ class Converter_Date32 : public Converter_SimpleArray<REALSXP> {
   }
 
   Status Ingest_some_nulls(SEXP data, const std::shared_ptr<arrow::Array>& array,
-                           R_xlen_t start, R_xlen_t n) const {
+                           R_xlen_t start, R_xlen_t n, size_t array_index) const {

Review comment:
       Perhaps the dispatch could rather be done by `arrow::VisitTypeInline()` but I'd rather do this in a follow up




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on a change in pull request #7645: ARROW-8374 [R]: Table to vector of DictonaryType will error when Arrays don't have the same Dictionary per array

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #7645:
URL: https://github.com/apache/arrow/pull/7645#discussion_r452321250



##########
File path: r/src/array_to_vector.cpp
##########
@@ -56,15 +56,16 @@ class Converter {
 
   // ingest the values from the array into data[ start : (start + n)]
   virtual Status Ingest_some_nulls(SEXP data, const std::shared_ptr<arrow::Array>& array,
-                                   R_xlen_t start, R_xlen_t n) const = 0;
+                                   R_xlen_t start, R_xlen_t n,
+                                   size_t array_index) const = 0;

Review comment:
       Please rename this to `chunk_index` everywhere to make it clear what's being indexed

##########
File path: r/src/array_to_vector.cpp
##########
@@ -180,7 +183,7 @@ class Converter_Date32 : public Converter_SimpleArray<REALSXP> {
   }
 
   Status Ingest_some_nulls(SEXP data, const std::shared_ptr<arrow::Array>& array,
-                           R_xlen_t start, R_xlen_t n) const {
+                           R_xlen_t start, R_xlen_t n, size_t array_index) const {

Review comment:
       I think it's better to have a uniform interface. I would like a comment indicating that most implementations will ignore the chunk_index and that (IIUC) only the dictionary conversion path currently uses it.

##########
File path: r/src/array_to_vector.cpp
##########
@@ -413,18 +437,64 @@ class Converter_Dictionary : public Converter {
  private:
   template <typename Type>
   Status Ingest_some_nulls_Impl(SEXP data, const std::shared_ptr<arrow::Array>& array,
-                                R_xlen_t start, R_xlen_t n) const {
+                                R_xlen_t start, R_xlen_t n, size_t array_index) const {
     using value_type = typename arrow::TypeTraits<Type>::ArrayType::value_type;
 
-    std::shared_ptr<Array> indices =
-        static_cast<DictionaryArray*>(array.get())->indices();
+    const auto& dictionary_array = checked_cast<const DictionaryArray&>(*array);
+    auto indices = dictionary_array.indices();
+    auto values = indices->data()->GetValues<value_type>(1);
+
+    if (need_unification_) {
+      auto transpose =
+          reinterpret_cast<const int32_t*>(arrays_transpose_[array_index]->data());
+      auto transpose_convert = [=](value_type value) { return transpose[value] + 1; };
+
+      return SomeNull_Ingest<INTSXP, value_type>(data, start, n, values, indices,
+                                                 transpose_convert);
+    } else {
+      // convert the 0-based indices from the arrow Array
+      // to 1-based indices used in R factors
+      auto convert = [](value_type value) { return static_cast<int>(value) + 1; };
+
+      return SomeNull_Ingest<INTSXP, value_type>(data, start, n, values, indices,
+                                                 convert);
+    }
+  }

Review comment:
       The use of `values` here is confusing, since elsewhere that's used to refer to the elements of the array's dictionary.
   ```suggestion
       using index_type = typename arrow::TypeTraits<Type>::ArrayType::value_type;
       auto indices = checked_cast<const DictionaryArray&>(*array).indices();
       auto raw_indices = indices->data()->GetValues<index_type>(1);
   
       // convert the 0-based indices from the arrow Array
       // to 1-based indices used in R factors
       if (need_unification_) {
         // transpose the indices before converting
         auto transposed =
             reinterpret_cast<const int32_t*>(arrays_transpose_[array_index]->data());
         auto transpose_convert = [=](index_type i) { return transposed[i] + 1; };
   
         return SomeNull_Ingest<INTSXP>(data, start, n, raw_indices, indices,
                                                    transpose_convert);
       } else {
         auto convert = [](index_type i) { return static_cast<int>(i) + 1; };
   
         return SomeNull_Ingest<INTSXP>(data, start, n, raw_indices, indices,
                                                    convert);
       }
     }
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #7645: ARROW-8374 [R]: Table to vector of DictonaryType will error when Arrays don't have the same Dictionary per array

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7645:
URL: https://github.com/apache/arrow/pull/7645#issuecomment-654314571


   https://issues.apache.org/jira/browse/ARROW-8374


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] romainfrancois commented on pull request #7645: ARROW-8374 [R]: Table to vector of DictonaryType will error when Arrays don't have the same Dictionary per array

Posted by GitBox <gi...@apache.org>.
romainfrancois commented on pull request #7645:
URL: https://github.com/apache/arrow/pull/7645#issuecomment-654732149


   In other words, when creating a chunked array from a list of factors from R, should the dictionary be unified and shared across the arrays of the chunked array ?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] romainfrancois commented on pull request #7645: ARROW-8374 [R]: Table to vector of DictonaryType will error when Arrays don't have the same Dictionary per array

Posted by GitBox <gi...@apache.org>.
romainfrancois commented on pull request #7645:
URL: https://github.com/apache/arrow/pull/7645#issuecomment-654744787


   I think we can leave this for a follow up: 
   
   ```
       // R factor levels must be type "character" so coerce `dict` to STRSXP
       // TODO (npr): this coercion should be optional, "dictionariesAsFactors" ;)
       // Alternative: preserve the logical type of the dictionary values
       // (e.g. if dict is timestamp, return a POSIXt R vector, not factor)
   ```
   
   as this will require some additional `vctrs` effort to implement some sort of generic R representation for dictionaries that are not factors
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on pull request #7645: ARROW-8374 [R]: Table to vector of DictonaryType will error when Arrays don't have the same Dictionary per array

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #7645:
URL: https://github.com/apache/arrow/pull/7645#issuecomment-655008360


   Re: ChunkedArray print method, `git blame` says it was introduced in #5492. I would guess that I added a custom method so that the printing wouldn't explode off the screen if you have a big array. Or maybe so that the internal chunking details weren't exposed since that's not always helpful. Ok with me if you want to change it, but I wouldn't unify the dictionaries in the print method--if you're trying to show more about the internals of what's in the array, show what's actually there.
   
   Re: dictionariesAsFactors, that's ARROW-7657, fine to keep it out of scope here.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on pull request #7645: ARROW-8374 [R]: Table to vector of DictonaryType will error when Arrays don't have the same Dictionary per array

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #7645:
URL: https://github.com/apache/arrow/pull/7645#issuecomment-656727628


   Thanks!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] romainfrancois commented on pull request #7645: ARROW-8374 [R]: Table to vector of DictonaryType will error when Arrays don't have the same Dictionary per array

Posted by GitBox <gi...@apache.org>.
romainfrancois commented on pull request #7645:
URL: https://github.com/apache/arrow/pull/7645#issuecomment-655361522


   Fair enough, I now remember that levels (or whatever they are called in arrow) used to be part of the type, but not anymore, so no unification until they need to be converted back to a single R factor, which is then absolutely needed and the business of this PR. 
   
   I will apply my stash, and so let arrow deal with the printing. 
   
   Perhaps at some point we can have something similar to `dplyr::glimpse()` to print things more succinctly ... 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson closed pull request #7645: ARROW-8374 [R]: Table to vector of DictonaryType will error when Arrays don't have the same Dictionary per array

Posted by GitBox <gi...@apache.org>.
nealrichardson closed pull request #7645:
URL: https://github.com/apache/arrow/pull/7645


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on a change in pull request #7645: ARROW-8374 [R]: Table to vector of DictonaryType will error when Arrays don't have the same Dictionary per array

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #7645:
URL: https://github.com/apache/arrow/pull/7645#discussion_r451696851



##########
File path: r/src/array_to_vector.cpp
##########
@@ -180,7 +183,7 @@ class Converter_Date32 : public Converter_SimpleArray<REALSXP> {
   }
 
   Status Ingest_some_nulls(SEXP data, const std::shared_ptr<arrow::Array>& array,
-                           R_xlen_t start, R_xlen_t n) const {
+                           R_xlen_t start, R_xlen_t n, size_t array_index) const {

Review comment:
       Why do these unrelated converters need an additional arg?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] romainfrancois commented on pull request #7645: ARROW-8374 [R]: Table to vector of DictonaryType will error when Arrays don't have the same Dictionary per array

Posted by GitBox <gi...@apache.org>.
romainfrancois commented on pull request #7645:
URL: https://github.com/apache/arrow/pull/7645#issuecomment-654675294


   Any reason why `ChunkedArray$print()` does not use the `ToString()` C++ method ? @nealrichardson 
   
   ``` r
   library(arrow, warn.conflicts = FALSE)
   
   f1 <- factor(c("a", "a"), levels = c("a", "b"))
   f2 <- factor(c("c"), levels = c("c", "d"))
   f3 <- factor(NA, levels = c("d"))
   
   ca <- ChunkedArray$create(f1, f2, f3)
   ca
   #> ChunkedArray
   #> <dictionary<values=string, indices=int8>>
   #> 
   #> -- dictionary:
   #>   [
   #>     "a",
   #>     "b"
   #>   ]
   #> -- indices:
   #>   [
   #>     0,
   #>     0
   #>   ]
   ```
   
   <sup>Created on 2020-07-07 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0.9001)</sup>
   
   
   I have a stashed commit that makes this: 
   
   ``` r
   library(arrow, warn.conflicts = FALSE)
   
   f1 <- factor(c("a", "a"), levels = c("a", "b"))
   f2 <- factor(c("c"), levels = c("c", "d"))
   f3 <- factor(NA, levels = c("d"))
   
   ca <- ChunkedArray$create(f1, f2, f3)
   ca
   #> ChunkedArray
   #> [
   #> 
   #>   -- dictionary:
   #>     [
   #>       "a",
   #>       "b"
   #>     ]
   #>   -- indices:
   #>     [
   #>       0,
   #>       0
   #>     ],
   #> 
   #>   -- dictionary:
   #>     [
   #>       "c",
   #>       "d"
   #>     ]
   #>   -- indices:
   #>     [
   #>       0
   #>     ],
   #> 
   #>   -- dictionary:
   #>     [
   #>       "d"
   #>     ]
   #>   -- indices:
   #>     [
   #>       null
   #>     ]
   #> ]
   ```
   
   <sup>Created on 2020-07-07 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0.9001)</sup>
   
   I can put this on another jira/pr though. 
   
   Independently, should the printed dictionary be the unified one ? For now, this PR only unifies on conversion back to R, but that does not seem right ?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] romainfrancois commented on a change in pull request #7645: ARROW-8374 [R]: Table to vector of DictonaryType will error when Arrays don't have the same Dictionary per array

Posted by GitBox <gi...@apache.org>.
romainfrancois commented on a change in pull request #7645:
URL: https://github.com/apache/arrow/pull/7645#discussion_r452154972



##########
File path: r/src/array_to_vector.cpp
##########
@@ -180,7 +183,7 @@ class Converter_Date32 : public Converter_SimpleArray<REALSXP> {
   }
 
   Status Ingest_some_nulls(SEXP data, const std::shared_ptr<arrow::Array>& array,
-                           R_xlen_t start, R_xlen_t n) const {
+                           R_xlen_t start, R_xlen_t n, size_t array_index) const {

Review comment:
       Because this only gets called from `Converter::Ingest_some_nulls()` so they all need to have the same interface. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org