Posted to issues@arrow.apache.org by "qifei9 (via GitHub)" <gi...@apache.org> on 2023/04/05 20:48:51 UTC

[GitHub] [arrow] qifei9 opened a new issue, #34910: `write_dataset` puts addtional things around dir name when write big tibble

qifei9 opened a new issue, #34910:
URL: https://github.com/apache/arrow/issues/34910

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   I have a big tibble of >500,000,000 rows. I want to use `write_dataset` to write it out, partitioned by the column "species", which contains the values "mouse" and "zebrafish".
   
   However, it creates the wrong directory names:
   ```
   [\n "mouse"\n]
   [\n "zebrafish"\n]
   ```
   instead of the correct ones:
   ```
   mouse
   zebrafish
   ```
   
   If I randomly sample a small fraction of the tibble, such as 30,000 rows, `write_dataset` creates the correct directory names.
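
   For reference, a minimal sketch of the kind of call described above. The toy tibble, the `value` column, and the output path are hypothetical stand-ins rather than the original data, and `hive_style = FALSE` is assumed because the expected directory names are bare values.

   ```r
   library(arrow)

   # Hypothetical stand-in for the real tibble (which has >500,000,000 rows).
   toy <- tibble::tibble(
     species = rep(c("mouse", "zebrafish"), each = 3),
     value = 1:6
   )

   # Hypothetical output location.
   out_dir <- file.path(tempdir(), "species_dataset")

   # hive_style = FALSE gives bare directory names such as "mouse";
   # the default (TRUE) would give "species=mouse" instead.
   write_dataset(toy, out_dir, partitioning = "species", hive_style = FALSE)

   # List the partition directories that were created.
   partition_dirs <- sort(list.files(out_dir))
   print(partition_dirs)
   ```

   On a small input like this one, the directories are expected to be named `mouse` and `zebrafish`.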
   
   ### Component(s)
   
   R


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [R] `write_dataset` puts addtional things around dir name when write big tibble [arrow]

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic closed issue #34910: [R] `write_dataset` puts addtional things around dir name when write big tibble
URL: https://github.com/apache/arrow/issues/34910



Re: [I] [R] `write_dataset` puts addtional things around dir name when write big tibble [arrow]

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on issue #34910:
URL: https://github.com/apache/arrow/issues/34910#issuecomment-1746569857

   I'm closing this as there's been no interaction on it in about 6 months.



[GitHub] [arrow] thisisnic commented on issue #34910: [R] `write_dataset` puts addtional things around dir name when write big tibble

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on issue #34910:
URL: https://github.com/apache/arrow/issues/34910#issuecomment-1503203000

   Thanks for reporting this @qifei9.  Can you tell me a bit more about your data source, and where it's from?
   
   And would you mind double-checking there haven't been any input errors by running something like:
   
   ```
   library(arrow)
   library(dplyr)
   
   my_tibble %>%
     distinct(grouping_col) %>%
     collect()
   ```
   
   replacing `my_tibble` and `grouping_col` with the relevant names from your dataset.



[GitHub] [arrow] westonpace commented on issue #34910: [R] `write_dataset` puts addtional things around dir name when write big tibble

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #34910:
URL: https://github.com/apache/arrow/issues/34910#issuecomment-1502366369

   I'm not sure I follow.  Are the `[`, `\n`, and `"` characters part of the directory filename?



[GitHub] [arrow] qifei9 commented on issue #34910: [R] `write_dataset` puts addtional things around dir name when write big tibble

Posted by "qifei9 (via GitHub)" <gi...@apache.org>.
qifei9 commented on issue #34910:
URL: https://github.com/apache/arrow/issues/34910#issuecomment-1503316839

   > Thanks for reporting this @qifei9. Can you tell me a bit more about your data source, and where it's from?
   > 
   > And would you mind double-checking there haven't been any input errors by running something like:
   > 
   > ```
   > library(arrow)
   > library(dplyr)
   > 
   > my_tibble %>%
   >   distinct(grouping_col) %>%
   >   collect()
   > ```
   > 
   > replacing `my_tibble` and `grouping_col` with the relevant names from your dataset.
   
   Many thanks. I am quite busy right now; I will try this once I have time.



[GitHub] [arrow] qifei9 commented on issue #34910: [R] `write_dataset` puts addtional things around dir name when write big tibble

Posted by "qifei9 (via GitHub)" <gi...@apache.org>.
qifei9 commented on issue #34910:
URL: https://github.com/apache/arrow/issues/34910#issuecomment-1502949596

   > I'm not sure I follow. Are the `[`, `\n`, and `"` characters part of the newly created directory filenames?
   
   Yes, they are. 
   
   If I read the dataset back into R and `collect()` it, the values of the "species" column used for partitioning change from the original `mouse` to the wrong form `[\n "mouse"\n]`.
   
   I tried setting the parameter `hive_style = TRUE`, but it did not help.
   
   I also tried partitioning on more than one level and changing the order of the levels. In some rare cases the problem did not occur, but I have not yet found a clue as to why.
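
   A rough sketch of the read-back check described above, assuming a small stand-in dataset written to a hypothetical temporary path (the data frame, the `value` column, and the path are illustrative, not the original data). It compares the directory names on disk with the `species` values that `open_dataset()` reconstructs from them, which is where the `[\n "mouse"\n]` form was observed on the full data.

   ```r
   library(arrow)
   library(dplyr)

   # Hypothetical small dataset and output path, for illustration only.
   dataset_dir <- file.path(tempdir(), "readback_demo")
   toy <- data.frame(
     species = c("mouse", "zebrafish", "mouse"),
     value = 1:3
   )
   write_dataset(toy, dataset_dir, partitioning = "species")

   # Directory names as they exist on disk.
   print(list.files(dataset_dir))

   # Partition values as Arrow parses them back from those directory names.
   species_values <- open_dataset(dataset_dir) %>%
     distinct(species) %>%
     collect() %>%
     pull(species) %>%
     sort()
   print(species_values)
   ```

   If the values printed by the second step differ from the original column values, the directory names themselves were written incorrectly, rather than being misread afterwards.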

