Posted to issues@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2019/09/17 17:13:00 UTC
[jira] [Commented] (ARROW-6582) R's read_parquet() fails with embedded nuls in strings

    [ https://issues.apache.org/jira/browse/ARROW-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16931666#comment-16931666 ] 

Neal Richardson commented on ARROW-6582:
----------------------------------------

Thanks for the report. A few thoughts:

1. "embedded nul in string" is an error raised by R itself. Since it is being thrown in {{Table__to_dataframe}}, the Parquet file was already read into Arrow memory successfully, and the failure happens when R converts the Arrow table to a data frame. That helps isolate the issue.

2. Given that, you could use the {{col_select}} argument to {{read_parquet}} to read a few columns at a time and identify which column contains the nul, if you don't already know. If you don't need that column for what you're trying to do, you could leave it out of {{col_select}} and proceed.

3. If you can identify the offending column, it would be interesting to know what Arrow type it is. To find out, run something like

{code:r}
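# Keep the result as an Arrow Table instead of converting it to a data frame
tab <- read_parquet(file, as_tibble = FALSE)
# Print the schema, which lists each column's Arrow type
tab$schema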
{code}

and report back what type that column is.

4. Check your system locale and encoding and make sure they match the encoding of the data in the file. [Googling the error message|https://www.google.com/search?q=embedded+nul+in+string] suggests that encoding is often implicated.

5. How are these Parquet files generated? Same host? Or different system, platform, etc.? Does that tell you something useful about the locale/encoding you need to set in R to read the data?

6. If any of this leads you to a place where you can write out a sufficiently anonymized/obfuscated file that reproduces the error, that would of course be most helpful.
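
For what it's worth, here is a rough sketch of the column-by-column search from point 2 and the locale check from point 4. The helper name {{find_failing_columns}} and the stand-in reader {{fake_reader}} are hypothetical and not part of arrow. In practice, {{read_one}} would be a small wrapper that calls {{read_parquet}} with {{col_select}} set to a single column, and the column names would come from the schema in point 3. The sketch also assumes R's error messages are in English.

{code:r}
# Hypothetical helper (not part of arrow): call `read_one` on each column
# name and return the names whose read fails with R's embedded-nul error.
# Assumes English error messages when matching the text.
find_failing_columns <- function(columns, read_one) {
  failed <- vapply(columns, function(col) {
    tryCatch({
      read_one(col)
      FALSE
    }, error = function(e) grepl("embedded nul", conditionMessage(e)))
  }, logical(1))
  columns[failed]
}

# Stand-in reader for illustration only: base R raises the same error when
# bytes containing 0x00 are converted to a character string.
fake_reader <- function(col) {
  bytes <- if (col == "notes") {
    as.raw(c(0x41, 0x00, 0x42))
  } else {
    as.raw(c(0x41, 0x42))
  }
  rawToChar(bytes)
}

# Column names here are made up for the demonstration.
bad_columns <- find_failing_columns(c("id", "notes"), fake_reader)
print(bad_columns)

# Point 4: inspect the session's locale and encoding.
print(Sys.getlocale("LC_CTYPE"))
print(l10n_info())
{code}

Matching on the message text keeps unrelated read errors from being mistaken for the nul problem, so only the columns that actually trigger "embedded nul in string" are reported.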

> R's read_parquet() fails with embedded nuls in strings
> ------------------------------------------------------
>
>                 Key: ARROW-6582
>                 URL: https://issues.apache.org/jira/browse/ARROW-6582
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 0.14.1
>         Environment: Windows 10
> R 3.4.4
>            Reporter: John Cassil
>            Priority: Major
>
> Apologies if this issue isn't categorized or documented appropriately.  Please be gentle! :)
> As a heavy R user that normally interacts with parquet files using SparklyR, I have recently decided to try to use arrow::read_parquet() on a few parquet files that were on my local machine rather than in hadoop.  I was not able to proceed after several various attempts due to embedded nuls.  For example:
> try({df <- read_parquet('out_2019-09_data_1.snappy.parquet') })
> Error in Table__to_dataframe(x, use_threads = option_use_threads()) : 
>   embedded nul in string: 'INSTALL BOTH LEFT FRONT AND RIGHT FRONT  TORQUE ARMS\0 ARMS'
> Is there a solution to this?
> I have also hit roadblocks with embedded nuls in the past with csvs using data.table::fread(), but readr::read_delim() seems to handle them gracefully with just a warning after proceeding.
> Apologies that I do not have a handy reprex. I don't know if I can even recreate a parquet file with embedded nuls using arrow if it won't let me read one in, and I can't share this file due to company restrictions.
> Please let me know how I can be of any more help!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)