Posted to dev@arrow.apache.org by Angelo Casalan <ac...@gmail.com> on 2023/01/27 01:43:02 UTC

R arrow package question

Hi,

I hope you are well. I wish to ask how I can resolve this error:

"CSV conversion error to int64: invalid value"


To give an idea of my dataset: I have 4 CSV files, all placed in a local folder.


The code below worked when importing:


arrow<-open_dataset(
sources="csv location",
format="csv")


However, when I run:


arrow %>% count(column) %>% collect()
nrow(arrow %>% collect)

head(arrow %>% collect(),10 )

I always get the same error message: "Invalid: In CSV column #12: Row
#580. CSV conversion error to int64: invalid value"

I tried going back to open_dataset() and passing a schema(), in which the
column that is giving me problems is set as utf8 or sometimes str.

schema(
col=utf8(),
other nth columns
)

But I still encounter the same problem.

The code below fails to work as well.

arrow2<-arrow_table(arrow)

Thanks in advance if you can help me.

-- 
Regards,

Angelo Casalan
Statistical Methodology Unit

Re: R arrow package question

Posted by Dewey Dunnington <de...@voltrondata.com.INVALID>.

In case it hasn't already been mentioned here, I wonder if manually setting
`schema()` would help. You're correct that the invalid value isn't
scientific notation (i.e., it's a blank string) so maybe that column should
be a string column instead. You could get the guessed schema from the
original open_dataset(), modify it to change any problematic columns to
"string" type, then open the dataset again and try to collect (example
below). I am guessing that disk.frame and arrow have different methods they
use to guess schemas which is why you're seeing the difference.

```
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()`
#> for more information.

temp <- tempfile()
writeLines("col1,col2\na,1\nb,2\n", temp)
cat(readLines(temp), sep = "\n")
#> col1,col2
#> a,1
#> b,2

(ds <- open_dataset(temp, format = "csv"))
#> FileSystemDataset with 1 csv file
#> col1: string
#> col2: int64
schema <- ds$schema
schema$col2 <- string()

(ds2 <- open_dataset(temp, format = "csv", schema = schema))
#> FileSystemDataset with 1 csv file
#> col1: string
#> col2: string
```

Re: R arrow package question

Posted by Nic Crane <th...@gmail.com>.

Hi Angelo,

The original code with just `open_dataset()` works because it creates a
dataset without actually pulling the data into your R session. The
subsequent commands you tried (i.e. those involving `collect()`) read in the
files, and the error occurs when the data is read in.
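
As a rough illustration of that difference, here is a minimal sketch. The
folder, file names and column names are made up for the demo, and the exact
behaviour may vary with the arrow version, so treat it as an assumption to
check rather than a guaranteed reproduction. It assumes the dplyr package is
installed, since `collect()` comes from there.

```r
library(arrow, warn.conflicts = FALSE)

# Made-up demo data: the first file is clean, the second has a lone space
# in the otherwise integer `value` column.
demo_dir <- tempfile("csv_lazy_")
dir.create(demo_dir)
writeLines(c("id,value", "1,10", "2,20"), file.path(demo_dir, "part-1.csv"))
writeLines(c("id,value", "3, ", "4,40"), file.path(demo_dir, "part-2.csv"))

# Opening the dataset only discovers the files and infers a schema;
# the rows themselves are not converted yet.
ds <- open_dataset(demo_dir, format = "csv")
print(ds$schema)

# Collecting forces every file to be parsed, so a conversion problem
# would only show up at this point. tryCatch() keeps the script running.
result <- tryCatch(dplyr::collect(ds), error = function(e) e)
if (inherits(result, "error")) {
  message("collect() failed: ", conditionMessage(result))
} else {
  print(result)
}
```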

It looks like there's an invalid value in your dataset which is causing it
to fail to load.  From the error message you see there, it looks like it's
in the 12th column of your data in row 580.  I think when Jacob asked "have
you checked the value there", another way of phrasing what he said would be
to ask if you have manually checked the contents of whichever CSV is
causing the problem, in row 580 and column 12, to see what value is there?
(rather than checking the data type/value reported by Arrow).

It's going to be tricky to help diagnose the issue without a reproducible
example. If I'm working with a larger dataset, I usually narrow down the
issue by dividing it into two smaller datasets and running the code on each
to see which one contains the problematic row, and then keep going until I
find the row which is failing to load.  If you can get to the point where
you can pinpoint the exact values which are causing problems, this will be
the quickest way we can help you.
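
One way to look at the raw value directly, without going through Arrow, is
sketched below using only base R. `raw_cell()` is a hypothetical helper
written for this illustration, not an arrow function. It assumes simple CSVs
(no quoted fields containing commas or line breaks) and assumes the row number
in the error counts data rows after the header, which is worth verifying by
also inspecting the neighbouring rows.

```r
# Hypothetical helper: return the raw text of one cell from every CSV in a
# folder, so it can be compared with the position given in the error message.
raw_cell <- function(folder, row, column, header = TRUE) {
  files <- list.files(folder, pattern = "\\.csv$", full.names = TRUE,
                      ignore.case = TRUE)
  # Skip the header line when translating a data row into a line number.
  line_no <- if (header) row + 1L else row
  vapply(files, function(f) {
    lines <- readLines(f, n = line_no, warn = FALSE)
    if (length(lines) < line_no) return(NA_character_)
    # Appending a comma stops strsplit() from dropping a trailing empty field.
    fields <- strsplit(paste0(lines[line_no], ","), ",", fixed = TRUE)[[1]]
    if (length(fields) < column) NA_character_ else fields[column]
  }, character(1))
}

# Made-up demo file with a lone space in data row 2, column 2.
demo_dir <- tempfile("csv_demo_")
dir.create(demo_dir)
writeLines(c("id,value", "1,10", "2, ", "3,30"),
           file.path(demo_dir, "part-1.csv"))

cells <- raw_cell(demo_dir, row = 2, column = 2)
print(cells)  # printing a character vector shows quotes, so blanks are visible
```

For the error in this thread, the call would be something like
`raw_cell("location of csv files", row = 580, column = 12)`.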

Best wishes,

Nic

Re: R arrow package question

Posted by Nic Crane <th...@gmail.com>.

Hi Angelo,

I think what might be happening here is that you have space characters in
your integer column, which are causing problems.

I created what could be a reproducible example of your problem at:
https://gist.github.com/thisisnic/af265166d5cd1ebce605cf3e478ee6d8

In short, can you try including the parameter (and values) `null_values =
c("", " ", "NA")` in your call to `open_dataset()`?  By default, empty
strings are set to NA values, but spaces are not, so this could be the
source of your error.
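
A minimal sketch of that suggestion follows. The demo folder, file names and
column names are invented for illustration, and it assumes dplyr is installed
for `collect()`. Only the `null_values` argument comes from the suggestion
above.

```r
library(arrow, warn.conflicts = FALSE)

# Made-up demo data: two small CSV files, the second with a lone space
# in an otherwise integer column.
demo_dir <- tempfile("csv_nulls_")
dir.create(demo_dir)
writeLines(c("id,value", "1,10", "2,20"), file.path(demo_dir, "part-1.csv"))
writeLines(c("id,value", "3, ", "4,40"), file.path(demo_dir, "part-2.csv"))

# Treat empty strings, single spaces and "NA" as missing values in every file.
ds <- open_dataset(
  demo_dir,
  format = "csv",
  null_values = c("", " ", "NA")
)

df <- dplyr::collect(ds)
print(df)
```

If blanks in the real files can contain more than one space, each variant
would need to be listed in `null_values` as well.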

Nic

Re: R arrow package question

Posted by Angelo Casalan <ac...@gmail.com>.

Hi Everyone,

Thanks for the responses. I hope you are all well.

Hi Dewey. As to the problematic column error message: Invalid: Could not
open CSV input source 'folder/name.CSV': Invalid: In CSV column #30: Row
#5: CSV conversion error to int32: invalid value ''

I manually opened the CSV and saw that column 30 contains empty or blank
cells alongside integers. The same is true of some other columns.

I tried manually setting the columns via schema() as utf8() (the character
equivalent in R) or string().

I still get the same error message.

disk.frame read these columns mixing integers with spaces/blanks as
integers smoothly with no error messages at all. I think disk.frame read
the spaces/blanks as null values/NA in RStudio.

I am scripting all of this in RMarkdown, if that might be a factor.

Questions:
1. Is there a way in open_dataset() to automatically set all blanks as null
values across the multiple CSVs which I'm trying to load into R? Similar in
logic to pandas.read_csv('test.csv',na_values=['nan'])

Manual re-encoding is not feasible because I'm dealing with millions of data
points. I am also just a secondary user of this data, and my goal is to
automate this in R for my organization.

2. Are there other arrow functions/commands that can load multiple CSVs
from my local folder as an arrow object?

Regards,

-- 
Regards,

Angelo Casalan

Re: R arrow package question

Posted by Angelo Casalan <ac...@gmail.com>.

Hi Jacob,

Thanks. To provide some specifics on my query:

1. Which version of arrow are you running?
- 10.0.1

2. The error message provides an exact col,row position, have you checked
the value there?
Yes. It is int64. This is after running open_dataset without specifying
schema:
'''
arrow<-open_dataset(
sources="location of csv files",
format="csv"
)
'''

3. I have to correct the exact error message:
CSV conversion error to int64:invalid value ' '
I think arrow is telling me that the invalid value present is ' '.

4. This reminds me of cases where scientific notation is used for integers
which causes an error but that usually shows the value e.g. "1e6".
The invalid value is: ' '

5. I am really confused because when I use the disk.frame() function on the
same CSVs, I do not encounter this problem on this column; it is cleanly
encoded as a numeric variable.

Regards,

-- 
Regards,

Angelo Casalan
Statistical Methodology Unit