Posted to jira@arrow.apache.org by "Nicola Crane (Jira)" <ji...@apache.org> on 2022/04/14 15:06:00 UTC

[jira] [Reopened] (ARROW-16157) [R] Inconsistent behavior for arrow datasets vs working in memory

     [ https://issues.apache.org/jira/browse/ARROW-16157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicola Crane reopened ARROW-16157:
----------------------------------

> [R] Inconsistent behavior for arrow datasets vs working in memory
> -----------------------------------------------------------------
>
>                 Key: ARROW-16157
>                 URL: https://issues.apache.org/jira/browse/ARROW-16157
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 7.0.0
>         Environment: Ubuntu 21.10
> R 4.1.3.
> Arrow 7.0.0
>            Reporter: Egill Axfjord Fridgeirsson
>            Assignee: Nicola Crane
>            Priority: Major
>
> When I generate a sparse matrix using indices from an Arrow dataset, I get inconsistent behavior: sometimes there are duplicated indices, which results in a matrix with values greater than one in some places. When I load the dataset into memory first, everything works as expected and all the values are one.
> Repro
> {code:java}
> library(Matrix)
> library(dplyr)
> library(arrow)
> sparseMatrix <- Matrix::rsparsematrix(1e5,1e3, 0.05, repr="T")
> dF <- data.frame(i=sparseMatrix@i + 1, j=sparseMatrix@j + 1)
> arrow::write_dataset(dF, path='./data/feather', format='feather')
> arrowDataset <- arrow::open_dataset('./data/feather', format='feather')
> # Run the code below a few times; at some point unique(newSparse@x) returns
> # more than just 1, indicating there are duplicate indices for the sparse
> # matrix (the values at those indices are then added together).
> newSparse <- Matrix::sparseMatrix(i = arrowDataset %>% pull(i),
>                                   j = arrowDataset %>% pull(j),
>                                   x = 1)
> unique(newSparse@x) # here is the bug, @x is the slot for values
> arrowInMemory <- arrowDataset %>% collect()
> # after loading in memory the output is never more than 1 no matter how 
> # often I run it
> newSparse <- Matrix::sparseMatrix(i = arrowInMemory %>% pull(i),
>                                   j = arrowInMemory %>% pull(j),
>                                   x = 1)
> unique(newSparse@x){code}
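
A minimal sketch of the symptom described above, using hypothetical toy indices rather than the reporter's data: when the same (i, j) pair is passed to Matrix::sparseMatrix more than once, the x values for that pair are added together, which is why duplicated indices show up as values greater than one. The anyDuplicated check on the index pairs is an illustrative diagnostic and is not part of the original report.

```r
library(Matrix)

# Hypothetical toy indices: the pair (1, 1) appears twice.
idx <- data.frame(i = c(1, 1, 2), j = c(1, 1, 2))

# Duplicated (i, j) pairs have their x values added together,
# so the entry at (1, 1) ends up greater than 1.
m <- Matrix::sparseMatrix(i = idx$i, j = idx$j, x = 1)
any(m@x > 1)

# Check the index pairs for duplicates before building the matrix.
has_duplicates <- anyDuplicated(idx) > 0
has_duplicates
```

Running the same check on the (i, j) pairs pulled from the dataset would show whether the pairs themselves are duplicated before the matrix is built.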



--
This message was sent by Atlassian Jira
(v8.20.1#820001)