Posted to issues@arrow.apache.org by "Egill Axfjord Fridgeirsson (Jira)" <ji...@apache.org> on 2022/04/08 17:39:00 UTC

[jira] [Created] (ARROW-16157) [R] Inconsistent behavior for arrow datasets vs working in memory

Egill Axfjord Fridgeirsson created ARROW-16157:
--------------------------------------------------

             Summary: [R] Inconsistent behavior for arrow datasets vs working in memory
                 Key: ARROW-16157
                 URL: https://issues.apache.org/jira/browse/ARROW-16157
             Project: Apache Arrow
          Issue Type: Bug
    Affects Versions: 7.0.0
         Environment: Ubuntu 21.10
R 4.1.3.
Arrow 7.0.0
            Reporter: Egill Axfjord Fridgeirsson


When I generate a sparse matrix using indices pulled from an Arrow dataset, I get inconsistent behavior: sometimes there are duplicated indices, resulting in a matrix with values greater than one in some places. When I load the dataset into memory first, everything works as expected and all the values are one.
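
For context, Matrix::sparseMatrix() sums the x values of entries that share the same (i, j) pair, which is why a duplicated index shows up as a value greater than one. A minimal sketch with made-up toy indices (hypothetical, not taken from the repro):

```r
library(Matrix)

# Hypothetical toy indices: the pair (1, 1) appears twice
i <- c(1, 1, 2)
j <- c(1, 1, 2)

toy <- Matrix::sparseMatrix(i = i, j = j, x = 1)

# Values stored in the matrix; the duplicated pair is summed
print(toy@x)

# Number of repeated (i, j) pairs in the input
n_dup <- sum(duplicated(data.frame(i = i, j = j)))
print(n_dup)
```

Counting duplicated pairs on the raw indices, as in the last step, separates "the indices are wrong" from "the matrix construction is wrong".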

Repro
{code:java}
library(Matrix)
library(dplyr)
library(arrow)

sparseMatrix <- Matrix::rsparsematrix(1e5,1e3, 0.05, repr="T")

dF <- data.frame(i=sparseMatrix@i + 1, j=sparseMatrix@j + 1)

arrow::write_dataset(dF, path='./data/feather', format='feather')
arrowDataset <- arrow::open_dataset('./data/feather', format='feather')

# Run the code below a few times; at some point unique(newSparse@x) returns
# values other than 1, indicating duplicate indices in the sparse matrix
# (sparseMatrix() sums the values at duplicated indices).
newSparse <- Matrix::sparseMatrix(i = arrowDataset %>% pull(i) ,
                                  j = arrowDataset %>% pull(j),
                                  x = 1)
unique(newSparse@x) # here is the bug, @x is the slot for values


arrowInMemory <- arrowDataset %>% collect()

# after loading in memory the output is never more than 1 no matter how 
# often I run it
newSparse <- Matrix::sparseMatrix(i = arrowInMemory %>% pull(i) ,
                                  j = arrowInMemory %>% pull(j),
                                  x = 1)
unique(newSparse@x){code}
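
A possible explanation (an assumption, not confirmed in this report) is that each pull() on the dataset runs its own scan, and two scans may return rows in different orders, so i and j no longer line up. A rough sketch to check this by counting duplicated (i, j) pairs when both columns come from one scan versus two separate scans; the temporary path, the helper countDupPairs, and the variable names are illustrative only:

```r
library(Matrix)
library(dplyr)
library(arrow)

# Rebuild the data from the repro in a temporary directory (illustrative path)
m <- Matrix::rsparsematrix(1e5, 1e3, 0.05, repr = "T")
dF <- data.frame(i = m@i + 1, j = m@j + 1)
path <- file.path(tempdir(), "feather")
arrow::write_dataset(dF, path = path, format = "feather")
arrowDataset <- arrow::open_dataset(path, format = "feather")

# Hypothetical helper: encode each (i, j) pair as one number (j is at most
# 1000 here) and count how many pairs repeat
countDupPairs <- function(i, j) sum(duplicated((i - 1) * 1e3 + j))

# Fetch both columns in ONE scan so each row's i and j stay paired
pairs <- arrowDataset %>% select(i, j) %>% collect()
dupSingleScan <- countDupPairs(pairs$i, pairs$j)

# Fetch the columns in TWO separate scans, as the repro does with pull()
iCol <- arrowDataset %>% select(i) %>% collect()
jCol <- arrowDataset %>% select(j) %>% collect()
dupTwoScans <- countDupPairs(iCol$i, jCol$j)

print(c(single = dupSingleScan, two = dupTwoScans))
```

If the assumption holds, the single-scan count stays at zero while the two-scan count is sometimes nonzero across runs.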



--
This message was sent by Atlassian Jira
(v8.20.1#820001)