Posted to issues@arrow.apache.org by "Egill Axfjord Fridgeirsson (Jira)" <ji...@apache.org> on 2022/04/08 17:39:00 UTC
[jira] [Created] (ARROW-16157) [R] Inconsistent behavior for arrow datasets vs working in memory
Egill Axfjord Fridgeirsson created ARROW-16157:
--------------------------------------------------
Summary: [R] Inconsistent behavior for arrow datasets vs working in memory
Key: ARROW-16157
URL: https://issues.apache.org/jira/browse/ARROW-16157
Project: Apache Arrow
Issue Type: Bug
Affects Versions: 7.0.0
Environment: Ubuntu 21.10
R 4.1.3.
Arrow 7.0.0
Reporter: Egill Axfjord Fridgeirsson
When I generate a sparse matrix using indices pulled from an Arrow dataset, I get inconsistent behavior: sometimes there are duplicated indices, resulting in a matrix with values greater than 1 in some places. When I load the dataset into memory first, everything works as expected and all the values are 1.
Repro
{code:java}
library(Matrix)
library(dplyr)
library(arrow)
sparseMatrix <- Matrix::rsparsematrix(1e5,1e3, 0.05, repr="T")
dF <- data.frame(i=sparseMatrix@i + 1, j=sparseMatrix@j + 1)
arrow::write_dataset(dF, path='./data/feather', format='feather')
arrowDataset <- arrow::open_dataset('./data/feather', format='feather')
# run the below a few times; at some point the output of
# unique(newSparse@x) is more than just 1, indicating there are duplicate
# indices for the sparse matrix (the values at those indices are then summed)
newSparse <- Matrix::sparseMatrix(i = arrowDataset %>% pull(i),
                                  j = arrowDataset %>% pull(j),
                                  x = 1)
unique(newSparse@x) # here is the bug, @x is the slot for values
arrowInMemory <- arrowDataset %>% collect()
# after loading into memory the output is never more than 1, no matter how
# often I run it
newSparse <- Matrix::sparseMatrix(i = arrowInMemory %>% pull(i),
                                  j = arrowInMemory %>% pull(j),
                                  x = 1)
unique(newSparse@x){code}
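The duplicated indices can also be checked for directly, without building the sparse matrix. The following is a minimal sketch of such a check, not a confirmed diagnosis. It assumes that collecting both columns in one pass keeps i and j paired row by row, while the two separate pull() calls might not return rows in the same order (I have not verified this). The names tmpPath, pairedIdx and separateIdx are hypothetical, and the matrix is smaller than in the repro to keep it quick.

```r
library(Matrix)
library(dplyr)
library(arrow)

# Smaller stand-in for the matrix in the repro; the in-memory case above
# suggests the (i, j) pairs in dF are unique
sparseMatrix <- Matrix::rsparsematrix(1e4, 1e2, 0.05, repr = "T")
dF <- data.frame(i = sparseMatrix@i + 1, j = sparseMatrix@j + 1)

# tmpPath is a hypothetical scratch location for this check
tmpPath <- file.path(tempdir(), "feather-check")
arrow::write_dataset(dF, path = tmpPath, format = "feather")
arrowDataset <- arrow::open_dataset(tmpPath, format = "feather")

# Both columns collected in one pass, so i and j stay paired row by row
pairedIdx <- arrowDataset %>% select(i, j) %>% collect()
nDupPaired <- anyDuplicated(pairedIdx)

# Each column pulled on its own, as in the repro
separateIdx <- data.frame(i = arrowDataset %>% pull(i),
                          j = arrowDataset %>% pull(j))
nDupSeparate <- anyDuplicated(separateIdx)

# anyDuplicated() returns 0 when no row is duplicated, otherwise the index
# of the first duplicated row
print(c(paired = nDupPaired, separate = nDupSeparate))
```

If the separate count is sometimes nonzero across repeated runs while the paired count stays at 0, that would point at the two pull() calls being misaligned rather than at Matrix::sparseMatrix itself.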