Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2021/12/13 09:34:00 UTC

[jira] [Commented] (ARROW-15045) PyArrow SIGSEGV error when using UnionDatasets

    [ https://issues.apache.org/jira/browse/ARROW-15045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17458240#comment-17458240 ] 

Joris Van den Bossche commented on ARROW-15045:
-----------------------------------------------

bq. At the time of writing, the folder holds about 30 GB across 1.85M files.

Do I read this correctly that the folder contains 1.85 million files? That would mean every file is only around 16 KB on average?
(We should of course still look into the segfault, but as a general comment, if the above is correct: I would recommend creating fewer, larger files.)
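
For illustration, here is a minimal sketch of how such a compaction could look with {{pyarrow.dataset}} (the paths and partition field names are assumptions based on the exchange/symbol/date.parquet layout described in the report, not actual code from it):

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

# Assumed directory partitioning for the exchange/symbol/date.parquet layout.
part = ds.partitioning(
    pa.schema([("exchange", pa.string()), ("symbol", pa.string())])
)

# "data/" and "data_compacted/" are hypothetical paths.
src = ds.dataset("data/", format="parquet", partitioning=part)

# Rewriting the dataset partitioned by exchange/symbol collapses the many
# small per-date files into far fewer, larger files.
ds.write_dataset(src, "data_compacted/", format="parquet", partitioning=part)
{code}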

bq. So I tried to use a UnionDataset composed of single-exchange Datasets.

It's not really clear to me how you are using a UnionDataset to limit the number of folders/files that get read. (A union of which other datasets? What is the "single exchange Dataset"?)
Could you maybe provide some example code to illustrate the different workflows?
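
To make the question concrete, I would expect the two workflows to look roughly like the sketch below (the folder names and the filter column are hypothetical, since no code was included in the report):

{code:python}
import pyarrow.dataset as ds

# Workflow 1: a single Dataset over the whole tree, filtered at scan time.
full = ds.dataset("data/", format="parquet")
t1 = full.to_table(filter=ds.field("price") > 0)  # hypothetical column/filter

# Workflow 2 (as I understand it): one Dataset per exchange folder, combined
# into a UnionDataset so each process only discovers the files it needs.
children = [ds.dataset("data/some_exchange/", format="parquet")]  # hypothetical path
union = ds.dataset(children)  # passing a list of Datasets yields a UnionDataset
t2 = union.to_table(filter=ds.field("price") > 0)
{code}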





> PyArrow SIGSEGV error when using UnionDatasets
> ----------------------------------------------
>
>                 Key: ARROW-15045
>                 URL: https://issues.apache.org/jira/browse/ARROW-15045
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 6.0.1
>         Environment: Fedora Linux 35 (Workstation Edition), AMD Ryzen 5950X.
>            Reporter: Thomas Cercato
>            Priority: Blocker
>
> h3. The context:
> I am using PyArrow to read a folder structured as {{exchange/symbol/date.parquet}}. The folder contains multiple exchanges, multiple symbols and multiple files. At the time of writing, the folder holds about 30 GB across 1.85M files.
> If I use a single PyArrow Dataset to read/manage the entire folder, the simplest process with just the dataset defined already occupies 2.3 GB of RAM. The problem is that I am instantiating this dataset in multiple processes, but since every process only needs some exchanges (typically just one), there is no need to read all folders and files in every single process.
> So I tried to use a UnionDataset composed of single-exchange Datasets. In this way, every process only loads the required folders/files as a dataset. In a simple test, every process then occupied just 868 MB of RAM, a 63% reduction.
> h3. The problem:
> When using a single Dataset for the entire folder/files, I have no problem at all. I can read filtered data without issues and it is very fast.
> But when I read filtered data from the UnionDataset, I always get a {{Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)}} error. After looking for every possible source of the problem, I noticed that if I create a dummy folder with multiple exchanges but only some symbols, in order to limit the number of files to read, I don't get that error and it works normally. If I then copy in new symbol folders (any of them), I get that error again.
> I came to think that the problem is not in my code, but is instead linked to the number of files that the UnionDataset is able to manage.
> Am I correct or am I doing something wrong? Thank you all, have a nice day and good work.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)