Posted to dev@arrow.apache.org by Micah Kornfield <em...@gmail.com> on 2019/11/19 04:54:14 UTC

Re: pyarrow read_csv with different amount of columns per row

+dev@arrow to see if there is a more definitive answer, but I don't believe
this type of functionality is supported currently.




On Fri, Nov 15, 2019 at 1:42 AM Elisa Scandellari <
elisa.scandellari@gmail.com> wrote:

> Hi,
> I'm trying to improve the performance of my program that loads csv data
> and manipulates it.
> My CSV file contains 14 million rows and has a variable number of columns.
> The first 27 columns will always be available, and a row can have up to 16
> more columns for a total of 43.
>
> Using vanilla pandas I've found this workaround:
> ```
> largest_column_count = 0
> with open(data_file, 'r') as temp_f:
>     lines = temp_f.readlines()
>     for l in lines:
>         column_count = len(l.split(',')) + 1
>         largest_column_count = column_count if largest_column_count < column_count else largest_column_count
> temp_f.close()
> column_names = [i for i in range(0, largest_column_count)]
> all_columns_df = pd.read_csv(file, header=None, delimiter=',', names=column_names,
>                              dtype='category').replace(pd.np.nan, '', regex=True)
> ```
> This creates the table with all my data, plus empty cells where the data is
> not available.
> With a smaller file, this works perfectly well. With the complete file, my
> memory usage goes through the roof.
>
> I've been reading about Apache Arrow and, after a few attempts to load a
> structured csv file (same number of columns for every row), I'm extremely
> impressed.
> I've tried to load my data file, using the same concept as above:
> ```
> fixed_column_names = [str(i) for i in range(0, 27)]
> extra_column_names = [str(i) for i in range(len(fixed_column_names), largest_column_count)]
> total_columns = fixed_column_names
> total_columns.extend(extra_column_names)
> read_options = csv.ReadOptions(column_names=total_columns)
> convert_options = csv.ConvertOptions(include_columns=total_columns,
>                                      include_missing_columns=True,
>                                      strings_can_be_null=True)
> table = csv.read_csv(edr_filename,
>                      read_options=read_options, convert_options=convert_options)
> ```
> but I get the following error:
> Exception: CSV parse error: Expected 43 columns, got 32
>
> I need to use the csv module provided by pyarrow (```from pyarrow import
> csv```); otherwise I wouldn't be able to create the pyarrow table to then
> convert to pandas.
>
> I guess that the csv reader provided by pyarrow is more streamlined than
> Python's full csv module.
>
> Is there any way I can load this file? Maybe using some ReadOptions and/or
> ConvertOptions?
> I'd be using pandas to manipulate the data after it's been loaded.
>
> Thank you in advance
>
>

Re: pyarrow read_csv with different amount of columns per row

Posted by Elisa Scandellari <el...@gmail.com>.
Hi,
Thank you for your replies. My solution is very similar to Maarten's
suggestion.
Using the method from my first email, I wrote a new file with only the data I
needed, padded to a fixed number of columns.
After that, I just load the new optimised file with Apache Arrow.
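
In outline, the loading step now looks something like this (a minimal sketch;
the file name and column count are illustrative):

```
from pyarrow import csv

# The normalised file has no header row and exactly 43 columns per line,
# so the column names are supplied explicitly.
column_names = [str(i) for i in range(43)]
table = csv.read_csv('data_padded.csv',
                     read_options=csv.ReadOptions(column_names=column_names))
df = table.to_pandas()
```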

Thank you all,
Elisa


Re: pyarrow read_csv with different amount of columns per row

Posted by Maarten Ballintijn <ma...@xs4all.nl>.
Hi Elisa,

One option is to preprocess the file and add the missing columns.
You can do this in two passes (reading once to determine the maximum number
of columns, and once more to write the lines out padded to that number of
columns). This does not need to take a lot of memory, as you can read line by
line (see
http://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects).
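
For example (a minimal sketch, assuming a plain comma delimiter with no
quoted fields; the file names are illustrative):

```
# Pass 1: find the widest row without holding the whole file in memory
with open('data.csv') as f:
    max_cols = max(len(line.split(',')) for line in f)

# Pass 2: write every row back out, padded to max_cols fields
with open('data.csv') as src, open('data_padded.csv', 'w') as dst:
    for line in src:
        fields = line.rstrip('\n').split(',')
        dst.write(','.join(fields + [''] * (max_cols - len(fields))) + '\n')
```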

Another option is to make an iterator to do this on the fly.
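
A rough sketch of that variant (same assumptions as above, and untested
against your data) is a file-like wrapper that pads each line as it is read;
pyarrow's csv.read_csv and pandas' read_csv should both be able to consume it
directly:

```
import io

class PaddedCSV(io.RawIOBase):
    """Pad each row of a comma-separated text file to ncols fields."""

    def __init__(self, path, ncols):
        self._lines = (self._pad(line, ncols) for line in open(path))
        self._buf = b''

    @staticmethod
    def _pad(line, ncols):
        fields = line.rstrip('\n').split(',')
        return (','.join(fields + [''] * (ncols - len(fields))) + '\n').encode()

    def readable(self):
        return True

    def readinto(self, b):
        # Fill the caller's buffer from padded lines; returning 0 signals EOF.
        while len(self._buf) < len(b):
            chunk = next(self._lines, None)
            if chunk is None:
                break
            self._buf += chunk
        n = min(len(b), len(self._buf))
        b[:n], self._buf = self._buf[:n], self._buf[n:]
        return n

# e.g., with `from pyarrow import csv`:
#     table = csv.read_csv(PaddedCSV('data.csv', 43), read_options=read_options)
```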

Either way the resulting DataFrame is going to be large:

14e6 rows * 43 columns * 8 bytes/float ~ 5 GB (in the best case)

Cheers,
Maarten




Re: pyarrow read_csv with different amount of columns per row

Posted by Antoine Pitrou <an...@python.org>.
No, there is no way to load CSV files with irregular dimensions, and we
don't have any plans currently to support them.  Sorry :-(

Regards

Antoine.

