Posted to dev@metamodel.apache.org by Kasper Sørensen <i....@gmail.com> on 2013/07/26 10:32:48 UTC

[PATCH] Faster CsvDataContext implementation for single-line values

Hi everyone,

For one of our applications using MetaModel we have a customer with
quite large files (100+ million records per file), and reading through
them takes quite some time, even though we know the CSV module to be
one of MetaModel's fastest.

But these particular files (and probably many others) have a
characteristic that we could exploit for an optimization: they don't
allow values that span multiple lines. For instance, consider:

name,company
Kasper Sørensen, Human Inference
Ankit Kumar, Human Inference

This is a rather normal CSV layout. But our CSV parser also allows
multiline values (if quoted), like this:

"name","company"
"Kasper Sørensen","Human
Inference"
"Ankit Kumar","Human Inference"

Now, the optimization I had in mind is to delay the actual parsing of
lines until the point where a value is needed. But this won't work
with multiline values, since we wouldn't know whether to reserve a
single line or multiple lines for the delayed/lazy CSV parser. So the
module is slowed down by a blocking CSV parsing operation for each
row.

But if we add a flag by which the user declares that he only
expects/accepts single-line values, then we can simply read through
the file with something like a BufferedReader and return Row objects
that encapsulate the raw String line. The parsing of each line is then
delayed and can potentially be spread across multiple threads.
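
To make the idea concrete, here is a minimal sketch of such a lazy
Row (class and method names are hypothetical and the tokenization is
deliberately naive; the actual patch is in the gist below [1]):

    import java.util.regex.Pattern;

    /**
     * Sketch of a lazily parsed row: it keeps the raw line from the file
     * and only tokenizes it when a value is actually requested.
     */
    public class SingleLineCsvRow {

        private final String rawLine;
        private final char separatorChar;
        private volatile String[] values; // parsed on first access

        public SingleLineCsvRow(String rawLine, char separatorChar) {
            this.rawLine = rawLine;
            this.separatorChar = separatorChar;
        }

        public String getValue(int columnIndex) {
            String[] v = values;
            if (v == null) {
                // The expensive tokenization happens here, on the consuming
                // thread, instead of blocking the single reader thread.
                // Naive split for illustration only; a real implementation
                // must also honor the quote and escape characters.
                v = rawLine.split(Pattern.quote(String.valueOf(separatorChar)), -1);
                values = v;
            }
            return v[columnIndex];
        }
    }

Since getValue(...) does the tokenization, the cost moves from the
reader thread to whichever consumer thread first touches the row.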

I made a quick prototype patch [1] of this idea (still a few
improvements to be made), and my quick-and-dirty tests showed up to a
~65% performance increase in a multithreaded consumer environment!
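
To illustrate where the multithreaded gain comes from, here is a
hedged sketch of a consumer setup (file name and thread count are made
up, and it reuses the hypothetical SingleLineCsvRow above): the reader
thread only does cheap readLine() calls, while the worker threads pay
for the parsing.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class SingleLineReadExample {

        public static void main(String[] args) throws IOException, InterruptedException {
            ExecutorService consumers = Executors.newFixedThreadPool(4);
            try (BufferedReader reader = new BufferedReader(new FileReader("customers.csv"))) {
                reader.readLine(); // skip the header line
                String line;
                while ((line = reader.readLine()) != null) {
                    SingleLineCsvRow row = new SingleLineCsvRow(line, ',');
                    // Parsing happens inside getValue(...), on a worker thread.
                    consumers.execute(() -> process(row.getValue(0), row.getValue(1)));
                }
            }
            consumers.shutdown();
            consumers.awaitTermination(1, TimeUnit.MINUTES);
        }

        private static void process(String name, String company) {
            System.out.println(name + " works at " + company);
        }
    }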

I did three runs before and three runs after the improvement, on a 30k
record file. The results are the number of milliseconds needed to read
through all the values of the file:

        // results with old impl: [13908, 13827, 14577]. Total= 42312

        // results with new impl: [8567, 8965, 8154]. Total= 25686
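
(For reference: 42312 / 25686 ≈ 1.65, i.e. the lazy implementation
gets through roughly 65% more rows per second, which is where the ~65%
figure comes from.)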

The test that I ran is the class called 'CsvBigFileMemoryTest.java'.

What do you guys think? Is it feasible to make an optimization like
this for a specific type of CSV file?

[1] https://gist.github.com/kaspersorensen/6087230

Re: [PATCH] Faster CsvDataContext implementation for single-line values

Posted by Kasper Sørensen <i....@gmail.com>.
I'll go ahead with the improvement commit then.

2013/8/1 Manuel van den Berg <Ma...@humaninference.com>:
> +1 on this in terms of functionality. We use CSV data stores intensively, so this is a need-to-have for us.
>
> Didn't check the code though.
>
> Manuel
> [remainder of quoted thread trimmed; the quoted messages appear in full below]

RE: [PATCH] Faster CsvDataContext implementation for single-line values

Posted by Manuel van den Berg <Ma...@HumanInference.com>.
+1 on this in terms of functionality. We use CSV data stores intensively, so this is a need-to-have for us.

Didn't check the code though.

Manuel
________________________________________
From: Kasper Sørensen [i.am.kasper.sorensen@gmail.com]
Sent: 26 July 2013 11:26
To: dev@metamodel.incubator.apache.org
Subject: Re: [PATCH] Faster CsvDataContext implementation for single-line values

Slight correction: I had left out a bit of functionality for sorting
the fields correctly in the new Row implementation. With that in
place, the performance improvement is "only" about 60%, based on my
tests:

        // results with old impl: [13908, 13827, 14577]. Total= 42312

        // results with new impl: [9052, 9200, 8193]. Total= 26445


2013/7/26 Kasper Sørensen <i....@gmail.com>:
> [original message trimmed; quoted in full at the top of the thread]

Re: [PATCH] Faster CsvDataContext implementation for single-line values

Posted by Kasper Sørensen <i....@gmail.com>.
Slight correction: I had left out a bit of functionality for sorting
the fields correctly in the new Row implementation. With that in
place, the performance improvement is "only" about 60%, based on my
tests:

        // results with old impl: [13908, 13827, 14577]. Total= 42312

        // results with new impl: [9052, 9200, 8193]. Total= 26445


2013/7/26 Kasper Sørensen <i....@gmail.com>:
> [original message trimmed; quoted in full at the top of the thread]