Posted to dev@metamodel.apache.org by "Kasper Sørensen (JIRA)" <ji...@apache.org> on 2013/09/18 10:58:52 UTC

[jira] [Updated] (METAMODEL-5) Faster CsvDataContext implementation for single-line values

     [ https://issues.apache.org/jira/browse/METAMODEL-5?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kasper Sørensen updated METAMODEL-5:
------------------------------------

    Fix Version/s: 4.0
    
> Faster CsvDataContext implementation for single-line values
> -----------------------------------------------------------
>
>                 Key: METAMODEL-5
>                 URL: https://issues.apache.org/jira/browse/METAMODEL-5
>             Project: Metamodel
>          Issue Type: Improvement
>            Reporter: Kasper Sørensen
>            Assignee: Kasper Sørensen
>             Fix For: 4.0
>
>
> For one of our applications using MetaModel we have a customer with
> quite large files (100+ million records per file), and reading through
> them takes considerable time, even though we know the CSV module to be
> one of MetaModel's fastest.
> But these particular files (and probably many others) have a
> characteristic that we could exploit for an optimization: they contain
> no values that span multiple lines. For instance, consider:
> name,company
> Kasper Sørensen, Human Inference
> Ankit Kumar, Human Inference
> This is a rather normal CSV layout. But our CSV parser also allows
> multiline values (if quoted), like this:
> "name","company"
> "Kasper Sørensen","Human
> Inference"
> "Ankit Kumar","Human Inference"
> Now the optimization I had in mind is to delay the actual parsing of
> lines until the point where a value is needed. But this won't work with
> multiline values, since we wouldn't know whether to reserve a single
> line or multiple lines for the delayed/lazy CSV parser. The module is
> therefore slowed down by a blocking CSV parsing operation for each row.
> But if we give the user a flag for declaring that only single-line
> values are expected, then we can simply read through the file with
> something like a BufferedReader and return Row objects that
> encapsulate the raw String line. Parsing of that line is then delayed
> and can potentially be made multithreaded.
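> As a minimal sketch of that idea (illustrative only, not the actual
> patch from [1]; class and method names are made up), a lazy row could
> hold the raw line and only split it into values on first access:
>
> import java.io.BufferedReader;
> import java.io.FileReader;
> import java.io.IOException;
>
> // Hypothetical lazy row: keeps the raw CSV line, parses on demand.
> class LazySingleLineRow {
>     private final String rawLine;
>     private String[] values; // populated on first getValue() call
>
>     LazySingleLineRow(String rawLine) {
>         this.rawLine = rawLine;
>     }
>
>     public String getValue(int index) {
>         if (values == null) {
>             // Naive comma split for brevity; a real implementation
>             // still has to honor quoting and escaping within the line.
>             values = rawLine.split(",", -1);
>         }
>         return values[index];
>     }
> }
>
> // The reader itself just streams raw lines, never blocking on parsing:
> class SingleLineCsvReader {
>     static void readAll(String path) throws IOException {
>         try (BufferedReader reader =
>                 new BufferedReader(new FileReader(path))) {
>             String line;
>             while ((line = reader.readLine()) != null) {
>                 LazySingleLineRow row = new LazySingleLineRow(line);
>                 // hand 'row' to a consumer; parsing can happen on
>                 // another thread, or be skipped for unused rows
>             }
>         }
>     }
> }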
> I made a quick prototype patch [1] of this idea (still a few
> improvements to be made), and my quick'n'dirty tests showed up to a
> ~65% performance increase in a multithreaded consumer environment!
> I did three runs before and after the improvements on a 30k record
> file. The results are the number of milliseconds used to read through
> all the values of the file:
>         // results with old impl: [13908, 13827, 14577]. Total= 42312
>         // results with new impl: [8567, 8965, 8154]. Total= 25686
> The test that I ran is the class called 'CsvBigFileMemoryTest.java'.
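> For reference, a minimal timing harness in the same spirit could look
> like this (a sketch only, not the actual test code; the file path
> argument and the naive comma split are simplifications):
>
> import java.io.BufferedReader;
> import java.io.FileReader;
> import java.io.IOException;
>
> public class CsvReadBenchmark {
>     public static void main(String[] args) throws IOException {
>         for (int run = 0; run < 3; run++) {
>             long start = System.currentTimeMillis();
>             long cells = 0;
>             try (BufferedReader reader =
>                     new BufferedReader(new FileReader(args[0]))) {
>                 String line;
>                 while ((line = reader.readLine()) != null) {
>                     // touch every value so parsing cost is included
>                     cells += line.split(",", -1).length;
>                 }
>             }
>             System.out.println("run " + run + ": "
>                     + (System.currentTimeMillis() - start) + " ms, "
>                     + cells + " values");
>         }
>     }
> }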
> What do you guys think? Is it feasible to make an optimization like
> this for a specific type of CSV file?
> [1] https://gist.github.com/kaspersorensen/6087230

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira