Posted to commits@daffodil.apache.org by "Steve Lawrence (Jira)" <ji...@apache.org> on 2021/02/10 17:21:00 UTC

[jira] [Commented] (DAFFODIL-2468) Unparsing an infoset for an 800MB CSV file runs out of memory

    [ https://issues.apache.org/jira/browse/DAFFODIL-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282587#comment-17282587 ] 

Steve Lawrence commented on DAFFODIL-2468:
------------------------------------------

Just an FYI, the attached CSV file does not round-trip, and that is expected. The CSV file has 27,289,600 lines, but if you parse and then unparse it, the resulting data has only 27,256,800 lines, exactly 32,800 fewer. So it looks like the data has been truncated, but that is not really the case.

The reason for this is that the original CSV file contains 32,800 blank lines. These blank lines parse to an empty record, e.g.:
{code:xml}
<record>
  <item></item>
</record>
{code}
So the number of records in the infoset does match the number of lines in the original file. But on unparse these empty records are essentially discarded, so the unparsed data has no empty lines and is exactly 32,800 lines shorter. If we remove the blank lines from the original CSV file, then it does round-trip exactly.
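
For anyone who wants to verify the counts, here is a quick sketch (plain JDK, no Daffodil dependency) that counts the total and blank lines in two files; the file names are placeholders for the gunzipped attachment and the unparsed output. On this data, the blank-line count of the original should equal the difference between the two totals, i.e. 32,800.
{code:java}
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class BlankLineCheck {
    public static void main(String[] args) throws IOException {
        long[] orig = count("csv_data800m.csv"); // original attachment, gunzipped
        long[] unp = count("unparsed.csv");      // output of the unparse step
        System.out.println("original: " + orig[0] + " lines, " + orig[1] + " blank");
        System.out.println("unparsed: " + unp[0] + " lines, " + unp[1] + " blank");
        // Expectation from this issue: orig total - unparsed total == orig blank (32,800).
    }

    // Returns { total lines, blank lines } without loading the whole file into memory.
    static long[] count(String path) throws IOException {
        long total = 0;
        long blank = 0;
        // ISO-8859-1 accepts any byte sequence, so the count works regardless of encoding.
        try (BufferedReader r = Files.newBufferedReader(Paths.get(path), StandardCharsets.ISO_8859_1)) {
            String line;
            while ((line = r.readLine()) != null) {
                total++;
                if (line.isEmpty()) {
                    blank++;
                }
            }
        }
        return new long[] { total, blank };
    }
}
{code}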

> Unparsing an infoset for an 800MB CSV file runs out of memory
> -------------------------------------------------------------
>
>                 Key: DAFFODIL-2468
>                 URL: https://issues.apache.org/jira/browse/DAFFODIL-2468
>             Project: Daffodil
>          Issue Type: Bug
>          Components: Back End
>    Affects Versions: 3.1.0
>            Reporter: Dave Thompson
>            Assignee: Steve Lawrence
>            Priority: Major
>             Fix For: 3.1.0
>
>         Attachments: csv_data800m.csv.gz
>
>
> While verifying DAFFODIL-2455 (Large CSV file causes "Attempting to backtrack too far" exception), I found that unparsing the infoset produced by successfully parsing the 800MB CSV file ran out of memory.
> I increased the DAFFODIL_JAVA_OPTS memory setting several times, up to 32GB, and tried unparsing the infoset, running out of memory each time. This was run on a test platform with 90+GB of memory.
> Parsing and unparsing used the schema from the dfdl-schemas/dfdl-csv repo.
> The 800MB CSV file (csv_data800m.csv) is attached gzipped.
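
For reference, a minimal sketch of the parse-then-unparse sequence described above, assuming the Daffodil Java API (org.apache.daffodil.japi). The file names are placeholders, and the JDOM infoset classes are used only to keep the sketch short, so treat this as an illustration of the API shape rather than the exact reproduction steps (the report above presumably used the CLI with DAFFODIL_JAVA_OPTS).
{code:java}
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.nio.channels.Channels;

import org.apache.daffodil.japi.Compiler;
import org.apache.daffodil.japi.Daffodil;
import org.apache.daffodil.japi.DataProcessor;
import org.apache.daffodil.japi.ParseResult;
import org.apache.daffodil.japi.ProcessorFactory;
import org.apache.daffodil.japi.UnparseResult;
import org.apache.daffodil.japi.infoset.JDOMInfosetInputter;
import org.apache.daffodil.japi.infoset.JDOMInfosetOutputter;
import org.apache.daffodil.japi.io.InputSourceDataInputStream;

public class CsvRoundTrip {
    public static void main(String[] args) throws Exception {
        // Compile the DFDL schema (file name is a placeholder for the dfdl-csv schema).
        Compiler c = Daffodil.compiler();
        ProcessorFactory pf = c.compileFile(new File("csv.dfdl.xsd"));
        if (pf.isError()) {
            pf.getDiagnostics().forEach(d -> System.err.println(d));
            return;
        }
        DataProcessor dp = pf.onPath("/");

        // Parse: CSV data -> infoset. For brevity the infoset is held in memory
        // as a JDOM document; for data this large a streaming infoset
        // outputter/inputter would normally be used instead.
        JDOMInfosetOutputter outputter = new JDOMInfosetOutputter();
        try (FileInputStream data = new FileInputStream("csv_data800m.csv")) {
            ParseResult pr = dp.parse(new InputSourceDataInputStream(data), outputter);
            if (pr.isError()) {
                pr.getDiagnostics().forEach(d -> System.err.println(d));
                return;
            }
        }

        // Unparse: infoset -> CSV data. This is the step reported to run out
        // of memory on the attached 800MB file.
        JDOMInfosetInputter inputter = new JDOMInfosetInputter(outputter.getResult());
        try (FileOutputStream out = new FileOutputStream("unparsed.csv")) {
            UnparseResult ur = dp.unparse(inputter, Channels.newChannel(out));
            if (ur.isError()) {
                ur.getDiagnostics().forEach(d -> System.err.println(d));
            }
        }
    }
}
{code}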



--
This message was sent by Atlassian Jira
(v8.3.4#803005)