Posted to issues@commons.apache.org by "Holger Stratmann (JIRA)" <ji...@apache.org> on 2014/09/17 15:04:34 UTC

[jira] [Commented] (CSV-131) Save positions of records to enable random access

    [ https://issues.apache.org/jira/browse/CSV-131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14137184#comment-14137184 ] 

Holger Stratmann commented on CSV-131:
--------------------------------------

{quote}Parse this new CSV data but start counting characters as X and start counting records at Y{quote}
Yes, that is exactly the point.
{quote}Why not just say, skip to record R or skip to char position P?{quote}
Because you cannot skip to char position P (and much less to record R) without reading the entire stream (up to that position/record) - which is exactly what I am trying to avoid. Just as in the test case, I want to start reading at some position in the middle. Actually, setting the record number and character position is purely cosmetic: I want the returned records to be identical to the ones I read when reading the full stream...
I agree that the setters are not really nice. Calling them only makes sense before you start reading (i.e. directly after calling the constructor). I made setters because I wanted to make minimal changes. The positions might make more sense as additional parameters to the constructor ("Here is a reader and some information about it"). I just didn't want to make additional versions of each constructor, but when I take another look at it now, it would probably only really concern the one that takes a reader.
So we could make a constructor
{code}public CSVParser(final Reader reader, final CSVFormat format, final int currentPosition, final int nextRecordNumber) throws IOException {code}
and remove the setters (and have the current constructor just call {{this(reader, format, 0, 1)}}).
If you like this idea better, I can submit a new patch or you can modify it, whichever you prefer.
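
To illustrate the constructor idea, here is a minimal, self-contained sketch. It is not the Commons CSV API and not the attached patch: the class {{OffsetTrackingParser}} and its {{next()}} method are hypothetical, it treats every '\n'-terminated line as one record (no quoting, no multi-byte handling), and it uses long instead of int for the two counters on the assumption that sources may be very large. It only shows the delegation pattern and that a reader positioned mid-stream yields the same record number and position as a full read.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

/** Hypothetical stand-in for a parser; NOT the real CSVParser. */
class OffsetTrackingParser {
    private final BufferedReader in;
    private long position;     // character offset of the next record in the original source
    private long recordNumber; // number that will be assigned to the next record

    /** Existing-style constructor: start of the stream, first record is number 1. */
    OffsetTrackingParser(final Reader reader) {
        this(reader, 0, 1);
    }

    /** Proposed-style constructor: the reader is already positioned mid-stream. */
    OffsetTrackingParser(final Reader reader, final long currentPosition, final long nextRecordNumber) {
        this.in = new BufferedReader(reader);
        this.position = currentPosition;
        this.recordNumber = nextRecordNumber;
    }

    /**
     * Returns "recordNumber@position:line", or null at end of input.
     * Assumes '\n' line endings and no newlines inside fields.
     */
    String next() {
        final String line;
        try {
            line = in.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (line == null) {
            return null;
        }
        final String result = recordNumber + "@" + position + ":" + line;
        position += line.length() + 1; // +1 for the '\n' terminator
        recordNumber++;
        return result;
    }
}
```

With the input "a,b\nc,d\ne,f\n", a parser built over a reader that starts at character 4 and constructed with (reader, 4, 2) reports its first record with the same number and position as the second record of a full read, which is the "identical records" property described above.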

> Save positions of records to enable random access
> -------------------------------------------------
>
>                 Key: CSV-131
>                 URL: https://issues.apache.org/jira/browse/CSV-131
>             Project: Commons CSV
>          Issue Type: Improvement
>          Components: Parser
>    Affects Versions: 1.1
>            Reporter: Holger Stratmann
>            Priority: Minor
>         Attachments: PositionTrackingFull_v101_20140910.patch, PositionTrackingTest_20140907.patch, PositionTracking_20140907.patch, ggregory-CSV-131-parser-and-record.diff
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> It would be good to have {{CSVRecord}} save its position in the source stream.
> Reason: Knowing the position of the records would enable random access to retrieve records from the source (after reading it once to build an index) if the file is too large to be read into memory (or if we don't want to read the full file to access a record in the middle).
> Additional info: I have created a "random access csv reader" and a "csv viewer" (Swing) for arbitrarily large CSV files. It requires one additional scan of the file to build an index (multi-byte charsets supported). The index can be saved to a file so it only needs to be built once. Because the lexer uses a BufferedReader, we need "internal information" to know where each record starts.
> The change to "core" is minor: one field in {{CSVRecord}}s and some associated methods to store the position.
> Patch will be attached.
> Code for random access (both UI and non-UI) will be proposed (and possibly submitted) as a separate issue. It could also be an independent add-on but requires this one little change to Commons CSV.


