Posted to user@poi.apache.org by Matthew Reeves <mr...@Datranmedia.com> on 2011/08/08 20:20:06 UTC

seek to row number - HSSF eventmodel stream reading

I have a use case which involves reading a single large Excel file and writing it out as multiple smaller plain text files.  I'm using the event model to achieve a low memory footprint.  What I would like to be able to do is save progress.  If I've broken my single large file into 1,000-row chunks and have recorded that I've worked through 100 chunks, I would like to be able to start reading at row number 100,000.

As far as I can tell, there isn't a way to do something like seek(100,000) using the HSSFEventFactory.  Would it be reasonable to assume the member variable _unreadRecordIndex in RecordFactoryInputStream contains both the number I'm looking for (how many records I've processed/read) and the number I could set if I wanted to start reading at, say, row number 100,000? (Obviously I would need to make code changes to make this work.)

Or is there some other way to save progress and/or start reading at a specified row number?

matt

Re: seek to row number - HSSF eventmodel stream reading

Posted by Nick Burch <ni...@alfresco.com>.
On Mon, 8 Aug 2011, Matthew Reeves wrote:
> I have a use case which supports reading a single large excel file and 
> writing into multiple smaller plain text files.  I'm using the event 
> model to achieve a low memory foot print.  What I would like to be able 
> to do is save progress.  If I've broken my single large file into 1000 
> row chunks and have saved off that I've worked through 100 chunks, I 
> would like to be able to start reading at row number 100,000.

There's no way to do this, sorry. What you'll need to do is track when you 
hit a new row, record the row number of that, and flush your data out. 
When you start again, skip records until you hit that row again, and away 
you go.
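
As a rough illustration of that bookkeeping, here is a minimal sketch in 
plain Java. It is not POI code: RowCheckpointTracker, onCell and 
onEndOfSheet are hypothetical names made up for this example. The idea is 
that your HSSFListener would pass in the row index of each cell record it 
sees (for cell records, CellValueRecordInterface.getRow() gives you that), 
and the tracker tells you whether to process or skip it. The sketch assumes 
cell records arrive in ascending row order within a sheet, and it counts 
only rows that actually contain cells.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntConsumer;

/** Hypothetical helper, not part of POI: tracks rows for checkpoint and resume. */
final class RowCheckpointTracker {
    private final int lastDoneRow;          // last row finished in an earlier run, -1 if none
    private final int chunkSize;            // number of non-empty rows per output chunk
    private final IntConsumer onChunkComplete; // receives the row number to save as the checkpoint
    private int currentRow = -1;            // row of the most recent accepted cell
    private int rowsInChunk = 0;            // completed rows in the chunk being built

    RowCheckpointTracker(int lastDoneRow, int chunkSize, IntConsumer onChunkComplete) {
        this.lastDoneRow = lastDoneRow;
        this.chunkSize = chunkSize;
        this.onChunkComplete = onChunkComplete;
    }

    /** Call once per cell event. Returns false for rows handled in an earlier run. */
    boolean onCell(int row) {
        if (row <= lastDoneRow) {
            return false;                   // still skipping forward to the resume point
        }
        if (row != currentRow) {
            closeCurrentRow();              // a new row has started, so the previous one is done
            currentRow = row;
        }
        return true;
    }

    /** Call at the end of the sheet so the final, possibly short, chunk is flushed. */
    void onEndOfSheet() {
        closeCurrentRow();
        if (rowsInChunk > 0) {
            onChunkComplete.accept(currentRow);
            rowsInChunk = 0;
        }
        currentRow = -1;
    }

    private void closeCurrentRow() {
        if (currentRow < 0) {
            return;                         // nothing accepted yet
        }
        rowsInChunk++;
        if (rowsInChunk == chunkSize) {
            onChunkComplete.accept(currentRow); // flush output and persist this row number
            rowsInChunk = 0;
        }
    }
}

final class CheckpointDemo {
    public static void main(String[] args) {
        List<Integer> checkpoints = new ArrayList<>();
        // Resume after row 1, with two rows per chunk.
        RowCheckpointTracker tracker = new RowCheckpointTracker(1, 2, checkpoints::add);
        for (int row : new int[] {0, 0, 1, 2, 2, 3, 5, 6}) {
            System.out.println("row " + row + (tracker.onCell(row) ? " processed" : " skipped"));
        }
        tracker.onEndOfSheet();
        System.out.println("checkpoints: " + checkpoints);
    }
}
```

The checkpoint is only ever a row number at a row boundary, so a restart 
never lands in the middle of a row. The cost is that a resumed run still 
reads through the earlier records; it just discards them cheaply.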

> Would it be reasonable to assume the memory variable, 
> _unreadRecordIndex, in RecordFactoryInputStream contains both the number 
> I'm looking for (how many records I've processed/read) and the number I 
> could set if I wanted to start reading at, say row number, 100,000? 
> (Obviously I would need to make code changes to make this work)

I'd advise against trying to do it at a raw record level. Because of 
Continue records etc, you might find yourself trying to resume at a point 
that doesn't really make sense. Instead, I'd suggest you just track the 
last seen row number, and use that.

You might also want to look at the MissingRecordAware code 
(MissingRecordAwareHSSFListener). One option is to use that, so you can be 
sure to always hit the right number of rows in a chunk, no matter how many 
of them are blank. The other is just to review the code to see how best to 
do the row tracking.

Nick
