You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@poi.apache.org by Enrique <en...@gmail.com> on 2014/03/10 11:48:40 UTC
XSSF - detecting end of file when reading XLSX files
Hello,
I'm working on making extraction of some excel files (very large files),
using XSSF + SAX.
I've used some tests files, specifically one that contained thousands of rows
of data. I only wanted to keep few of them, so I deleted all the content from
the 20th row. When I process this excel file, I see that since the 21th row,
every row I processed, with empty results obviously.
I would rather expect to have this process finished. Is there anyway to say
to the XSSF processor not to enter this empty rows?
I suppose there is a relation with the fact that the file contained more data
in this rows before, but this is a real case that our end users could do.
Thanks for your help.
Kind regards,
Enrique
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org
Re: XSSF - detecting end of file when reading XLSX files
Posted by Enrique <en...@gmail.com>.
Darren Roberts <robertsdjguard-poi <at> yahoo.com> writes:
>
> Yes you can do that, I do it myself by in my SAX processors overidden
DefaultHandler.startElement method,
> simply detect the row element, parse the attribute to get the row number
(or
increment an internal counter
> if you prefer) and then do a check to see whether you've reached or passed
the row
you wish to stop on. In my
> case at this point I throw a custom exception (SAXRowReachedException)
which I
explicitly handle in the
> calling code.
>
> I know that using exceptions for flow control is bad form, but I feel it
is
technically ok as by that point I
> should have found the required headers and if I've not then that's an
exceptional
event. Even IBM use the
> same method here:
>
> http://www.ibm.com/developerworks/library/x-tipsaxstop/
>
> Refer to their code if my description isn't clear.
>
Thank you Darren for your suggestion, your description was very clear,
quite easy to throw a custom runtime exception, and to catch it at the
top of my processor.
I agree with you that exceptions is not the most elegant way to make flow
control, but there's some times where is preferable, when the steps you want
to exit are so many, or (like this case with SAX) when you use a third party
framework with callbacks and you don't want to modify the code of the
framework.
It works very well now, and the performance improvement is very clear!
Cheers
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org
Re: XSSF - detecting end of file when reading XLSX files
Posted by Darren Roberts <ro...@yahoo.com>.
>________________________________
> From: Enrique <en...@gmail.com>
>To: user@poi.apache.org
>Sent: Wednesday, March 12, 2014 2:23 PM
>Subject: Re: XSSF - detecting end of file when reading XLSX files
>
>
>Darren Roberts <robertsdjguard-poi <at> yahoo.com> writes:
>
>>
>> When you deleted the data from row 21 onwards did you just highlight the
>cells/data and press delete, or
>> select all the row numbers from 21 to the end, right click and select
>delete? There is a difference, the
>> former just clears the data from the cells but otherwise Excel considers
>the cells to have content
>> (retains styles etc). If you do the latter it actually deletes the rows
>entirely from its internal
>> representation.
>>
>> Have you opened up the xlsx file as a zip and had a look at the XML file
>for the sheet? Since you're still seeing
>> SAX processing those rows it's my guess that when you saved the file Excel
>thought they were still valid
>> content cells and wrote out empty cells for all of them. Try deleting the
>rows using the method I described,
>> and then seeing if SAX still processes them.
>>
>> >________________________________
>> > From: Enrique <enrique.rochina <at> gmail.com>
>> >To: user <at> poi.apache.org
>> >Sent: Monday, March 10, 2014 10:48 AM
>> >Subject: XSSF - detecting end of file when reading XLSX files
>> >
>> >
>> >Hello,
>> >
>> >I'm working on making extraction of some excel files (very large files),
>> >using XSSF + SAX.
>> >
>> >I've used some tests files, specifically one that contained thousands of
>rows
>> >of data. I only wanted to keep few of them, so I deleted all the content
>from
>> >the 20th row. When I process this excel file, I see that since the 21th
>row,
>> >every row I processed, with empty results obviously.
>> >
>> >I would rather expect to have this process finished. Is there anyway to
>say
>> >to the XSSF processor not to enter this empty rows?
>> >
>> >I suppose there is a relation with the fact that the file contained more
>data
>> >in this rows before, but this is a real case that our end users could do.
>> >
>> >Thanks for your help.
>> >
>> >Kind regards,
>> >Enrique
>> >
>> >
>> >---------------------------------------------------------------------
>> >To unsubscribe, e-mail: user-unsubscribe <at> poi.apache.org
>> >For additional commands, e-mail: user-help <at> poi.apache.org
>> >
>> >
>> >
>> >
>
>
>Hello Darren,
>
>I've tested what you said, and that's right. My previous test was based on a
>file I deleted the cells with the Delete key. After deleting the content
>property, it is not processed by SAX. Moreover, if I've verified that the
>content in the zip file has decreased (file \xl\worksheets\sheet3.xml, many
>less row elements)
>
>Now I have a second question that it may relate to SAX: is it possible to
>abort or terminate the processing promptly? I have a business case where I
>only need to extract the content on the 10 first rows, thus I don't need to
>spend time on processing the whole file.
>
>I've verified, with XSSF + Usermodel it takes too long and too much memory,
>only for opening the file. So, XSSF is still a better option
>
>Thanks
>
>
>
>Darren Roberts <robertsdjguard-poi <at> yahoo.com> writes:
>
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
>For additional commands, e-mail: user-help@poi.apache.org
>
Yes you can do that, I do it myself by in my SAX processors overidden DefaultHandler.startElement method, simply detect the row element, parse the attribute to get the row number (or increment an internal counter if you prefer) and then do a check to see whether you've reached or passed the row you wish to stop on. In my case at this point I throw a custom exception (SAXRowReachedException) which I explicitly handle in the calling code.
I know that using exceptions for flow control is bad form, but I feel it is technically ok as by that point I should have found the required headers and if I've not then that's an exceptional event. Even IBM use the same method here:
http://www.ibm.com/developerworks/library/x-tipsaxstop/
Refer to their code if my description isn't clear.
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org
Re: XSSF - detecting end of file when reading XLSX files
Posted by Enrique <en...@gmail.com>.
Darren Roberts <robertsdjguard-poi <at> yahoo.com> writes:
>
> When you deleted the data from row 21 onwards did you just highlight the
cells/data and press delete, or
> select all the row numbers from 21 to the end, right click and select
delete? There is a difference, the
> former just clears the data from the cells but otherwise Excel considers
the cells to have content
> (retains styles etc). If you do the latter it actually deletes the rows
entirely from its internal
> representation.
>
> Have you opened up the xlsx file as a zip and had a look at the XML file
for the sheet? Since you're still seeing
> SAX processing those rows it's my guess that when you saved the file Excel
thought they were still valid
> content cells and wrote out empty cells for all of them. Try deleting the
rows using the method I described,
> and then seeing if SAX still processes them.
>
> >________________________________
> > From: Enrique <enrique.rochina <at> gmail.com>
> >To: user <at> poi.apache.org
> >Sent: Monday, March 10, 2014 10:48 AM
> >Subject: XSSF - detecting end of file when reading XLSX files
> >
> >
> >Hello,
> >
> >I'm working on making extraction of some excel files (very large files),
> >using XSSF + SAX.
> >
> >I've used some tests files, specifically one that contained thousands of
rows
> >of data. I only wanted to keep few of them, so I deleted all the content
from
> >the 20th row. When I process this excel file, I see that since the 21th
row,
> >every row I processed, with empty results obviously.
> >
> >I would rather expect to have this process finished. Is there anyway to
say
> >to the XSSF processor not to enter this empty rows?
> >
> >I suppose there is a relation with the fact that the file contained more
data
> >in this rows before, but this is a real case that our end users could do.
> >
> >Thanks for your help.
> >
> >Kind regards,
> >Enrique
> >
> >
> >---------------------------------------------------------------------
> >To unsubscribe, e-mail: user-unsubscribe <at> poi.apache.org
> >For additional commands, e-mail: user-help <at> poi.apache.org
> >
> >
> >
> >
Hello Darren,
I've tested what you said, and that's right. My previous test was based on a
file I deleted the cells with the Delete key. After deleting the content
property, it is not processed by SAX. Moreover, if I've verified that the
content in the zip file has decreased (file \xl\worksheets\sheet3.xml, many
less row elements)
Now I have a second question that it may relate to SAX: is it possible to
abort or terminate the processing promptly? I have a business case where I
only need to extract the content on the 10 first rows, thus I don't need to
spend time on processing the whole file.
I've verified, with XSSF + Usermodel it takes too long and too much memory,
only for opening the file. So, XSSF is still a better option
Thanks
Darren Roberts <robertsdjguard-poi <at> yahoo.com> writes:
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org
Re: XSSF - detecting end of file when reading XLSX files
Posted by Darren Roberts <ro...@yahoo.com>.
When you deleted the data from row 21 onwards did you just highlight the cells/data and press delete, or select all the row numbers from 21 to the end, right click and select delete? There is a difference, the former just clears the data from the cells but otherwise Excel considers the cells to have content (retains styles etc). If you do the latter it actually deletes the rows entirely from its internal representation.
Have you opened up the xlsx file as a zip and had a look at the XML file for the sheet? Since you're still seeing SAX processing those rows it's my guess that when you saved the file Excel thought they were still valid content cells and wrote out empty cells for all of them. Try deleting the rows using the method I described, and then seeing if SAX still processes them.
>________________________________
> From: Enrique <en...@gmail.com>
>To: user@poi.apache.org
>Sent: Monday, March 10, 2014 10:48 AM
>Subject: XSSF - detecting end of file when reading XLSX files
>
>
>Hello,
>
>I'm working on making extraction of some excel files (very large files),
>using XSSF + SAX.
>
>I've used some tests files, specifically one that contained thousands of rows
>of data. I only wanted to keep few of them, so I deleted all the content from
>the 20th row. When I process this excel file, I see that since the 21th row,
>every row I processed, with empty results obviously.
>
>I would rather expect to have this process finished. Is there anyway to say
>to the XSSF processor not to enter this empty rows?
>
>I suppose there is a relation with the fact that the file contained more data
>in this rows before, but this is a real case that our end users could do.
>
>Thanks for your help.
>
>Kind regards,
>Enrique
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
>For additional commands, e-mail: user-help@poi.apache.org
>
>
>
>