You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@poi.apache.org by Oscar Bastidas <ob...@umn.edu.INVALID> on 2021/05/03 10:31:50 UTC

Reading Massive Excel Files to csv

Hello,

I am trying to read a large Excel spreadsheet (60,000 rows) but I get what
appears to be a memory leak error from the JVM when I use the *XSSFWorkbook
*API.  I learned recently that there are size limitations on Excel files
being read in this way and the Apache POI website specifically recommends
reading the file in a streaming fashion instead of taking the whole file in
memory.  To do this, POI recommends using something called *XLSX2CSV* but
the provided link to teach how to use this returns a "page not found error."

Would someone please point me in the direction of how to handle reading my
big Excel file?

The Apache POI URL that contains the link to *XLSX2CSV* is:

http://poi.apache.org/components/spreadsheet/limitations.html

Thanks for any help anyone can provide.

Oscar

Oscar Bastidas
Research Associate
University of Minnesota

Re: Reading Massive Excel Files to csv

Posted by Oscar Bastidas <ob...@umn.edu.INVALID>.
Sounds good.  Thanks Nick.

Oscar

Oscar Bastidas
Research Associate
University of Minnesota

On Mon, May 3, 2021, 1:05 PM Nick Burch <ap...@gagravarr.org> wrote:

> On Mon, 3 May 2021, Oscar Bastidas wrote:
> > Thanks Nick for the updated link.
>
> I've updated the link in the sources for the site, hopefully one of the
> other devs who has all of the tools installed can republish shortly
>
> > With your comment on moving to Gradle, does this mean there will be a
> time
> > when Apache POI will not be available on Maven?
>
> Nope! Will always be available on Maven central, for users of Maven,
> Gradle or Ivy.
>
> This only affects people trying to build Apache POI from source, and
> hopefully makes their lives easier than before when we used Ant.
>
> Nick
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
> For additional commands, e-mail: user-help@poi.apache.org
>
>

Re: Reading Massive Excel Files to csv

Posted by Nick Burch <ap...@gagravarr.org>.
On Mon, 3 May 2021, Oscar Bastidas wrote:
> Thanks Nick for the updated link.

I've updated the link in the sources for the site, hopefully one of the 
other devs who has all of the tools installed can republish shortly

> With your comment on moving to Gradle, does this mean there will be a time
> when Apache POI will not be available on Maven?

Nope! Will always be available on Maven central, for users of Maven, 
Gradle or Ivy.

This only affects people trying to build Apache POI from source, and 
hopefully makes their lives easier than before when we used Ant.

Nick

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: Reading Massive Excel Files to csv

Posted by Oscar Bastidas <ob...@umn.edu.INVALID>.
Thanks Nick for the updated link.

With your comment on moving to Gradle, does this mean there will be a time
when Apache POI will not be available on Maven?

Thanks again.

Oscar

Oscar Bastidas
Research Associate
University of Minnesota

On Mon, May 3, 2021, 6:17 AM Nick Burch <ap...@gagravarr.org> wrote:

> On Mon, 3 May 2021, Oscar Bastidas wrote:
> > I am trying to read a large Excel spreadsheet (60,000 rows) but I get
> > what appears to be a memory leak error from the JVM when I use the
> > *XSSFWorkbook *API.  I learned recently that there are size limitations
> > on Excel files being read in this way and the Apache POI website
> > specifically recommends reading the file in a streaming fashion instead
> > of taking the whole file in memory.  To do this, POI recommends using
> > something called *XLSX2CSV* but the provided link to teach how to use
> > this returns a "page not found error."
>
> That file has moved to
>
> https://svn.apache.org/repos/asf/poi/trunk/poi-examples/src/main/java/org/apache/poi/examples/xssf/eventusermodel/XLSX2CSV.java
>
> We have re-organised the svn area as part of our move to Gradle as our
> build system, and not all the links have been updated yet
>
> Nick
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
> For additional commands, e-mail: user-help@poi.apache.org
>
>

Re: Reading Massive Excel Files to csv

Posted by Andreas Reichel <an...@manticore-projects.com>.
On Mon, 2021-05-03 at 12:16 +0100, Nick Burch wrote:
> We have re-organised the svn area as part of our move to Gradle as
> our 
> build system

When I just started to move my stuff from ANT to Maven.
Well done, Nick and Team :)


Re: Reading Massive Excel Files to csv

Posted by Nick Burch <ap...@gagravarr.org>.
On Mon, 3 May 2021, Oscar Bastidas wrote:
> I am trying to read a large Excel spreadsheet (60,000 rows) but I get 
> what appears to be a memory leak error from the JVM when I use the 
> *XSSFWorkbook *API.  I learned recently that there are size limitations 
> on Excel files being read in this way and the Apache POI website 
> specifically recommends reading the file in a streaming fashion instead 
> of taking the whole file in memory.  To do this, POI recommends using 
> something called *XLSX2CSV* but the provided link to teach how to use 
> this returns a "page not found error."

That file has moved to
https://svn.apache.org/repos/asf/poi/trunk/poi-examples/src/main/java/org/apache/poi/examples/xssf/eventusermodel/XLSX2CSV.java

We have re-organised the svn area as part of our move to Gradle as our 
build system, and not all the links have been updated yet

Nick

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: Reading Massive Excel Files to csv

Posted by Oscar Bastidas <ob...@umn.edu.INVALID>.
Wonderful, I'll give that try then.  Thanks again Andreas.

Oscar

Oscar Bastidas
Research Associate
University of Minnesota

On Mon, May 3, 2021, 6:26 AM Andreas Reichel <an...@manticore-projects.com>
wrote:

> Oscar,
>
> unfortunately I am just an accountant and so my answer might not be
> technically accurate.
> In my understanding, you will read row by row and write the cells into
> a BufferedOutPutStream which is feeding into a FileOutPutStream.
>
> You could even put a ZIP/BZIP2 stream in between to compress the output
> on the fly.
>
> This won't take much memory (especially not in comparison to POI
> holding the whole XSSF Objects in Memory).
> 60k lines are nothing. I wold expects that -Xmx250MB would serve
> already, probably less.
>
> Cheers
> Andreas
>
>
> On Mon, 2021-05-03 at 06:19 -0500, Oscar Bastidas wrote:
> > Great.  Thanks Andreas, I'll check this one out too.
> >
> > I just started taking a quick look at the first Github you sent me
> > and I
> > didn't see anything explicit about writing the streamed data from the
> > Excel
> > workbook to a csv file.  Pardon my elementary question, but is it
> > possible
> > to write a big and single csv file using these streaming methods
> > where only
> > a handful of rows are kept in memory at a time?  Or would I need to
> > write a
> > bunch of mini csv files and then concatenate the file contents later
> > after
> > the Github code is finished?  I ask this because my only experience
> > with my
> > past work is where a file is read and held in memory at once and then
> > written to the file vs. the streaming technique.  Thanks again.
> >
> > Oscar
> >
> > Oscar Bastidas
> > Research Associate
> > University of Minnesota
> >
> > On Mon, May 3, 2021, 6:12 AM Andreas Reichel
> > <an...@manticore-projects.com>
> > wrote:
> >
> > > Oscar,
> > >
> > > there is also a fork of Excel Streamer which is worth to
> > > mention: https://github.com/pjfanning/excel-streaming-reader
> > > Personally I prefer that fork, but I have no good reason or
> > > argument.
> > > In the past this fork picked up version changes of POI faster.
> > >
> > > Cheers
> > > Andreas
> > >
> > >
> > >
> > > On Mon, 2021-05-03 at 06:05 -0500, Oscar Bastidas wrote:
> > > > Awesome.  Thanks, I'll give this a try.
> > > >
> > > > Oscar
> > > >
> > > > Oscar Bastidas
> > > > Research Associate
> > > > University of Minnesota
> > > >
> > > > On Mon, May 3, 2021, 6:04 AM Andreas Reichel
> > > > <an...@manticore-projects.com>
> > > > wrote:
> > > >
> > > > > Greetings.
> > > > >
> > > > > Please use the Excel Streaming Reader when reading large
> > > > > files: https://github.com/monitorjbl/excel-streaming-reader
> > > > >
> > > > > import com.monitorjbl.xlsx.StreamingReader;
> > > > >
> > > > > InputStream is = new FileInputStream(new
> > > > > File("/path/to/workbook.xlsx"));
> > > > > Workbook workbook = StreamingReader.builder()
> > > > >         .rowCacheSize(100)    // number of rows to keep in
> > > > > memory
> > > > > (defaults to 10)
> > > > >         .bufferSize(4096)     // buffer size to use when
> > > > > reading
> > > > > InputStream to file (defaults to 1024)
> > > > >         .open(is);            // InputStream or File for XLSX
> > > > > file
> > > > > (required)
> > > > >
> > > > >
> > > > >
> > > > > With the code above you can loop through your rows and write it
> > > > > to
> > > > > CSV.
> > > > > Best regards
> > > > > Andreas
> > > > >
> > > > >
> > > > > On Mon, 2021-05-03 at 05:31 -0500, Oscar Bastidas wrote:
> > > > > > Hello,
> > > > > >
> > > > > > I am trying to read a large Excel spreadsheet (60,000 rows)
> > > > > > but I
> > > > > > get
> > > > > > what
> > > > > > appears to be a memory leak error from the JVM when I use the
> > > > > > *XSSFWorkbook
> > > > > > *API.  I learned recently that there are size limitations on
> > > > > > Excel
> > > > > > files
> > > > > > being read in this way and the Apache POI website
> > > > > > specifically
> > > > > > recommends
> > > > > > reading the file in a streaming fashion instead of taking the
> > > > > > whole
> > > > > > file in
> > > > > > memory.  To do this, POI recommends using something called
> > > > > > *XLSX2CSV*
> > > > > > but
> > > > > > the provided link to teach how to use this returns a "page
> > > > > > not
> > > > > > found
> > > > > > error."
> > > > > >
> > > > > > Would someone please point me in the direction of how to
> > > > > > handle
> > > > > > reading my
> > > > > > big Excel file?
> > > > > >
> > > > > > The Apache POI URL that contains the link to *XLSX2CSV* is:
> > > > > >
> > > > > > http://poi.apache.org/components/spreadsheet/limitations.html
> > > > > >
> > > > > > Thanks for any help anyone can provide.
> > > > > >
> > > > > > Oscar
> > > > > >
> > > > > > Oscar Bastidas
> > > > > > Research Associate
> > > > > > University of Minnesota
> > > > >
> > > > >
> > >
> > >
>
>

Re: Reading Massive Excel Files to csv

Posted by Andreas Reichel <an...@manticore-projects.com>.
Oscar,

unfortunately I am just an accountant and so my answer might not be
technically accurate.
In my understanding, you will read row by row and write the cells into
a BufferedOutPutStream which is feeding into a FileOutPutStream.

You could even put a ZIP/BZIP2 stream in between to compress the output
on the fly.

This won't take much memory (especially not in comparison to POI
holding the whole XSSF Objects in Memory).
60k lines are nothing. I wold expects that -Xmx250MB would serve
already, probably less.

Cheers
Andreas


On Mon, 2021-05-03 at 06:19 -0500, Oscar Bastidas wrote:
> Great.  Thanks Andreas, I'll check this one out too.
> 
> I just started taking a quick look at the first Github you sent me
> and I
> didn't see anything explicit about writing the streamed data from the
> Excel
> workbook to a csv file.  Pardon my elementary question, but is it
> possible
> to write a big and single csv file using these streaming methods
> where only
> a handful of rows are kept in memory at a time?  Or would I need to
> write a
> bunch of mini csv files and then concatenate the file contents later
> after
> the Github code is finished?  I ask this because my only experience
> with my
> past work is where a file is read and held in memory at once and then
> written to the file vs. the streaming technique.  Thanks again.
> 
> Oscar
> 
> Oscar Bastidas
> Research Associate
> University of Minnesota
> 
> On Mon, May 3, 2021, 6:12 AM Andreas Reichel
> <an...@manticore-projects.com>
> wrote:
> 
> > Oscar,
> > 
> > there is also a fork of Excel Streamer which is worth to
> > mention: https://github.com/pjfanning/excel-streaming-reader
> > Personally I prefer that fork, but I have no good reason or
> > argument.
> > In the past this fork picked up version changes of POI faster.
> > 
> > Cheers
> > Andreas
> > 
> > 
> > 
> > On Mon, 2021-05-03 at 06:05 -0500, Oscar Bastidas wrote:
> > > Awesome.  Thanks, I'll give this a try.
> > > 
> > > Oscar
> > > 
> > > Oscar Bastidas
> > > Research Associate
> > > University of Minnesota
> > > 
> > > On Mon, May 3, 2021, 6:04 AM Andreas Reichel
> > > <an...@manticore-projects.com>
> > > wrote:
> > > 
> > > > Greetings.
> > > > 
> > > > Please use the Excel Streaming Reader when reading large
> > > > files: https://github.com/monitorjbl/excel-streaming-reader
> > > > 
> > > > import com.monitorjbl.xlsx.StreamingReader;
> > > > 
> > > > InputStream is = new FileInputStream(new
> > > > File("/path/to/workbook.xlsx"));
> > > > Workbook workbook = StreamingReader.builder()
> > > >         .rowCacheSize(100)    // number of rows to keep in
> > > > memory
> > > > (defaults to 10)
> > > >         .bufferSize(4096)     // buffer size to use when
> > > > reading
> > > > InputStream to file (defaults to 1024)
> > > >         .open(is);            // InputStream or File for XLSX
> > > > file
> > > > (required)
> > > > 
> > > > 
> > > > 
> > > > With the code above you can loop through your rows and write it
> > > > to
> > > > CSV.
> > > > Best regards
> > > > Andreas
> > > > 
> > > > 
> > > > On Mon, 2021-05-03 at 05:31 -0500, Oscar Bastidas wrote:
> > > > > Hello,
> > > > > 
> > > > > I am trying to read a large Excel spreadsheet (60,000 rows)
> > > > > but I
> > > > > get
> > > > > what
> > > > > appears to be a memory leak error from the JVM when I use the
> > > > > *XSSFWorkbook
> > > > > *API.  I learned recently that there are size limitations on
> > > > > Excel
> > > > > files
> > > > > being read in this way and the Apache POI website
> > > > > specifically
> > > > > recommends
> > > > > reading the file in a streaming fashion instead of taking the
> > > > > whole
> > > > > file in
> > > > > memory.  To do this, POI recommends using something called
> > > > > *XLSX2CSV*
> > > > > but
> > > > > the provided link to teach how to use this returns a "page
> > > > > not
> > > > > found
> > > > > error."
> > > > > 
> > > > > Would someone please point me in the direction of how to
> > > > > handle
> > > > > reading my
> > > > > big Excel file?
> > > > > 
> > > > > The Apache POI URL that contains the link to *XLSX2CSV* is:
> > > > > 
> > > > > http://poi.apache.org/components/spreadsheet/limitations.html
> > > > > 
> > > > > Thanks for any help anyone can provide.
> > > > > 
> > > > > Oscar
> > > > > 
> > > > > Oscar Bastidas
> > > > > Research Associate
> > > > > University of Minnesota
> > > > 
> > > > 
> > 
> > 


Re: Reading Massive Excel Files to csv

Posted by Oscar Bastidas <ob...@umn.edu.INVALID>.
Great.  Thanks Andreas, I'll check this one out too.

I just started taking a quick look at the first Github you sent me and I
didn't see anything explicit about writing the streamed data from the Excel
workbook to a csv file.  Pardon my elementary question, but is it possible
to write a big and single csv file using these streaming methods where only
a handful of rows are ke?  My only experience with my past work is where a
file is read and held in memory and then written to the file vs. the
streaming technique.  Thanks again.

Oscar

Oscar Bastidas
Research Associate
University of Minnesota

On Mon, May 3, 2021, 6:19 AM Oscar Bastidas <ob...@umn.edu> wrote:

> Great.  Thanks Andreas, I'll check this one out too.
>
> I just started taking a quick look at the first Github you sent me and I
> didn't see anything explicit about writing the streamed data from the Excel
> workbook to a csv file.  Pardon my elementary question, but is it possible
> to write a big and single csv file using these streaming methods where only
> a handful of rows are kept in memory at a time?  Or would I need to write a
> bunch of mini csv files and then concatenate the file contents later after
> the Github code is finished?  I ask this because my only experience with my
> past work is where a file is read and held in memory at once and then
> written to the file vs. the streaming technique.  Thanks again.
>
> Oscar
>
> Oscar Bastidas
> Research Associate
> University of Minnesota
>
> On Mon, May 3, 2021, 6:12 AM Andreas Reichel <
> andreas@manticore-projects.com> wrote:
>
>> Oscar,
>>
>> there is also a fork of Excel Streamer which is worth to
>> mention: https://github.com/pjfanning/excel-streaming-reader
>> Personally I prefer that fork, but I have no good reason or argument.
>> In the past this fork picked up version changes of POI faster.
>>
>> Cheers
>> Andreas
>>
>>
>>
>> On Mon, 2021-05-03 at 06:05 -0500, Oscar Bastidas wrote:
>> > Awesome.  Thanks, I'll give this a try.
>> >
>> > Oscar
>> >
>> > Oscar Bastidas
>> > Research Associate
>> > University of Minnesota
>> >
>> > On Mon, May 3, 2021, 6:04 AM Andreas Reichel
>> > <an...@manticore-projects.com>
>> > wrote:
>> >
>> > > Greetings.
>> > >
>> > > Please use the Excel Streaming Reader when reading large
>> > > files: https://github.com/monitorjbl/excel-streaming-reader
>> > >
>> > > import com.monitorjbl.xlsx.StreamingReader;
>> > >
>> > > InputStream is = new FileInputStream(new
>> > > File("/path/to/workbook.xlsx"));
>> > > Workbook workbook = StreamingReader.builder()
>> > >         .rowCacheSize(100)    // number of rows to keep in memory
>> > > (defaults to 10)
>> > >         .bufferSize(4096)     // buffer size to use when reading
>> > > InputStream to file (defaults to 1024)
>> > >         .open(is);            // InputStream or File for XLSX file
>> > > (required)
>> > >
>> > >
>> > >
>> > > With the code above you can loop through your rows and write it to
>> > > CSV.
>> > > Best regards
>> > > Andreas
>> > >
>> > >
>> > > On Mon, 2021-05-03 at 05:31 -0500, Oscar Bastidas wrote:
>> > > > Hello,
>> > > >
>> > > > I am trying to read a large Excel spreadsheet (60,000 rows) but I
>> > > > get
>> > > > what
>> > > > appears to be a memory leak error from the JVM when I use the
>> > > > *XSSFWorkbook
>> > > > *API.  I learned recently that there are size limitations on
>> > > > Excel
>> > > > files
>> > > > being read in this way and the Apache POI website specifically
>> > > > recommends
>> > > > reading the file in a streaming fashion instead of taking the
>> > > > whole
>> > > > file in
>> > > > memory.  To do this, POI recommends using something called
>> > > > *XLSX2CSV*
>> > > > but
>> > > > the provided link to teach how to use this returns a "page not
>> > > > found
>> > > > error."
>> > > >
>> > > > Would someone please point me in the direction of how to handle
>> > > > reading my
>> > > > big Excel file?
>> > > >
>> > > > The Apache POI URL that contains the link to *XLSX2CSV* is:
>> > > >
>> > > > http://poi.apache.org/components/spreadsheet/limitations.html
>> > > >
>> > > > Thanks for any help anyone can provide.
>> > > >
>> > > > Oscar
>> > > >
>> > > > Oscar Bastidas
>> > > > Research Associate
>> > > > University of Minnesota
>> > >
>> > >
>>
>>

Re: Reading Massive Excel Files to csv

Posted by Oscar Bastidas <ob...@umn.edu.INVALID>.
Great.  Thanks Andreas, I'll check this one out too.

I just started taking a quick look at the first Github you sent me and I
didn't see anything explicit about writing the streamed data from the Excel
workbook to a csv file.  Pardon my elementary question, but is it possible
to write a big and single csv file using these streaming methods where only
a handful of rows are kept in memory at a time?  Or would I need to write a
bunch of mini csv files and then concatenate the file contents later after
the Github code is finished?  I ask this because my only experience with my
past work is where a file is read and held in memory at once and then
written to the file vs. the streaming technique.  Thanks again.

Oscar

Oscar Bastidas
Research Associate
University of Minnesota

On Mon, May 3, 2021, 6:12 AM Andreas Reichel <an...@manticore-projects.com>
wrote:

> Oscar,
>
> there is also a fork of Excel Streamer which is worth to
> mention: https://github.com/pjfanning/excel-streaming-reader
> Personally I prefer that fork, but I have no good reason or argument.
> In the past this fork picked up version changes of POI faster.
>
> Cheers
> Andreas
>
>
>
> On Mon, 2021-05-03 at 06:05 -0500, Oscar Bastidas wrote:
> > Awesome.  Thanks, I'll give this a try.
> >
> > Oscar
> >
> > Oscar Bastidas
> > Research Associate
> > University of Minnesota
> >
> > On Mon, May 3, 2021, 6:04 AM Andreas Reichel
> > <an...@manticore-projects.com>
> > wrote:
> >
> > > Greetings.
> > >
> > > Please use the Excel Streaming Reader when reading large
> > > files: https://github.com/monitorjbl/excel-streaming-reader
> > >
> > > import com.monitorjbl.xlsx.StreamingReader;
> > >
> > > InputStream is = new FileInputStream(new
> > > File("/path/to/workbook.xlsx"));
> > > Workbook workbook = StreamingReader.builder()
> > >         .rowCacheSize(100)    // number of rows to keep in memory
> > > (defaults to 10)
> > >         .bufferSize(4096)     // buffer size to use when reading
> > > InputStream to file (defaults to 1024)
> > >         .open(is);            // InputStream or File for XLSX file
> > > (required)
> > >
> > >
> > >
> > > With the code above you can loop through your rows and write it to
> > > CSV.
> > > Best regards
> > > Andreas
> > >
> > >
> > > On Mon, 2021-05-03 at 05:31 -0500, Oscar Bastidas wrote:
> > > > Hello,
> > > >
> > > > I am trying to read a large Excel spreadsheet (60,000 rows) but I
> > > > get
> > > > what
> > > > appears to be a memory leak error from the JVM when I use the
> > > > *XSSFWorkbook
> > > > *API.  I learned recently that there are size limitations on
> > > > Excel
> > > > files
> > > > being read in this way and the Apache POI website specifically
> > > > recommends
> > > > reading the file in a streaming fashion instead of taking the
> > > > whole
> > > > file in
> > > > memory.  To do this, POI recommends using something called
> > > > *XLSX2CSV*
> > > > but
> > > > the provided link to teach how to use this returns a "page not
> > > > found
> > > > error."
> > > >
> > > > Would someone please point me in the direction of how to handle
> > > > reading my
> > > > big Excel file?
> > > >
> > > > The Apache POI URL that contains the link to *XLSX2CSV* is:
> > > >
> > > > http://poi.apache.org/components/spreadsheet/limitations.html
> > > >
> > > > Thanks for any help anyone can provide.
> > > >
> > > > Oscar
> > > >
> > > > Oscar Bastidas
> > > > Research Associate
> > > > University of Minnesota
> > >
> > >
>
>

Re: Reading Massive Excel Files to csv

Posted by Andreas Reichel <an...@manticore-projects.com>.
Oscar,

there is also a fork of Excel Streamer which is worth to
mention: https://github.com/pjfanning/excel-streaming-reader
Personally I prefer that fork, but I have no good reason or argument.
In the past this fork picked up version changes of POI faster. 

Cheers
Andreas



On Mon, 2021-05-03 at 06:05 -0500, Oscar Bastidas wrote:
> Awesome.  Thanks, I'll give this a try.
> 
> Oscar
> 
> Oscar Bastidas
> Research Associate
> University of Minnesota
> 
> On Mon, May 3, 2021, 6:04 AM Andreas Reichel
> <an...@manticore-projects.com>
> wrote:
> 
> > Greetings.
> > 
> > Please use the Excel Streaming Reader when reading large
> > files: https://github.com/monitorjbl/excel-streaming-reader
> > 
> > import com.monitorjbl.xlsx.StreamingReader;
> > 
> > InputStream is = new FileInputStream(new
> > File("/path/to/workbook.xlsx"));
> > Workbook workbook = StreamingReader.builder()
> >         .rowCacheSize(100)    // number of rows to keep in memory
> > (defaults to 10)
> >         .bufferSize(4096)     // buffer size to use when reading
> > InputStream to file (defaults to 1024)
> >         .open(is);            // InputStream or File for XLSX file
> > (required)
> > 
> > 
> > 
> > With the code above you can loop through your rows and write it to
> > CSV.
> > Best regards
> > Andreas
> > 
> > 
> > On Mon, 2021-05-03 at 05:31 -0500, Oscar Bastidas wrote:
> > > Hello,
> > > 
> > > I am trying to read a large Excel spreadsheet (60,000 rows) but I
> > > get
> > > what
> > > appears to be a memory leak error from the JVM when I use the
> > > *XSSFWorkbook
> > > *API.  I learned recently that there are size limitations on
> > > Excel
> > > files
> > > being read in this way and the Apache POI website specifically
> > > recommends
> > > reading the file in a streaming fashion instead of taking the
> > > whole
> > > file in
> > > memory.  To do this, POI recommends using something called
> > > *XLSX2CSV*
> > > but
> > > the provided link to teach how to use this returns a "page not
> > > found
> > > error."
> > > 
> > > Would someone please point me in the direction of how to handle
> > > reading my
> > > big Excel file?
> > > 
> > > The Apache POI URL that contains the link to *XLSX2CSV* is:
> > > 
> > > http://poi.apache.org/components/spreadsheet/limitations.html
> > > 
> > > Thanks for any help anyone can provide.
> > > 
> > > Oscar
> > > 
> > > Oscar Bastidas
> > > Research Associate
> > > University of Minnesota
> > 
> > 


Re: Reading Massive Excel Files to csv

Posted by Oscar Bastidas <ob...@umn.edu.INVALID>.
Awesome.  Thanks, I'll give this a try.

Oscar

Oscar Bastidas
Research Associate
University of Minnesota

On Mon, May 3, 2021, 6:04 AM Andreas Reichel <an...@manticore-projects.com>
wrote:

> Greetings.
>
> Please use the Excel Streaming Reader when reading large
> files: https://github.com/monitorjbl/excel-streaming-reader
>
> import com.monitorjbl.xlsx.StreamingReader;
>
> InputStream is = new FileInputStream(new File("/path/to/workbook.xlsx"));
> Workbook workbook = StreamingReader.builder()
>         .rowCacheSize(100)    // number of rows to keep in memory
> (defaults to 10)
>         .bufferSize(4096)     // buffer size to use when reading
> InputStream to file (defaults to 1024)
>         .open(is);            // InputStream or File for XLSX file
> (required)
>
>
>
> With the code above you can loop through your rows and write it to CSV.
> Best regards
> Andreas
>
>
> On Mon, 2021-05-03 at 05:31 -0500, Oscar Bastidas wrote:
> > Hello,
> >
> > I am trying to read a large Excel spreadsheet (60,000 rows) but I get
> > what
> > appears to be a memory leak error from the JVM when I use the
> > *XSSFWorkbook
> > *API.  I learned recently that there are size limitations on Excel
> > files
> > being read in this way and the Apache POI website specifically
> > recommends
> > reading the file in a streaming fashion instead of taking the whole
> > file in
> > memory.  To do this, POI recommends using something called *XLSX2CSV*
> > but
> > the provided link to teach how to use this returns a "page not found
> > error."
> >
> > Would someone please point me in the direction of how to handle
> > reading my
> > big Excel file?
> >
> > The Apache POI URL that contains the link to *XLSX2CSV* is:
> >
> > http://poi.apache.org/components/spreadsheet/limitations.html
> >
> > Thanks for any help anyone can provide.
> >
> > Oscar
> >
> > Oscar Bastidas
> > Research Associate
> > University of Minnesota
>
>

Re: Reading Massive Excel Files to csv

Posted by Andreas Reichel <an...@manticore-projects.com>.
Greetings.

Please use the Excel Streaming Reader when reading large
files: https://github.com/monitorjbl/excel-streaming-reader

import com.monitorjbl.xlsx.StreamingReader;

InputStream is = new FileInputStream(new File("/path/to/workbook.xlsx"));
Workbook workbook = StreamingReader.builder()
        .rowCacheSize(100)    // number of rows to keep in memory (defaults to 10)
        .bufferSize(4096)     // buffer size to use when reading InputStream to file (defaults to 1024)
        .open(is);            // InputStream or File for XLSX file (required)



With the code above you can loop through your rows and write it to CSV.
Best regards
Andreas


On Mon, 2021-05-03 at 05:31 -0500, Oscar Bastidas wrote:
> Hello,
> 
> I am trying to read a large Excel spreadsheet (60,000 rows) but I get
> what
> appears to be a memory leak error from the JVM when I use the
> *XSSFWorkbook
> *API.  I learned recently that there are size limitations on Excel
> files
> being read in this way and the Apache POI website specifically
> recommends
> reading the file in a streaming fashion instead of taking the whole
> file in
> memory.  To do this, POI recommends using something called *XLSX2CSV*
> but
> the provided link to teach how to use this returns a "page not found
> error."
> 
> Would someone please point me in the direction of how to handle
> reading my
> big Excel file?
> 
> The Apache POI URL that contains the link to *XLSX2CSV* is:
> 
> http://poi.apache.org/components/spreadsheet/limitations.html
> 
> Thanks for any help anyone can provide.
> 
> Oscar
> 
> Oscar Bastidas
> Research Associate
> University of Minnesota