Posted to user@poi.apache.org by "Leimbach, Johannes" <JL...@CONET.DE> on 2006/08/01 08:26:45 UTC

Re: Extract Text From Excel

Hello,

first of all: I do have a first name. It's "Johannes", and I prefer being called "Johannes" over "Leimbach". Thanks ;)

As for your problem:
Can you provide more information about it? What's in the cells and where do these errors come from?
As far as I know, HSSF is not able to read formulas or macros from Excel.

Bye,
Johannes

-----Original Message-----
From: Feris Thia [mailto:feris.apache@gmail.com]
Sent: Monday, 31 July 2006 18:01
To: POI Users List
Subject: Re: Extract Text From Excel

Hello Suba, Michael and Leimbach,

Thanks for the responses... they help me a lot. Special thanks to Leimbach: I
have used your wrapper and tested it with my application. It works great :)

But I get some warnings (attached below). Is it a limitation of HSSF that it
cannot read some Excel formats?

[java] [WARNING] Unknown Ptg 14 (20) at cell (5,2)
[java] [WARNING] Unknown Ptg 14 (20) at cell (6,2)
[java] [WARNING] Unknown Ptg 14 (20) at cell (16,2)
[java] [WARNING] Unknown Ptg 14 (20) at cell (5,1)
[java] [WARNING] Unknown Ptg 14 (20) at cell (6,1)
[java] [WARNING] Unknown Ptg 14 (20) at cell (6,4)
[java] [WARNING] Unknown Ptg 14 (20) at cell (24,1)
[java] [WARNING] Unknown Ptg 14 (20) at cell (24,4)
[java] [WARNING] Unknown Ptg 14 (20) at cell (25,1)
[java] [WARNING] Unknown Ptg 14 (20) at cell (25,4)
[java] [WARNING] Unknown Ptg 14 (20) at cell (26,1)
[java] [WARNING] Unknown Ptg 14 (20) at cell (26,4)
[java] [WARNING] Unknown Ptg 14 (20) at cell (27,1)
[java] [WARNING] Unknown Ptg 14 (20) at cell (27,4)
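
To see which cells are affected, warnings like the ones above can be collected
with a few lines of plain Java. This is only a rough sketch: the class
PtgWarningParser and its Warning type are hypothetical helpers, not part of
POI. It assumes the first number in the message is the token id in hexadecimal
and the number in parentheses is the same id in decimal (0x14 is 20, which
fits the output above), and that the two coordinates are printed as row and
column; neither assumption is confirmed in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of POI): collects the cells named in "Unknown Ptg" warnings. */
final class PtgWarningParser {

    // Matches e.g. "[java] [WARNING] Unknown Ptg 14 (20) at cell (5,2)".
    // Group 1: token id as printed first (ASSUMED hexadecimal).
    // Group 2: token id in parentheses (ASSUMED decimal).
    // Groups 3 and 4: the two cell coordinates, in the order printed.
    private static final Pattern WARNING = Pattern.compile(
            "Unknown Ptg ([0-9A-Fa-f]+) \\((\\d+)\\) at cell \\((\\d+),(\\d+)\\)");

    /** One parsed warning; coordinates are kept exactly as printed in the log. */
    static final class Warning {
        final int ptgId;   // token id (decimal value)
        final int first;   // first coordinate (ASSUMED to be the row)
        final int second;  // second coordinate (ASSUMED to be the column)

        Warning(int ptgId, int first, int second) {
            this.ptgId = ptgId;
            this.first = first;
            this.second = second;
        }
    }

    /** Returns one Warning per matching line; lines that do not match are skipped. */
    static List<Warning> parse(List<String> logLines) {
        List<Warning> result = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = WARNING.matcher(line);
            if (!m.find()) {
                continue;
            }
            int hexId = Integer.parseInt(m.group(1), 16);
            int decId = Integer.parseInt(m.group(2));
            if (hexId != decId) {
                continue; // line does not fit the assumed hex/decimal layout
            }
            result.add(new Warning(decId,
                    Integer.parseInt(m.group(3)),
                    Integer.parseInt(m.group(4))));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Warning> warnings = parse(Arrays.asList(
                "[java] [WARNING] Unknown Ptg 14 (20) at cell (5,2)",
                "[java] [WARNING] Unknown Ptg 14 (20) at cell (27,4)"));
        for (Warning w : warnings) {
            System.out.println("ptg=0x" + Integer.toHexString(w.ptgId)
                    + " cell=(" + w.first + "," + w.second + ")");
        }
    }
}
```

With the affected cells listed this way, it is easier to open the workbook in
Excel and compare what those cells have in common.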

And one more thing... so HSSF does not read the values of formulas?

Regards,

Feris

On 7/31/06, Leimbach, Johannes <JL...@conet.de> wrote:
>
> Hello,
>
> last week I wrote a wrapper class to facilitate text extraction from Excel
> files; please see the source code below.
> Maybe this (or another example, I don't mind which) should be posted on the
> POI homepage - I see very little beginner's documentation there.
>
> Anyway, here's the class; it should be self-explanatory:
>
> package fulltext.common.processing.helpers.poi;
>
> import java.io.FileInputStream;
> import java.io.IOException;
> import java.util.Iterator;
>
> import org.apache.poi.hssf.usermodel.HSSFCell;
> import org.apache.poi.hssf.usermodel.HSSFRow;
> import org.apache.poi.hssf.usermodel.HSSFSheet;
> import org.apache.poi.hssf.usermodel.HSSFWorkbook;
> import org.apache.poi.poifs.filesystem.POIFSFileSystem;
>
> /**
> * Wraps around the POI stuff to read an Excel (XLS) file from disk
> */
> public class ExcelFileWrapper
> {
>         private POIFSFileSystem _fileSystem;
>         private HSSFWorkbook _workbook;
>
>         /**
>          * Initialize the object - does not read yet
>          * @throws IOException
>          */
>         public ExcelFileWrapper(FileInputStream stream) throws IOException
>         {
>                 if (stream == null)
>                         throw new NullPointerException(
>                                 "in ExcelFileWrapper: ctor parameter 'stream' is null.");
>
>                 _fileSystem = new POIFSFileSystem(stream);
>                 _workbook = new HSSFWorkbook(_fileSystem);
>         }
>
>         /**
>          * Return the contents of all sheets as string.
>          * Every textual cell's content is added here.
>          */
>         public String readContents ()
>         {
>                 // return this
>                 StringBuilder builder = new StringBuilder();
>
>                 // for each sheet
>                 for (int numSheets = 0; numSheets < _workbook.getNumberOfSheets(); numSheets++)
>                 {
>                         HSSFSheet sheet = _workbook.getSheetAt(numSheets);
>
>                         // Iterate over each row in the sheet
>                         Iterator rows = sheet.rowIterator();
>                         while (rows.hasNext())
>                         {
>                                 HSSFRow row = (HSSFRow) rows.next();
>
>                                 // Iterate over each cell in the row and add the cell's content
>                                 Iterator cells = row.cellIterator();
>                                 while (cells.hasNext())
>                                 {
>                                         // get cell..
>                                         HSSFCell cell = (HSSFCell) cells.next();
>                                         // .. add to stringbuilder
>                                         processCell(cell, builder);
>                                 }
>                         }
>                 } // for numSheets ..
>
>                 //
>                 return builder.toString();
>         }
>
>         /**
>          * Add the cell's content to the StringBuilder (only if it is
>          * appropriate content, i.e. text - no numbers)
>          */
>         private void processCell (HSSFCell cell, StringBuilder builder)
>         {
>                 switch (cell.getCellType())
>                 {
>                 /*
>                 case HSSFCell.CELL_TYPE_NUMERIC:
>                         System.out.println( cell.getNumericCellValue() );
>                         break;
>                 */
>                 case HSSFCell.CELL_TYPE_STRING:
>                         builder.append (cell.getStringCellValue());
>                         builder.append (" ");
>                         break;
>
>                 default:
>                         break;
>                 }
>         }
>
> }
>
>
> - Johannes
>
>
> -----Original Message-----
> From: Michael J. Prichard [mailto:michael_prichard@mac.com]
> Sent: Monday, 31 July 2006 15:36
> To: POI Users List
> Subject: Re: Extract Text From Excel
>
> Hey Feris,
>
> That [HSSF] is what I use as well and it works pretty well.
>
> -Michael
>
> Suba Suresh wrote:
>
> > You can use the hssf libraries for excel text extraction. I used it
> > for lucene indexing.
> >
> > suba suresh.
> >
> > Feris Thia wrote:
> >
> >> Hi All,
> >>
> >> I'm new to this user group. Is there any way to extract all the text
> >> from Excel documents? I want to perform indexing using POI + Lucene :)
> >>
> >> Thanks,
> >>
> >> Feris
> >>
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
> > Mailing List:     http://jakarta.apache.org/site/mail2.html#poi
> > The Apache Jakarta Poi Project:  http://jakarta.apache.org/poi/
> >
>


Re: Extract Text From Excel

Posted by Feris Thia <fe...@gmail.com>.
Dear Johannes,

I'm sorry for getting your name wrong :)

I've been doing some googling and found that the warning produced when
parsing my Excel document is a reported bug in POI. The URL is below:

http://issues.apache.org/bugzilla/show_bug.cgi?id=30862

The cells that produce the warnings each contain a formula and a "$" character.

But I don't think it's a big problem for my application. I find your
wrapper very helpful and am very thankful for it... If you don't mind, may I
use it in my application?

Regards,

Feris
