Posted to java-user@lucene.apache.org by Adriano Labate <la...@verticali.com> on 2003/05/28 14:03:27 UTC

RE : Parsers

The www.textmining.org text extractors work very well for Word and pdf
documents. 
They use both PDFBox and POI.

For Excel, using POI directly is very easy. Tell me if you want to see
code samples.

I'm looking myself for a Powerpoint text extractor, if you know one...

Adriano Labate


-----Original Message-----
From: Pete Lewis [mailto:pete@uptima.co.uk]
Sent: Wednesday, 28 May 2003 12:48
To: Lucene Users List
Subject: Parsers


Hi all,

I have a rather nice HTML parser that I got from SourceForge.  Does
anyone know of any good parsers for PDF and the Microsoft Office suite
(.doc, .ppt, .xls, etc.)?  Any help would be much appreciated.

Pete Lewis




---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: RE : Parsers

Posted by Pete Lewis <pe...@uptima.co.uk>.
Hi Victor

Thanks.

In the past I have used the Inso OutsideIn filters and found them very good;
however, I'd like to come up with a pure Java solution, so if there is a Java
equivalent to the Inso filters I'd be grateful for any details.  Failing that,
I thought I'd go for individual parsers, initially using the file extension
to select the correct parser, and in the future adding a file type recogniser
for files without extensions.  Hence my request to anyone who knows of good
parsers, particularly for the most common formats.

That being said, has anyone come across a Powerpoint parser?

Pete
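
A minimal sketch of the approach described above: pick a format from the file extension first, and fall back to inspecting the leading bytes when there is no usable extension. The class and method names (FormatGuesser, fromExtension, fromHeader, guess) are hypothetical and not part of any library mentioned in this thread. The byte signatures are the standard "%PDF-" header and the OLE2 compound-file header shared by .doc, .xls and .ppt, so the header check alone cannot tell the three Office formats apart.

```java
import java.util.Locale;

/** Hypothetical helper that guesses a document format so a matching text extractor can be chosen. */
public class FormatGuesser {

    public enum Format { PDF, WORD, EXCEL, POWERPOINT, HTML, OLE2_UNKNOWN, UNKNOWN }

    // Leading bytes of every OLE2 compound file (the container used by .doc, .xls and .ppt).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // Leading bytes of a PDF file.
    private static final byte[] PDF_MAGIC = { '%', 'P', 'D', 'F', '-' };

    /** Step 1: choose the format from the file extension, ignoring case. */
    public static Format fromExtension(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Format.UNKNOWN;              // no extension at all
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        switch (ext) {
            case "pdf":  return Format.PDF;
            case "doc":  return Format.WORD;
            case "xls":  return Format.EXCEL;
            case "ppt":  return Format.POWERPOINT;
            case "htm":
            case "html": return Format.HTML;
            default:     return Format.UNKNOWN;
        }
    }

    /** Step 2: recognise the format from the first bytes of the file. */
    public static Format fromHeader(byte[] header) {
        if (startsWith(header, PDF_MAGIC)) {
            return Format.PDF;
        }
        if (startsWith(header, OLE2_MAGIC)) {
            return Format.OLE2_UNKNOWN;         // some Office file; the exact type needs deeper inspection
        }
        return Format.UNKNOWN;
    }

    /** Extension first, header bytes as the fallback. */
    public static Format guess(String fileName, byte[] header) {
        Format byExtension = fromExtension(fileName);
        return byExtension != Format.UNKNOWN ? byExtension : fromHeader(header);
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A caller would read the first few bytes of the file, call FormatGuesser.guess(name, header), and hand the stream to whichever extractor is registered for the returned format.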

----- Original Message -----
From: "Victor Hadianto" <vi...@nuix.com.au>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Thursday, May 29, 2003 12:01 AM
Subject: Re: RE : Parsers


> > The www.textmining.org text extractors work very well for Word and pdf
> > documents.
> > They use both PDFBox and POI.
> >
> > For Excel, using POI directly is very easy. Tell me if you want to see
> > code samples.
> >
> > I'm looking myself for a Powerpoint text extractor, if you know one...
>
> Another solution is to use Microsoft Office itself. You can set up a server
> that serves requests to convert Microsoft Office documents. There are many
> ways of doing this, for example using Python to call Office directly and
> then putting your Python script behind a web server.
>
> Or you can set up a .NET conversion server and call that .NET service as a
> Web Service, among many other interesting techniques.
>
> victor


Re: RE : Parsers

Posted by Pete Lewis <pe...@uptima.co.uk>.
Hi guys

Thanks, Jawin looks really nice :) 

Pete
----- Original Message ----- 
From: "Andrzej Bialecki" <ab...@getopt.org>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Thursday, May 29, 2003 9:45 AM
Subject: Re: RE : Parsers


> Victor Hadianto wrote:
> >>I'm using successfully a combination of Office automation via Jawin
> >>(free Java/COM bridge) to convert PPT files. You need to learn a bit
> >>about the pseudo-object model of PowerPoint to properly convert various
> >>objects, but this information can be found at msdn.microsoft.com.
> > 
> > 
> > Hmm this is really a nice idea, I've never heard of Jawin until now. 
> > 
> >
> 
> I highly recommend it - it works pretty well, it's stable, mature, and 
> most of all free :-) Sure, it has a well-known range of problems, e.g. 
> with calls to functions that require structs, but as it happens most of 
> the automation interfaces don't use them. I've been using it for 
> Java-Windows integration on various occasions, solving such "taboo" 
> problems like reading/creating Windows shortcuts, file conversion, 
> reading Outlook mail etc.
> 
> It works also with DLL's, although this is a bit more involved... It 
> uses an extensible marshaller/de-marshaller, so if you know COM pretty 
> well you can extend it to handle any conceivable parameter types.
> 
> -- 
> Best regards,
> Andrzej Bialecki
> 
> -------------------------------------------------
> Software Architect, System Integration Specialist
> CEN/ISSS EC Workshop, ECIMF project chair
> EU FP6 E-Commerce Expert/Evaluator
> -------------------------------------------------
> FreeBSD developer (http://www.freebsd.org)


Re: RE : Parsers

Posted by Andrzej Bialecki <ab...@getopt.org>.
Victor Hadianto wrote:
>>I'm using successfully a combination of Office automation via Jawin
>>(free Java/COM bridge) to convert PPT files. You need to learn a bit
>>about the pseudo-object model of PowerPoint to properly convert various
>>objects, but this information can be found at msdn.microsoft.com.
> 
> 
> Hmm this is really a nice idea, I've never heard of Jawin until now. 
> 
>

I highly recommend it - it works pretty well, it's stable, mature, and
most of all free :-) Sure, it has a well-known range of problems, e.g.
with calls to functions that require structs, but as it happens most of
the automation interfaces don't use them. I've been using it for
Java-Windows integration on various occasions, solving such "taboo"
problems as reading/creating Windows shortcuts, file conversion,
reading Outlook mail, etc.

It also works with DLLs, although this is a bit more involved... It
uses an extensible marshaller/de-marshaller, so if you know COM pretty
well you can extend it to handle any conceivable parameter type.

-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)






Re: RE : Parsers

Posted by Victor Hadianto <vi...@nuix.com.au>.
> I'm using successfully a combination of Office automation via Jawin
> (free Java/COM bridge) to convert PPT files. You need to learn a bit
> about the pseudo-object model of PowerPoint to properly convert various
> objects, but this information can be found at msdn.microsoft.com.

Hmm this is really a nice idea, I've never heard of Jawin until now. 

wes





Re: RE : Parsers

Posted by David Warnock <da...@sundayta.com>.
Andrzej,

> Yes, I checked this solution in the past, but (unless something changed 
> drastically) OpenOffice converters and Java integration are coupled 
> tightly with the whole suite, so basically you have to install the whole 
> suite (50MB?) just to be able to use the converters. In my case (a 
> desktop utility) that would be an overkill... However, for server-based 
> converters this could make a lot of sense - but then I believe you can 
> work directly with the internal OO object model instead of xml files.

Sorry, I am so deeply into server mode these days I don't remember about 
desktop uses.

Dave
-- 
David Warnock, Sundayta Ltd. http://www.sundayta.com
iDocSys for Document Management. VisibleResults for Fundraising.
Development and Hosting of Web Applications and Sites.





Re: RE : Parsers

Posted by Andrzej Bialecki <ab...@getopt.org>.
David Warnock wrote:
> Andrzej,
> 
> Another solution for all MS Office formats is to use openoffice.org the 
> latest betas have a powerful Java SDK. So for example you could script a 
> central copy to open MS Docs and save as html for parsing in lucene. Or 
> you could save in Openoffice.org formats (which are zipped xml) and 
> throw those at lucene.
> 
> Dave
> 
>>> Another solution is to use Microsoft Office itself. You can set up a
>>> server that serves requests to convert Microsoft Office documents. There
>>> are many ways of doing this, for example using Python to call Office
>>> directly and then putting your Python script behind a web server.
> 
> 
> 

Yes, I checked this solution in the past, but (unless something changed
drastically) the OpenOffice converters and Java integration are tightly
coupled with the whole suite, so basically you have to install the whole
suite (50MB?) just to be able to use the converters. In my case (a
desktop utility) that would be overkill... However, for server-based
converters this could make a lot of sense - but then I believe you can
work directly with the internal OO object model instead of XML files.

And I agree that their Java SDK has almost everything you may want, even
a nice document bean that allows you to work with a document editor in
a JComponent.

-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)






Re: RE : Parsers

Posted by David Warnock <da...@sundayta.com>.
Andrzej,

Another solution for all MS Office formats is to use OpenOffice.org; the
latest betas have a powerful Java SDK. So, for example, you could script a
central copy to open MS documents and save them as HTML for parsing in
Lucene. Or you could save in OpenOffice.org formats (which are zipped XML)
and throw those at Lucene.

Dave
>> Another solution is to use Microsoft Office itself. You can set up a
>> server that serves requests to convert Microsoft Office documents. There
>> are many ways of doing this, for example using Python to call Office
>> directly and then putting your Python script behind a web server.


-- 
David Warnock, Sundayta Ltd. http://www.sundayta.com
iDocSys for Document Management. VisibleResults for Fundraising.
Development and Hosting of Web Applications and Sites.
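
Since OpenOffice.org documents are zipped XML, the text can be pulled out with nothing but the JDK. The following is a rough sketch under some assumptions: the class name OpenOfficeTextExtractor is hypothetical, the document body is taken to live in the zip entry named content.xml, and a line break is added after each paragraph (text:p) and heading (text:h) element. It ignores styles, tables and metadata, and it does not involve the OpenOffice.org Java SDK at all.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: extracts plain text from an OpenOffice.org (zipped XML) document. */
public class OpenOfficeTextExtractor {

    /** Scans the zip for content.xml and returns its character data. */
    public static String extractText(InputStream in)
            throws IOException, SAXException, ParserConfigurationException {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("content.xml".equals(entry.getName())) {
                    return parseContent(zip);
                }
            }
        }
        throw new IOException("No content.xml entry found; not an OpenOffice.org document?");
    }

    private static String parseContent(InputStream xml)
            throws IOException, SAXException, ParserConfigurationException {
        final StringBuilder text = new StringBuilder();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);    // so that localName is "p" for <text:p>

        factory.newSAXParser().parse(xml, new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                // Put each paragraph and heading on its own line.
                if ("p".equals(localName) || "h".equals(localName)) {
                    text.append('\n');
                }
            }

            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                // Older files reference an external DTD; return an empty one instead of loading it.
                return new InputSource(new StringReader(""));
            }
        });
        return text.toString();
    }
}
```

The returned string can then be indexed like the output of any other extractor in this thread.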





Re: RE : Parsers

Posted by Andrzej Bialecki <ab...@getopt.org>.
Victor Hadianto wrote:
>>The www.textmining.org text extractors work very well for Word and pdf
>>documents.
>>They use both PDFBox and POI.
>>
>>For Excel, using POI directly is very easy. Tell me if you want to see
>>code samples.
>>
>>I'm looking myself for a Powerpoint text extractor, if you know one...
> 
> 
> Another solution is to use Microsoft Office itself. You can set up a server
> that serves requests to convert Microsoft Office documents. There are many
> ways of doing this, for example using Python to call Office directly and
> then putting your Python script behind a web server.
>
> Or you can set up a .NET conversion server and call that .NET service as a
> Web Service, among many other interesting techniques.

I'm using successfully a combination of Office automation via Jawin 
(free Java/COM bridge) to convert PPT files. You need to learn a bit 
about the pseudo-object model of PowerPoint to properly convert various 
objects, but this information can be found at msdn.microsoft.com.

Obviously I'd love to learn about an alternative, because then I could
free my clients from dependence on Office... I already use POI to
convert XLS and DOC files, and it works _very_ well.


-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)






Re: RE : Parsers

Posted by Victor Hadianto <vi...@nuix.com.au>.
> The www.textmining.org text extractors work very well for Word and pdf
> documents.
> They use both PDFBox and POI.
>
> For Excel, using POI directly is very easy. Tell me if you want to see
> code samples.
>
> I'm looking myself for a Powerpoint text extractor, if you know one...

Another solution is to use Microsoft Office itself. You can set up a server
that serves requests to convert Microsoft Office documents. There are many
ways of doing this, for example using Python to call Office directly and
then putting your Python script behind a web server.

Or you can set up a .NET conversion server and call that .NET service as a
Web Service, among many other interesting techniques.

victor




RE : Parsers

Posted by Adriano Labate <la...@verticali.com>.
Pete,

Here are some samples.

For Word using Textmining:
	String textContent = new WordExtractor().extractText(inputStream);

For PDF using Textmining:
	String textContent = new PDFExtractor().extractText(inputStream);

For Excel using POI:
(From http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-user@jakarta.apache.org&msgId=698633)

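    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Iterator;
    import org.apache.poi.hssf.usermodel.HSSFCell;
    import org.apache.poi.hssf.usermodel.HSSFRow;
    import org.apache.poi.hssf.usermodel.HSSFSheet;
    import org.apache.poi.hssf.usermodel.HSSFWorkbook;

    /**
     * Extract text from a Microsoft Excel input stream.
     * @param inputStream the contents of the .xls file
     * @return The raw text obtained by concatenating all text cells
     *         from top to bottom, left to right.
     * @throws IOException if the stream cannot be read as a workbook
     */
    private static String extractExcelContent(InputStream inputStream) throws IOException {
        HSSFWorkbook wb = new HSSFWorkbook(inputStream);
        int nbSheets = wb.getNumberOfSheets();
        StringBuffer content = new StringBuffer(1024);

        for (int i = 0; i < nbSheets; i++) {
            HSSFSheet sheet = wb.getSheetAt(i);
            // getLastRowNum() is the 0-based index of the last row,
            // so the bound is inclusive; otherwise the last row is skipped.
            int lastRowNum = sheet.getLastRowNum();

            for (int j = 0; j <= lastRowNum; j++) {
                HSSFRow row = sheet.getRow(j);
                if (row == null)    // empty row
                    continue;

                boolean isLineFound = false;
                Iterator it = row.cellIterator();
                while (it.hasNext()) {
                    HSSFCell cell = (HSSFCell)it.next();
                    int type = cell.getCellType();

                    // Only string cells are kept; numeric and formula cells are ignored.
                    if (type == HSSFCell.CELL_TYPE_STRING) {
                        content.append(cell.getStringCellValue());
                        content.append(" ");
                        isLineFound = true;
                    }
                }

                if (isLineFound)
                    content.append("\n");       // separate lines/rows
            }
        }

        return content.toString();
    }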

Adriano


-----Original Message-----
From: Pete Lewis [mailto:pete@uptima.co.uk]
Sent: Wednesday, 28 May 2003 15:02
To: Lucene Users List
Subject: Re: Parsers


Hi Adriano

Thanks.  Code samples would be nice :)

Will come back if I find something for .ppt.

Pete



Re: Parsers

Posted by Pete Lewis <pe...@uptima.co.uk>.
Hi Adriano

Thanks.  Code samples would be nice :)

Will come back if I find something for .ppt.

Pete
