Posted to java-user@lucene.apache.org by Christiaan Fluit <ch...@aduna.biz> on 2006/02/09 13:09:57 UTC

Re: Word files & Build vs. Buy?

Hello all,

I'm replying to two threads at once as what I have to say relates to both.

My company recently started an open source project called Aperture 
(http://sourceforge.net/projects/aperture), together with the German 
DFKI institute. The project is still very much in alpha stage, but I do 
believe we already have some code parts that could help people here.

Basically, it's a framework for crawling information sources (file
systems, mail folders, websites, ...) and extracting as much information
from them as possible. Besides full-text extraction, we also put a lot of
effort into extracting and modeling the metadata occurring in these
sources and document formats. Both parties have some proprietary code
lying on the shelf that is being open sourced and ported to the Aperture
architecture.

Now on to the raised questions:

arnaudbuffet@free.fr wrote:
> WordDocument wd = new WordDocument(is);

jwang@dicarta.com wrote:
> MS Word - I know that POI exists, but development on the Word portion
> seems to have stopped, and there are a lot of nasty looking bugs in
> their DB.  Since we're involved in dealing with contracts, many of our
> Word files are large and complicated.  How has everyone's experience
> with POI's Word parsing been?

My experience is that the WordDocument class crashes on about 25% of the 
documents, i.e. it throws some sort of Exception. I've tested POI 
2.5.1-final as well as the current code in CVS, but both produce this 
result. I even suspect the output to be 100% the same, but I haven't 
verified this.

Another reason I don't like this class is that it operates on an
InputStream and internally creates a POIFSFileSystem which you cannot
access. That makes it hard to also extract document metadata (for which
you need the POIFSFileSystem) without buffering the entire InputStream.
The same applies to TextMining's WordExtractor, which also operates on
top of lower level POI components.
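
To illustrate the buffering workaround mentioned above, here is a
minimal sketch using only the JDK. The class name ReplayableStream is
hypothetical and is not part of POI or Aperture; the idea is simply to
read the stream into memory once so that a text extractor and a metadata
extractor can each get their own fresh stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper: buffers a stream once so it can be read several times. */
final class ReplayableStream {
    private final byte[] bytes;

    /** Reads the given stream to its end and keeps the bytes in memory. */
    ReplayableStream(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        bytes = out.toByteArray();
    }

    /** Returns a fresh stream positioned at the start of the buffered data. */
    ByteArrayInputStream open() {
        return new ByteArrayInputStream(bytes);
    }

    /** Number of buffered bytes. */
    int size() {
        return bytes.length;
    }
}
```

The obvious cost is memory: the whole document is held in a byte array,
which is exactly the overhead I would like to avoid for large files.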

I've recently committed a WordExtractor to Aperture that uses its own
code operating on these lower level POI data structures, which works a
lot better, failing on only 5% of my 300 test docs. I don't pretend to
understand all the internals of the POI APIs, but it Works For Me.

When POI throws an exception, the WordExtractor will revert to applying
a heuristic string extraction algorithm to extract as much
human-readable text as possible from the binary stream, which works
quite well on MS Office files, i.e. the output is reasonably good for
indexing purposes.
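
To give an idea of what such a heuristic can look like, here is a
minimal sketch in the spirit of the Unix "strings" tool. This is an
assumption for illustration only, not the algorithm used in Aperture:
the class name PrintableRuns and the minLength parameter are made up,
and the sketch only recognizes single-byte printable ASCII. A real
implementation would at least also have to handle UTF-16LE text, which
Word files use frequently.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: collects runs of printable ASCII from binary data. */
final class PrintableRuns {

    /** Returns every run of at least minLength printable ASCII characters. */
    static List<String> extract(byte[] data, int minLength) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (byte b : data) {
            int c = b & 0xFF;
            if (c >= 0x20 && c <= 0x7E) {
                // printable character: extend the current run
                current.append((char) c);
            } else {
                // binary byte: close the current run, if any
                flush(current, minLength, runs);
            }
        }
        flush(current, minLength, runs);
        return runs;
    }

    /** Keeps the run if it is long enough, then resets the buffer. */
    private static void flush(StringBuilder current, int minLength, List<String> runs) {
        if (current.length() >= minLength) {
            runs.add(current.toString());
        }
        current.setLength(0);
    }
}
```

The minimum run length is the main tuning knob: a low value lets more
binary noise through, a high value drops short words.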

Be sure to check out Aperture from CVS as this code isn't part of the
alpha 1 release. The next official release is expected in a month.

jwang@dicarta.com wrote:
> RTF - javax.swing looks fine, we use those classes already.

Swing's RTFEditorKit does indeed work surprisingly well. "Surprisingly"
because in the past I had many issues with it: it typically threw
exceptions on 25-50% of my test documents. Recently I haven't seen a
single one (using Java 1.5.0), so either I am now feeding it a more
favorable document set or the Swing people have worked on the
implementation. In the latter case, people using Java 1.4.x may see
different results.
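
For those who haven't used it for text extraction before, a minimal
sketch of the RTFEditorKit approach follows. The class name RtfText is
made up for this example; the Swing calls themselves are standard JDK
API. Treat it as a starting point, not as the code used in Aperture.

```java
import java.io.IOException;
import java.io.InputStream;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;

/** Hypothetical sketch: plain-text extraction from RTF using Swing. */
final class RtfText {

    /** Parses the RTF stream into a Swing Document and returns its text. */
    static String extract(InputStream in) throws IOException, BadLocationException {
        RTFEditorKit kit = new RTFEditorKit();
        Document doc = kit.createDefaultDocument();
        // read the RTF content into the (initially empty) document
        kit.read(in, doc, 0);
        return doc.getText(0, doc.getLength());
    }
}
```

Since the parser has thrown exceptions for me in the past, callers
should be prepared to catch them and fall back to something else.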

> Word Perfect - There doesn't seem to be any converters for this format?

I'm actively working on this :) We have some proprietary code that will
become part of Aperture. I cannot yet say how well it performs in
practice, although we've never had complaints about it in our
proprietary apps.

The code uses a heuristic string extraction algorithm tuned for
WordPerfect documents. Because the approach is heuristic, this may be an
issue, e.g. when you also want to display the extraction results to end
users.

If you're interested: one way you can help me get the most out of it is 
by sending me some example WordPerfect documents because I hardly have 
those on my hard drive. Fake documents made with very new or old 
WordPerfect versions are also most welcome.

Regards,

Chris
http://aduna.biz
--

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Word files & Build vs. Buy?

Posted by Christiaan Fluit <ch...@aduna.biz>.

Dmitry Goldenberg wrote:
> Awesome stuff. A few questions: is your Excel extractor somehow
> better than POI's? and, what do you see as the timeframe for adding
> WordPerfect support? Are you considering supporting any other sources
> such as MS Project, Framemaker, etc?

I just committed a WordPerfectExtractor ;)

It's based on code developed in-house at Aduna and it seems to work
quite well on my test collection of WordPerfect documents. Occasionally
a word is split in the middle; I'm still looking into that.

The test set is biased towards older WordPerfect documents, though. I'm
trying to get my hands on a recent copy of WordPerfect to see whether the
latest format is also supported and to create unit tests for it.

To interactively test the extractor(s) yourselves:

- check out Aperture from CVS (see
http://sourceforge.net/cvs/?group_id=150969)
- run "ant release"
- go to build\release\bin and execute fileinspector.bat
- drag any file (WordPerfect or any other format) to see what MIME type 
Aperture thinks it is and to execute the corresponding Extractor, if 
available. The two tabs show the extracted full-text and an RDF dump of 
the metadata. For WordPerfect, only full-text extraction is currently 
supported.

Our ExcelExtractor is basically nothing more than glue code between POI 
and the rest of our framework, meaning that an application using the 
framework can request an Extractor implementation for 
"application/vnd.ms-excel", feed it an InputStream and get the text and 
metadata back.

The only advantage of our ExcelExtractor over direct use of POI is that, 
when POI throws an Exception on a particular document, it reverts to a 
heuristic string extraction algorithm which is often able to extract 
full-text from a document with reasonable quality, i.e. suited for indexing.
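
The general pattern (look up an extractor by MIME type, try the real
parser first, fall back to a heuristic on failure) can be sketched as
follows. All names here (TextExtractor, FallbackExtractor,
MimeTypeExtractors) are hypothetical and much simpler than the actual
Aperture interfaces. The sketch also takes a byte array instead of an
InputStream, on the assumption that the fallback needs to see the
document from the start again.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified extractor interface (not Aperture's API). */
interface TextExtractor {
    String extract(byte[] data) throws Exception;
}

/** Tries a primary extractor and reverts to a fallback when it throws. */
final class FallbackExtractor implements TextExtractor {
    private final TextExtractor primary;
    private final TextExtractor fallback;

    FallbackExtractor(TextExtractor primary, TextExtractor fallback) {
        this.primary = primary;
        this.fallback = fallback;
    }

    @Override
    public String extract(byte[] data) throws Exception {
        try {
            return primary.extract(data);
        } catch (Exception e) {
            // the format parser failed; use the heuristic instead
            return fallback.extract(data);
        }
    }
}

/** Maps MIME types to extractor implementations. */
final class MimeTypeExtractors {
    private final Map<String, TextExtractor> byMimeType = new HashMap<>();

    void register(String mimeType, TextExtractor extractor) {
        byMimeType.put(mimeType, extractor);
    }

    /** Returns the extractor for the MIME type, or null if none is registered. */
    TextExtractor get(String mimeType) {
        return byMimeType.get(mimeType);
    }
}
```

An application would register one FallbackExtractor per supported MIME
type and only ever talk to the registry, so it never needs to know
whether the text came from the parser or from the heuristic.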

We are surely considering supporting more formats. Which ones we will 
work on depends on a number of factors, e.g. availability of open source 
libs for that format, complexity of the file format (we did WordPerfect 
by ourselves), customer demand, code contributions from others, etc. In 
any case, if you need support for format XYZ, you can always send me 
some example files and I'll take a look at how hard it is to add support 
for it.

Chris
--

RE: Word files & Build vs. Buy?

Posted by Dmitry Goldenberg <dm...@weblayers.com>.

Chris,

Awesome stuff. A few questions: is your Excel extractor somehow better than POI's? and, what do you see as the timeframe for adding WordPerfect support? Are you considering supporting any other sources such as MS Project, Framemaker, etc?

Thanx,
- Dmitry

Re: Word files & Build vs. Buy?

Posted by Nick Burch <ni...@torchbox.com>.

On Thu, 9 Feb 2006, Christiaan Fluit wrote:
> Yes, that's exactly what I'm doing. Having this in POI would benefit me 
> a lot though, as I hardly understand the POI basics to be honest (my 
> fault, not POI's).

OK, that's now in POI (you'll need a scratchpad build from late yesterday
or today; see http://encore.torchbox.com/poi-cvs-build/ for jars).

The code is in org.apache.poi.hwpf.extractor.WordExtractor, and it
supports grabbing all the text, or grabbing an array of the text in each
paragraph.

If you have any problems/queries/comments on it, then you'll probably get 
a better response on poi-user than here!

Nick

Re: Word files & Build vs. Buy?

Posted by Christiaan Fluit <ch...@aduna.biz>.

Nick Burch wrote:
> You could try using org.apache.poi.hwpf.HWPFDocument, and getting the 
> range, then the paragraphs, and grab the text from each paragraph. If 
> there's interest, I could probably commit an extractor that does this to 
> poi.

Yes, that's exactly what I'm doing. Having this in POI would benefit me 
a lot though, as I hardly understand the POI basics to be honest (my 
fault, not POI's).

This is my current code (adapted from Aperture code in CVS):

import java.io.BufferedReader;
import java.io.StringReader;
import java.util.Iterator;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.model.TextPiece;

// poiFileSystem is a POIFSFileSystem created earlier from the document;
// END_OF_LINE is a line separator constant defined elsewhere
HWPFDocument doc = new HWPFDocument(poiFileSystem);
StringBuffer buffer = new StringBuffer(4096);

// concatenate all text pieces, decoding each one with its own encoding
Iterator textPieces = doc.getTextTable().getTextPieces().iterator();
while (textPieces.hasNext()) {
	TextPiece piece = (TextPiece) textPieces.next();

	// the following is derived from
	// http://article.gmane.org/gmane.comp.jakarta.poi.devel/7406
	String encoding = "Cp1252";
	if (piece.usesUnicode()) {
		encoding = "UTF-16LE";
	}

	buffer.append(new String(piece.getRawBytes(), encoding));
}

// normalize end-of-line characters and remove any lines
// containing macros
BufferedReader reader = new BufferedReader(new StringReader(buffer.toString()));
buffer.setLength(0);

String line;
while ((line = reader.readLine()) != null) {
	// skip lines that contain DOCPROPERTY field codes
	if (line.indexOf("DOCPROPERTY") == -1) {
		buffer.append(line);
		buffer.append(END_OF_LINE);
	}
}

// fetch the extracted full-text
String text = buffer.toString();


Regards,

Chris
--

Re: Word files & Build vs. Buy?

Posted by Nick Burch <ni...@torchbox.com>.

On Thu, 9 Feb 2006, Christiaan Fluit wrote:
> My experience is that the WordDocument class crashes on about 25% of the 
> documents, i.e. it throws some sort of Exception. I've tested POI 
> 2.5.1-final as well as the current code in CVS, but both produce this 
> result. I even suspect the output to be 100% the same, but I haven't 
> verified this.

You could try using org.apache.poi.hwpf.HWPFDocument, and getting the 
range, then the paragraphs, and grab the text from each paragraph. If 
there's interest, I could probably commit an extractor that does this to 
poi.

(WordDocument is from the hdf package, which is older and less reliable 
than the current hwpf stuff)

> Another reason I don't like this class is that it operates on an 
> InputStream and internally creates a POIFSFileSystem which you cannot 
> access, so that it becomes hard to extract document metadata as well 
> (for which you need the PFSFS) without buffering the entire InputStream.

If you're using HWPFDocument from cvs, then you can create that from a 
POIFSFileSystem.

Nick
