Posted to java-user@lucene.apache.org by maxSchlein <m_...@hotmail.com> on 2010/01/11 21:04:34 UTC
Text extraction from ms word doc
I am looking for a way to extract text from a Word document.
Currently I am using POI; however, when the document contains a table, POI
returns an extra non-printable character at the end of each column. The
whitespace analyzer does not filter this character out, so the last word or
phrase in a table column is not found when searching. For example, if the
word dog is the only word in a column, a search for dog returns nothing,
because the token that was indexed was "dog" followed by that character.
I can create a filter to fix this, using Apache's
StringUtils.isAsciiPrintable, but I would rather not.
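For illustration, one way such a filter might look without discarding legitimate non-ASCII letters is to replace only control characters with a space before the text reaches the analyzer. This is a minimal sketch in plain Java, not POI or Lucene code. The class and method names are hypothetical, and it assumes the stray marker is an ISO control character (for example U+0007), which is a guess because the character itself did not survive in the archived message.

```java
public class ControlCharCleaner {

    // Replaces every ISO control character (U+0000-U+001F and U+007F-U+009F)
    // with a plain space so a whitespace-based tokenizer can split on it.
    // Assumption: the marker left by the extractor is such a control character.
    static String clean(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            sb.append(Character.isISOControl(c) ? ' ' : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical extracted text: two table cells, each ended by U+0007.
        String extracted = "dog\u0007cat\u0007";
        System.out.println(clean(extracted).trim()); // prints "dog cat"
    }
}
```

Unlike a per-character ASCII-printable test, this leaves accented and other non-ASCII letters untouched, at the cost of also turning tabs and newlines into spaces.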
Any and all help is welcome and thanked.
--
View this message in context: http://old.nabble.com/Text-extraction-from-ms-word-doc-tp27116739p27116739.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Text extraction from ms word doc
Posted by Karl Wettin <ka...@gmail.com>.
Have you tried antiword?
http://www.winfield.demon.nl/
karl
Re: Text extraction from ms word doc
Posted by Michael McCandless <lu...@mikemccandless.com>.
We could also fix WhitespaceAnalyzer to filter that character out?
(Or you could make your own analyzer to do so...).
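As a rough sketch of the "own analyzer" idea, the token-boundary rule could treat control characters the same way as whitespace. The code below is plain Java and deliberately does not use Lucene classes, since the exact tokenizer API differs between Lucene versions. The class name, the method names, and the assumption that the stray marker is an ISO control character are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class ControlAwareTokenizer {

    // A character belongs to a token only if it is neither whitespace
    // nor an ISO control character (assumed to include the table marker).
    static boolean isTokenChar(char c) {
        return !Character.isWhitespace(c) && !Character.isISOControl(c);
    }

    // Splits text into tokens, using every non-token character as a separator.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "dog" ends a table cell here, so it is followed by U+0007.
        System.out.println(tokenize("dog\u0007 cat")); // prints [dog, cat]
    }
}
```

In a real analyzer the same predicate would go wherever the tokenizer decides whether a character is part of a token, so that "dog" is indexed without the trailing marker.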
You could also try asking on the tika-user list whether Tika has a
solution for mapping "extended" whitespace characters...
Mike