Posted to user@poi.apache.org by tamski <ak...@istel.ru> on 2008/05/13 14:25:10 UTC
Rubbish in extracted text
Hi.
When I try to extract plain text from a .doc file with
org.apache.poi.hwpf.extractor.WordExtractor, I get text containing rubbish like
\* ARABIC 3
PAGE
PAGE 7
and other unreadable characters.
Is it possible to suppress this during extraction, or by using some additional
POI tools?
Thanks in advance.
Re: Rubbish in extracted text
Posted by Nick Burch <ni...@torchbox.com>.
On Fri, 16 May 2008, Rainer Schwarze wrote:
> these are fields. A quick solution is this: Pass the extracted text
> string through a filter which removes the field codes. Fields are
> delimited by 0x13 (start), 0x14 (separator) and 0x15 (end) bytes. With
> fields which don't have a separator (0x14), remove all from 0x13 to
> 0x15.
I've just added some code to svn that implements this algorithm. It's on
Range, as Range.stripFields(String).
Nick
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org
Re: Rubbish in extracted text
Posted by Rainer Schwarze <rs...@admadic.de>.
tamski wrote:
> Hi.
> When I try to extract plain text from a .doc file with
> org.apache.poi.hwpf.extractor.WordExtractor, I get text containing rubbish like
>
>
> \* ARABIC 3
> PAGE
Hi,
these are fields. A quick solution is this: Pass the extracted text
string through a filter which removes the field codes. Fields are
delimited by 0x13 (start), 0x14 (separator) and 0x15 (end) bytes. With
fields which don't have a separator (0x14), remove all from 0x13 to
0x15. If a separator exists between start and end, remove everything from
0x13 through 0x14 and then remove the 0x15 (keeping the text between 0x14
and 0x15).
However, beware that fields can be nested, so you may well encounter
sequences like 0x13 ... 0x13 ... 0x15 ... 0x15 and much more complicated
structures.
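
A minimal sketch of such a filter, in plain Java with no POI dependency. The
class name FieldStripper and its stripFields method are made up for this
illustration and are not POI's own code. It assumes the extracted text still
contains the 0x13/0x14/0x15 marker characters, and it handles nesting by
keeping one flag per open field. A character is kept only when every enclosing
field has already passed its separator.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper for illustration only; not part of POI.
final class FieldStripper {
    private static final char FIELD_START = 0x13;
    private static final char FIELD_SEPARATOR = 0x14;
    private static final char FIELD_END = 0x15;

    private FieldStripper() {}

    static String stripFields(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // One entry per open field: TRUE once that field's 0x14 has been seen.
        Deque<Boolean> pastSeparator = new ArrayDeque<>();
        // Number of open fields that are still in their field-code part.
        int hidden = 0;

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_START) {
                pastSeparator.push(Boolean.FALSE);
                hidden++;
            } else if (c == FIELD_SEPARATOR) {
                // Switch the innermost open field from code part to result part.
                if (!pastSeparator.isEmpty() && !pastSeparator.peek()) {
                    pastSeparator.pop();
                    pastSeparator.push(Boolean.TRUE);
                    hidden--;
                }
            } else if (c == FIELD_END) {
                // Close the innermost field; if it never had a separator,
                // it was still hiding text, so release that count.
                if (!pastSeparator.isEmpty() && !pastSeparator.pop()) {
                    hidden--;
                }
            } else if (hidden == 0) {
                // Ordinary character outside any field-code part: keep it.
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

The marker characters themselves are always dropped, including stray ones
outside any field. If a 0x13 is never closed by a 0x15, this sketch drops the
rest of the text, so real documents may need a more forgiving fallback.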
Best wishes, Rainer
> PAGE 7
>
> and other unreadable characters.
>
> Is it possible to suppress this during extraction, or by using some additional
> POI tools?
>
> Thanks in advance.