
Posted to user@poi.apache.org by tamski <ak...@istel.ru> on 2008/05/13 14:25:10 UTC

Rubbish in extracted text

Hi.
When I try to extract plain text from a .doc file with
org.apache.poi.hwpf.extractor.WordExtractor, I get text containing rubbish like


\* ARABIC 3
PAGE  


PAGE  7

and other unreadable characters.

Is it possible to suppress this during extraction, or by using some additional
POI tools?

Thanks in advance.
-- 
View this message in context: http://www.nabble.com/Rubbish-in-extracted-text-tp17207175p17207175.html
Sent from the POI - User mailing list archive at Nabble.com.

Re: Rubbish in extracted text

Posted by Nick Burch <ni...@torchbox.com>.

On Fri, 16 May 2008, Rainer Schwarze wrote:
> these are fields. A quick solution is this: Pass the extracted text 
> string through a filter which removes the field codes. Fields are 
> delimited by 0x13 (start), 0x14 (separator) and 0x15 (end) bytes. With 
> fields which don't have a separator (0x14), remove all from 0x13 to 
> 0x15.

I've just added some code to svn that implements this algorithm. It's a 
method on Range: Range.stripFields(String).

Nick

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org

Re: Rubbish in extracted text

Posted by Rainer Schwarze <rs...@admadic.de>.

tamski wrote:
> Hi.
> When I'm trying to extract pure text from doc-file with
> org.apache.poi.hwpf.extractor.WordExtractor, I get text with rubbish like
> 
> 
> \* ARABIC 3
> PAGE  

Hi,

these are fields. A quick solution is to pass the extracted text 
string through a filter which removes the field codes. Fields are 
delimited by the control characters 0x13 (start), 0x14 (separator) and 
0x15 (end). For a field without a separator (0x14), remove everything 
from 0x13 through 0x15. If a separator exists between start and end, 
remove everything from 0x13 through 0x14 and then remove the 0x15, which 
keeps the text between 0x14 and 0x15 (the field result). However, beware 
that fields can be nested, so you may well encounter sequences like 
0x13 ... 0x13 ... 0x15 ... 0x15 and much more complicated ones.
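
The filter described above could be sketched roughly as follows. This is 
only an illustration in plain Java, not the code that went into POI: the 
class name FieldFilter and the method removeFieldCodes are made up for 
this example, and it assumes the field markers survive as the characters 
0x13, 0x14 and 0x15 in the extracted string. Nesting is handled by keeping 
one entry per open field and emitting text only while no open field is 
still in its code part.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper, not part of POI: strips Word field codes from text. */
public class FieldFilter {
    private static final char FIELD_START = 0x13;
    private static final char FIELD_SEPARATOR = 0x14;
    private static final char FIELD_END = 0x15;

    public static String removeFieldCodes(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // One entry per open field: TRUE while we are still in its code part
        // (between start and separator), FALSE once we are in its result part.
        Deque<Boolean> open = new ArrayDeque<>();
        int inCode = 0; // number of open fields that are still in their code part

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case FIELD_START:
                    open.push(Boolean.TRUE);
                    inCode++;
                    break;
                case FIELD_SEPARATOR:
                    // Switch the innermost field from code part to result part.
                    if (!open.isEmpty() && open.pop()) {
                        inCode--;
                    }
                    if (inCode >= 0) {
                        open.push(Boolean.FALSE);
                    }
                    break;
                case FIELD_END:
                    // Close the innermost field; a stray end marker is dropped.
                    if (!open.isEmpty() && open.pop()) {
                        inCode--;
                    }
                    break;
                default:
                    // Keep ordinary text unless some enclosing field is in its code part.
                    if (inCode == 0) {
                        out.append(c);
                    }
            }
        }
        // Note: text after an unterminated field start is dropped by this sketch.
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "Page " + FIELD_START + " PAGE " + FIELD_SEPARATOR + "7" + FIELD_END + " of 9";
        System.out.println(removeFieldCodes(raw));
    }
}
```

A field nested inside the code part of another field is removed entirely, 
while a field nested inside a result part keeps its own result text. 
Whether that matches what you want for every field type is something you 
would have to check against your documents.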

Best wishes, Rainer

> PAGE  7
> 
> and other unreadable characters.
> 
> Is it possible to restrict it while extracting or by using some additional
> POI tools?
> 
> Thanks in advance.
