Posted to user@poi.apache.org by makadefia <se...@gmail.com> on 2016/01/13 09:48:29 UTC

removing hidden characters

Hi, I'm parsing a Word document using Apache POI.
The problem I have right now is that after parsing, the resulting String
(I'm using Java) still contains some special characters.
A couple of examples:

<TitreType>DRAFT REPORT</TitreType>

<RefProcLect>***I</RefProcLect>

I'm not sure how to remove these: if I strip everything that appears
between < and >, I might end up removing parts of the real document.
Is there a way to remove only the special characters added by Word?
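
One way to avoid stripping real content is to remove only markers whose
names are known in advance, and leave every other < or > alone. The
following is a minimal sketch using plain java.util.regex, not a POI
feature. The class MarkerStripper and the method stripMarkers are
hypothetical names, and the marker list is taken from the two examples
above; it assumes the markers really appear in the extracted String as
literal <Name> and </Name> text.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Apache POI.
class MarkerStripper {

    // Removes opening and closing markers whose name is in markerNames,
    // keeping the text between them. Anything else in angle brackets,
    // and any stray < or >, is left unchanged.
    static String stripMarkers(String text, Collection<String> markerNames) {
        if (markerNames.isEmpty()) {
            return text;
        }
        // Quote each name so it is matched literally inside the regex.
        String alternation = markerNames.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        Pattern markers = Pattern.compile("</?(?:" + alternation + ")>");
        return markers.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Assumption: these names come from the examples in this message.
        Collection<String> names = Arrays.asList("TitreType", "RefProcLect");
        String extracted = "<TitreType>DRAFT REPORT</TitreType>\n"
                + "<RefProcLect>***I</RefProcLect>\n"
                + "where a < b and c > d";
        System.out.println(stripMarkers(extracted, names));
    }
}
```

The trade-off is that the list of marker names has to be maintained by
hand; a marker that is not in the list stays in the output rather than
being removed by accident.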



--
View this message in context: http://apache-poi.1045710.n5.nabble.com/removing-hidden-characters-tp5721564.html
Sent from the POI - User mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: removing hidden characters

Posted by makadefia <se...@gmail.com>.
I think this file should be enough to reproduce the problem.
The whole document is private, so I don't think I can share it.
Regards. example-for-poi.doc
<http://apache-poi.1045710.n5.nabble.com/file/n5721592/example-for-poi.doc>  



--
View this message in context: http://apache-poi.1045710.n5.nabble.com/removing-hidden-characters-tp5721564p5721592.html


Re: removing hidden characters

Posted by makadefia <se...@gmail.com>.
I tried the .docx format with

XWPFDocument doc = new XWPFDocument(fis);
XWPFWordExtractor ex = new XWPFWordExtractor(doc);
String toReturn = ex.getText();

Same result: the resulting String still has the hidden characters.



--
View this message in context: http://apache-poi.1045710.n5.nabble.com/removing-hidden-characters-tp5721564p5721593.html


Re: removing hidden characters

Posted by Dominik Stadler <do...@gmx.at>.
Hi,

Do you have a sample file to go along? Otherwise it is hard to say much,
only that it sounds like a bug in the text-extraction. So it would be best
if you can open a bug entry at https://bz.apache.org/bugzilla/enter_bug.cgi
together with additional information/files so we can track any potential
fix there.

Dominik.

On Wed, Jan 13, 2016 at 10:04 AM, makadefia <se...@gmail.com>
wrote:

> the way I'm getting the text is like this
>
>                 HWPFDocument doc = new HWPFDocument(inputStream);
>                 WordExtractor ex = new WordExtractor(doc);
>                 String toReturn = ex.getText();
>                 ex.close();
>                 return toReturn;

Re: removing hidden characters

Posted by makadefia <se...@gmail.com>.
The way I'm getting the text is like this:

		HWPFDocument doc = new HWPFDocument(inputStream);
		WordExtractor ex = new WordExtractor(doc);
		String toReturn = ex.getText();
		ex.close();
		return toReturn;



--
View this message in context: http://apache-poi.1045710.n5.nabble.com/removing-hidden-characters-tp5721564p5721565.html