Posted to user@poi.apache.org by Baby Periasamy <ba...@gmail.com> on 2011/07/27 14:46:45 UTC

How to extract special character and symbols from the word document

Hi POI users,

Can you please guide me on how I can get the special characters from a Word
document?

I am using the WordToHtmlConverter class and other classes to generate HTML
from a Word document file, to display on a JSP page.

Here I am able to get the images, tables and other paragraph text.
I am not able to extract the special characters to display on the JSP page.

Please help me out. Very urgent please.

Thanks in advance.



--
View this message in context: http://apache-poi.1045710.n5.nabble.com/How-to-extract-special-character-and-symbols-from-the-word-document-tp4638645p4638645.html
Sent from the POI - User mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: How to extract special character and symbols from the word document

Posted by Baby Periasamy <ba...@gmail.com>.
Hi,

Sorry, I am not allowed to send the document. Actually, you can see the
symbols/characters in this post by Rajeev Mohanraj.

Can you please help me out?

Thank you.



Re: How to extract special character and symbols from the word document

Posted by Baby Periasamy <ba...@gmail.com>.
Hi Nick,

I could not find any related information. Please help me out.



Re: How to extract special character and symbols from the word document

Posted by Rajeev Mohanraj <ra...@gmail.com>.
Nick Burch <nick.burch <at> alfresco.com> writes:

> 
> On Mon, 1 Aug 2011, Rajeev Mohanraj wrote:
> > I am also facing a similar kind of problem. POI doesn't read the special
> > characters & symbols from a Word document. For example, my Word document
> > contains µĪĦĜăĂ; when I read this with POI, it gives ?????? instead.
> > How do I get the exact special characters? Please help me out.
>
> Looks like you've set an incorrect encoding on your output. This comes up
> a lot, mostly with people trying to use Excel; see the archives for
> details on how to resolve it for your given platform.
> 
> Nick


Hi Nick,

I already set the encoding to UTF-8. I am using Word-to-HTML conversion with
POI, and I set the output encoding to UTF-8, but I still get the special
characters µĪĦĜăĂ as ?????? only. Do I need to read the content as UTF-8 too?
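
One way to narrow this down is to check the "output encoding" step on its own. The sketch below is only an illustration under assumptions: it does not use POI at all, and a small hand-built org.w3c.dom.Document stands in for whatever the converter produces. The class and method names (Utf8SerializeDemo, sampleDocument, serialize, readBack) are made up for this example. It serializes the DOM with an explicit UTF-8 encoding on the Transformer and parses the bytes back to see whether the characters survive.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class Utf8SerializeDemo {
    // Sample text from the thread, written with escapes so the source
    // file's own encoding cannot corrupt it.
    public static final String SAMPLE = "\u00B5\u012A\u0126\u011C\u0103\u0102";

    // Build a tiny DOM document standing in for a converter result (assumption).
    public static Document sampleDocument() throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.newDocument();
        Element p = doc.createElement("p");
        p.setTextContent(SAMPLE);
        doc.appendChild(p);
        return doc;
    }

    // Serialize the DOM to bytes, declaring the encoding on the Transformer.
    public static byte[] serialize(Document doc, String encoding) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.METHOD, "xml");
        transformer.setOutputProperty(OutputKeys.ENCODING, encoding);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toByteArray();
    }

    // Parse the bytes back and return the root element's text.
    public static String readBack(byte[] bytes) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new ByteArrayInputStream(bytes));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = serialize(sampleDocument(), "UTF-8");
        // Reports whether the characters survived serialization.
        System.out.println(SAMPLE.equals(readBack(bytes)));
    }
}
```

If the characters survive a check like this but the page still shows ??????, the loss probably happens at a later step, for example a Writer that uses the platform default encoding, or a JSP response that does not declare UTF-8.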




Re: How to extract special character and symbols from the word document

Posted by Sergey Vladimirov <vl...@gmail.com>.
Rajeev,

There can be a problem with bidirectional documents. Such things can be
complicated and errors can occur (first of all because I usually don't
work with bidirectional documents). To fix the issues, please open a new
bug in Bugzilla and attach the document and the resulting HTML, generated
using the latest build.

Also, when starting a new topic, please write a new message (do not use
reply-to-all).

Best regards,
Sergey.

On Mon, Aug 8, 2011 at 10:36 AM, Rajeev Mohanraj <ra...@gmail.com> wrote:
> Nick Burch <nick.burch <at> alfresco.com> writes:
>
>>
>> On Mon, 1 Aug 2011, Rajeev Mohanraj wrote:
>> > I am also facing a similar kind of problem. POI doesn't read the special
>> > characters & symbols from a Word document. For example, my Word document
>> > contains µĪĦĜăĂ; when I read this with POI, it gives ?????? instead.
>> > How do I get the exact special characters? Please help me out.
>>
>> Looks like you've set an incorrect encoding on your output. This comes up
>> a lot, mostly with people trying to use Excel; see the archives for
>> details on how to resolve it for your given platform.
>>
>> Nick
>
>
> Hi,
> Again I have trouble, this time with alignment. I convert a Word document
> to HTML using the POI WordToHtmlConverter, but the alignment does not come
> out properly: right-aligned content is displayed on the left side. The
> alignment formatting is missing. Please help me out.



-- 
Sergey Vladimirov



Re: How to extract special character and symbols from the word document

Posted by Rajeev Mohanraj <ra...@gmail.com>.
Nick Burch <nick.burch <at> alfresco.com> writes:

> 
> On Mon, 1 Aug 2011, Rajeev Mohanraj wrote:
> > I am also facing a similar kind of problem. POI doesn't read the special
> > characters & symbols from a Word document. For example, my Word document
> > contains µĪĦĜăĂ; when I read this with POI, it gives ?????? instead.
> > How do I get the exact special characters? Please help me out.
>
> Looks like you've set an incorrect encoding on your output. This comes up
> a lot, mostly with people trying to use Excel; see the archives for
> details on how to resolve it for your given platform.
> 
> Nick


Hi,
Again I have trouble, this time with alignment. I convert a Word document
to HTML using the POI WordToHtmlConverter, but the alignment does not come
out properly: right-aligned content is displayed on the left side. The
alignment formatting is missing. Please help me out.




Re: How to extract special character and symbols from the word document

Posted by Nick Burch <ni...@alfresco.com>.
On Mon, 1 Aug 2011, Rajeev Mohanraj wrote:
> I am also facing a similar kind of problem. POI doesn't read the special
> characters & symbols from a Word document. For example, my Word document
> contains µĪĦĜăĂ; when I read this with POI, it gives ?????? instead.
> How do I get the exact special characters? Please help me out.

Looks like you've set an incorrect encoding on your output. This comes up
a lot, mostly with people trying to use Excel; see the archives for
details on how to resolve it for your given platform.

Nick
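
A minimal sketch of the kind of encoding problem described above, using only the JDK (no POI). It takes the sample text from the thread and encodes it with a charset that cannot represent all of the characters. Java substitutes '?' for each unmappable character, which matches the reported symptom. The class and method names (EncodingDemo, roundTrip) are made up for this illustration, and this may not be the only cause in the reported case.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Sample text from the thread (µĪĦĜăĂ), written with escapes so the
    // source file's own encoding cannot corrupt it.
    public static final String SAMPLE = "\u00B5\u012A\u0126\u011C\u0103\u0102";

    // Encode the text to bytes with the given charset, then decode it
    // again with the same charset. Unmappable characters are lost.
    public static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // US-ASCII cannot represent any of the six characters.
        System.out.println(roundTrip(SAMPLE, StandardCharsets.US_ASCII)); // prints ??????
        // UTF-8 can represent all of them, so the text is unchanged.
        System.out.println(roundTrip(SAMPLE, StandardCharsets.UTF_8).equals(SAMPLE)); // prints true
    }
}
```

The same substitution happens when a Writer or an output stream is created without an explicit charset and the platform default cannot represent the characters, so passing the charset explicitly at every output step is the usual remedy.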

Re: How to extract special character and symbols from the word document

Posted by Rajeev Mohanraj <ra...@gmail.com>.
Sergey Vladimirov <vlsergey <at> gmail.com> writes:

> 
> If it's a non-secret document, you can upload it to a file-hosting
> server (like www.rapidshare.com) or send it to me privately.
> 


I am also facing a similar kind of problem. POI doesn't read the special
characters & symbols from a Word document. For example, my Word document
contains µĪĦĜăĂ; when I read this with POI, it gives ?????? instead.
How do I get the exact special characters? Please help me out.




Re: How to extract special character and symbols from the word document

Posted by Sergey Vladimirov <vl...@gmail.com>.
If it's a non-secret document, you can upload it to a file-hosting
server (like www.rapidshare.com) or send it to me privately.

-- 
Sergey

On Fri, Jul 29, 2011 at 2:13 PM, Baby Periasamy
<ba...@gmail.com> wrote:
> Yeah, I have the doc. How can I send it? There is no attachment option here.



-- 
Sergey Vladimirov



Re: How to extract special character and symbols from the word document

Posted by Baby Periasamy <ba...@gmail.com>.
Yeah, I have the doc. How can I send it? There is no attachment option here.



Re: How to extract special character and symbols from the word document

Posted by Sergey Vladimirov <vl...@gmail.com>.
Hi,

Okay, I got the idea, but I can't reproduce the problem myself. Do you
have an example of a Word document with such characters that are not
output by WordToHtmlConverter?

> The special characters and symbols could be anything. The Word document can
> have all the special characters and symbols, like mu and everything.

-- 
Sergey Vladimirov



Re: How to extract special character and symbols from the word document

Posted by Baby Periasamy <ba...@gmail.com>.
Hi,

The special characters and symbols could be anything. The Word document can
have all the special characters and symbols, like mu and everything.

After extracting mu, it gives a small square box as output. It is
happening like this.

If I copy the special symbols here, they come out as boxes and ?, so I could
not list them.
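
When characters cannot be pasted into a message, one workaround is to list their Unicode code points instead, since those are plain ASCII. The helper below is a hypothetical sketch (the names CodePointDump and describe are made up here); it takes any extracted string and prints each code point in U+XXXX form.

```java
public class CodePointDump {
    // Return the code points of the text as space-separated U+XXXX values.
    public static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // A micro sign followed by a literal question mark.
        System.out.println(describe("\u00B5?")); // prints U+00B5 U+003F
    }
}
```

The values can hint at where the problem lies. U+003F means the string already holds a literal question mark, so the character was lost before this point. U+FFFD is the Unicode replacement character and usually points to a decoding problem. A value in the private use area (U+E000 to U+F8FF) may indicate a font-specific symbol that only displays correctly with the original font.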

Please help me out.

Thanks in advance.

Regards,
Baby Periasamy.







Re: How to extract special character and symbols from the word document

Posted by Sergey Vladimirov <vl...@gmail.com>.
Hi,

What is the special character that you need to get from the Word document?
Can you provide some kind of example or describe it more specifically?

According to the specification, a "special" character is one that has the
"special" flag in its properties. AFAIR, such characters have a special
meaning and are processed by special methods (like processDeadField /
processField / processHyperlink / etc.) or are not printed in the HTML at all.

Sergey.

On Wed, Jul 27, 2011 at 4:46 PM, Baby Periasamy
<ba...@gmail.com> wrote:
> Hi POI users,
>
> Can you please guide me on how I can get the special characters from a Word
> document?
>
> I am using the WordToHtmlConverter class and other classes to generate HTML
> from a Word document file, to display on a JSP page.
>
> Here I am able to get the images, tables and other paragraph text.
> I am not able to extract the special characters to display on the JSP page.
>
> Please help me out. Very urgent please.
>
> Thanks in advance.



-- 
Sergey Vladimirov
