Posted to user@poi.apache.org by Doppelhofer Andreas <An...@salomon.at> on 2010/01/19 13:59:14 UTC

how to set character encoding in new doc file

Hi all,
I am trying to generate a .doc file whose text comes from several character
encodings, such as ISO-8859-1 and ISO-8859-5.
In my app I read an existing .doc file, make some changes, and then write
the result to a new file.
But when I open the new file with Microsoft Word (2003 SP3), I only get "?"
for all unknown characters.
When I print debug messages to stdout, all characters are shown correctly
(I think they are Unicode).

How can I set an encoding for a Paragraph/Text?

I am working with Eclipse on Windows, using poi-bin-3.6-20091214.
 
thx
dops
 

-- 


Salomon Automation GmbH - Friesachstrasse 15 - A-8114 Friesach bei Graz
Sitz der Gesellschaft: Friesach bei Graz
UID-NR:ATU28654300 - Firmenbuchnummer: 49324 K
Firmenbuchgericht: Landesgericht für Zivilrechtssachen Graz


Re: Re: how to set character encoding in new doc file

Posted by Doppelhofer Andreas <An...@salomon.at>.
This code shows how I read the doc file:

        for (int i = 0; i < range.numParagraphs(); i++) {

            Paragraph myparagraph = range.getParagraph(i);
            // text of the current paragraph, with line-break characters removed
            String mytext = myparagraph.text();
            mytext = mytext.replace("\r", "");
            mytext = mytext.replace("\n", "");
            ...
            ...
        }
Is this OK?


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: Re: Re: how to set character encoding in new doc file

Posted by Doppelhofer Andreas <An...@salomon.at>.
I want to store Unicode characters in a Word doc, but if I store some Russian
characters, only "?" is displayed (these characters exist in Unicode).
I think these characters are encoded as Unicode, because when I print them to
sysout they are displayed correctly.

This sample gets the text from the doc and prints it to stdout:

            System.out.println("#########");
            TextPiece piece;
            Iterator textPieces = mydoc_output.getTextTable().getTextPieces().iterator();
            String text1;
            StringBuffer buffer = new StringBuffer();
            // Concatenate the raw bytes of every text piece, decoded as UTF-16LE.
            // Note: this treats every piece as 16-bit Unicode text.
            while (textPieces.hasNext()) {
                piece = (TextPiece) textPieces.next();

                try {
                    text1 = new String(piece.getRawBytes(), "UTF-16LE");

                    buffer.append(text1);

                } catch (UnsupportedEncodingException e) {
                    // UTF-16LE is a standard charset, so this should never happen
                    throw new InternalError("Standard encoding UTF-16LE not found, JVM broken");
                }
            }
            text1 = buffer.toString();
            System.out.println(text1);
            System.out.println("+#+#+#+#+#+");

e.g.
#########
ﻱﺑẬ


"April"
"Апрель"
+#+#+#+#+#+
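
The sample above decodes every text piece as UTF-16LE. Since Word can store strings as Unicode or non-Unicode, a piece stored as 8-bit text would be mis-decoded that way. A minimal sketch of choosing the charset per piece, using plain JDK classes only. The `unicode` flag and the choice of windows-1252 for 8-bit pieces are assumptions made for illustration, not something confirmed in this thread; check what your HWPF version exposes before relying on it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class PieceDecoder {

    // Assumption: 8-bit (non-Unicode) pieces use windows-1252.
    private static final Charset EIGHT_BIT = Charset.forName("windows-1252");

    // Decode the raw bytes of one piece. 'unicode' is a hypothetical flag
    // saying whether the piece is stored as 16-bit little-endian text.
    static String decode(byte[] raw, boolean unicode) {
        return new String(raw, unicode ? StandardCharsets.UTF_16LE : EIGHT_BIT);
    }

    public static void main(String[] args) {
        byte[] eightBit = {0x41, 0x70, 0x72, 0x69, 0x6C}; // "April", one byte per character
        byte[] sixteenBit = {0x10, 0x04, 0x3F, 0x04};     // two Cyrillic letters in UTF-16LE

        // Each piece decoded with the charset that matches how it is stored.
        System.out.println(decode(eightBit, false));
        System.out.println(decode(sixteenBit, true).length());

        // Decoding the 8-bit bytes as UTF-16LE does not give "April" back.
        System.out.println(decode(eightBit, true).equals("April"));
    }
}
```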

Then I add text1 to the range, and I get only "?" for the Russian characters.
-- begin output word doc

???

"April"
"??????"
-- end word doc

dops
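
One way such "?" characters can arise in Java in general: when a String is converted to bytes with a charset that has no mapping for a character, `String.getBytes` substitutes the charset's replacement byte, which is "?" for windows-1252. Whether HWPF 3.6 does this inside `insertAfter` is an assumption, not something established in this thread. The sketch below only demonstrates the JDK behaviour; the class and method names are made up for the example.

```java
import java.nio.charset.Charset;

class QuestionMarkDemo {

    // Encode the text to bytes in the named charset, then decode it again.
    // Characters the charset cannot represent are lost in the first step.
    static String roundTrip(String text, String charsetName) {
        Charset charset = Charset.forName(charsetName);
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // The Russian word for "April", written with escapes so the source
        // file encoding does not matter.
        String russian = "\u0410\u043f\u0440\u0435\u043b\u044c";

        System.out.println(roundTrip("April", "windows-1252")); // Latin text survives
        System.out.println(roundTrip(russian, "windows-1252")); // Cyrillic is replaced
        System.out.println(roundTrip(russian, "UTF-16LE").equals(russian));
    }
}
```

If the string survives a UTF-16LE round trip but not a windows-1252 one, the text itself is fine and the loss happens wherever an 8-bit charset is applied.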





Re: Re: Re: how to set character encoding in new doc file

Posted by MSB <ma...@tiscali.co.uk>.
Hello Andreas,

I think that Nick is referring to explicitly encoding the Strings using the
required/desired character encoding; the java.lang.String class has
constructors that let you specify the character encoding of the bytes you
obtain from the String you have read.

Remember that HWPF is still very immature as an API, and it could well be that
the sort of thing you are asking for has not yet been included. The best
long-term solution may be to join the development team and contribute.

Yours

Mark B
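
As a hedged illustration of the String/encoding APIs mentioned above: `String.getBytes(Charset)` and `new String(byte[], Charset)` convert between Java strings and bytes, and `CharsetEncoder.canEncode` reports beforehand whether a charset can represent a given text. The class and method names below are made up for the example; it uses only the JDK, does not touch HWPF, and assumes the JVM provides the listed charsets.

```java
import java.nio.charset.Charset;

class CanEncodeCheck {

    // True if every character of the text has a mapping in the named charset.
    static boolean fitsIn(String text, String charsetName) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // The Russian word for "April", written with escapes.
        String russian = "\u0410\u043f\u0440\u0435\u043b\u044c";

        String[] names = {"ISO-8859-1", "ISO-8859-5", "windows-1252", "UTF-16LE"};
        for (String name : names) {
            System.out.println(name + ": " + fitsIn(russian, name));
        }
    }
}
```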



-- 
View this message in context: http://old.nabble.com/how-to-set-character-encoding-in-new-doc-file-tp27225418p27273764.html
Sent from the POI - User mailing list archive at Nabble.com.




Re: Re: how to set character encoding in new doc file

Posted by Doppelhofer Andreas <An...@salomon.at>.
I use HWPFDocument(...) to read the document. When I print the string (some text in the doc) to stdout/stderr,
all characters are displayed correctly, but when I write it to a new doc file, all Russian characters are
stored as "?".

This is OK:
System.out.println(line);

This is not OK (after opening with Word):
range.insertAfter(line);

dops
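
Console output depends on the console's own encoding, so it can help to inspect the String itself before calling `insertAfter`. A minimal sketch, assuming Java 8 or later (the helper name `dump` is made up): it lists the Unicode code points of a string, so real Cyrillic letters (U+04xx) can be told apart from literal question marks (U+003F).

```java
import java.util.stream.Collectors;

class CodePointDump {

    // Render each code point of the text as U+XXXX, separated by spaces.
    static String dump(String text) {
        return text.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Three Cyrillic letters followed by a literal question mark.
        System.out.println(dump("\u0410\u043f\u0440?"));
    }
}
```

If the dump already shows U+003F where Russian letters are expected, the characters were lost before the text reached the write call; if it shows U+04xx values, the loss happens during or after writing.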



Re: Re: how to set character encoding in new doc file

Posted by Nick Burch <ni...@alfresco.com>.
On Fri, 22 Jan 2010, Doppelhofer Andreas wrote:
> Can anybody help me with this problem?

Word (plus Excel, PowerPoint etc.) can store strings as Unicode or
non-Unicode. POI works only with Java Unicode strings, and handles
reading and writing the strings to the appropriate kinds of bytes for you.

Make sure you're correctly passing your strings as Unicode into Java,
converting the encoding as needed.

Nick
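
A minimal sketch of what "passing your strings as Unicode into Java" can look like when the source text arrives as bytes in a legacy encoding: name the encoding explicitly when building the String. The byte values and the class and method names are illustrative assumptions; the sketch uses only the JDK and assumes the JVM provides the ISO-8859-5 charset.

```java
import java.nio.charset.Charset;

class DecodeToUnicode {

    // Build a Java (Unicode) String from bytes known to be in the named encoding.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // The Russian word for "April" as ISO-8859-5 bytes.
        byte[] cyrillic = {(byte) 0xB0, (byte) 0xDF, (byte) 0xE0,
                           (byte) 0xD5, (byte) 0xDB, (byte) 0xEC};
        // The German word for "March" as ISO-8859-1 bytes.
        byte[] latin = {0x4D, (byte) 0xE4, 0x72, 0x7A};

        String russian = decode(cyrillic, "ISO-8859-5");
        String german = decode(latin, "ISO-8859-1");

        // Once decoded, both are ordinary Java strings; the original
        // byte encoding no longer matters.
        System.out.println(russian.length() + " chars, first is U+"
                + Integer.toHexString(russian.charAt(0)).toUpperCase());
        System.out.println(german.length() + " chars");
    }
}
```

Decoding the same bytes with the wrong charset name gives different characters without any error, so the encoding of each input has to be known rather than guessed.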



Re: how to set character encoding in new doc file

Posted by Doppelhofer Andreas <An...@salomon.at>.
Can anybody help me with this problem? Are there any how-tos where I can find information
about character encodings?
Thx, dops
