
Posted to user@poi.apache.org by Tales Paiva Nogueira <ta...@great.ufc.br> on 2006/12/05 21:52:52 UTC

PPT Unicode

Hi List,

    When PowerPoint stores text in Unicode, an unknown char (byte value = 
0) is placed between every "normal" char, making the text twice as long 
as it really is. I can ignore these garbage chars, but then I lose the 
text style information, as its indexes are based on the original Unicode 
text with all that Unicode trash. :(

    Is there any way to keep the style information and get the text as a 
TextBytesAtom, instead of TextCharsAtom?

Thank you very much.
--
Tales Paiva

---------------------------------------------------------------------
To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
Mailing List:     http://jakarta.apache.org/site/mail2.html#poi
The Apache Jakarta Poi Project:  http://jakarta.apache.org/poi/

Re: PPT Unicode

Posted by Nick Burch <ni...@torchbox.com>.

On Thu, 7 Dec 2006, Tales Paiva Nogueira wrote:
>   I'm not changing the text. I just read it. My problem occurs when there is 
> any TextCharsAtom because the platform I am using doesn't support Unicode,

!

> just ISO-8859-1. So I had to change the code replacing UTF-16LE by 
> ISO-8859-1.

If it runs Java, it ought to (must?) be able to handle Unicode internally, 
even if it can't display it.

So, you should be able to get away with fetching the strings from the 
TextRun, transcoding them into ISO-8859-1 in your code, and throwing away 
anything you can't cope with.
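
A minimal sketch of that transcoding step, using only the JDK (no HSLF calls). The class and method names are made up for illustration, and `text` stands in for whatever string was fetched from the TextRun. Instead of dropping unsupported characters, this version replaces each one with '?', on the assumption that keeping one output byte per input char is what keeps character-based style indexes usable.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of POI.
class Latin1Transcoder {

    /**
     * Converts a Java String to ISO-8859-1 bytes, one byte per UTF-16
     * char. Anything above U+00FF cannot be represented and becomes '?'.
     */
    static byte[] toLatin1Bytes(String text) {
        byte[] out = new byte[text.length()];
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            out[i] = (c <= 0xFF) ? (byte) c : (byte) '?';
        }
        return out;
    }

    public static void main(String[] args) {
        // "cafe" with e-acute, a space, then a euro sign (not in ISO-8859-1)
        String text = "caf\u00e9 \u20ac";
        byte[] latin1 = toLatin1Bytes(text);
        System.out.println(text.length() + " chars -> " + latin1.length + " bytes");
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1));
    }
}
```

If the positions do not matter, the `'?'` branch could skip the character instead, at the cost of shifting every index after it.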

Nick


Re: PPT Unicode

Posted by Tales Paiva Nogueira <ta...@great.ufc.br>.

Hi,

    I'm not changing the text. I just read it. My problem occurs when 
there is any TextCharsAtom because the platform I am using doesn't 
support Unicode, just ISO-8859-1. So I had to change the code replacing 
UTF-16LE by ISO-8859-1.
    So I think I have no option but to show the text without styles.

Thanks a lot,
--
Tales Paiva




Re: PPT Unicode

Posted by Yegor Kozlov <ye...@dinom.ru>.

Hi,

Could you provide a test case?

As I understand it, you did something like this:

 - take a ppt file with some text
 - programmatically change the text using the HSLF API
 - save the file
 - style information is wrong after saving.

 Is that correct?
 
 Yegor




Re: PPT Unicode

Posted by Nick Burch <ni...@torchbox.com>.

On Tue, 5 Dec 2006, Tales Paiva Nogueira wrote:
> When PowerPoint stores text in Unicode, an unknown char (byte value = 0) 
> is placed between every "normal" char, making the text twice as long 
> as it really is.

TextCharsAtoms, and other Unicode-containing fields in PowerPoint files, 
are stored as UTF-16 (little-endian). That means two bytes are used to 
store every character in the Basic Multilingual Plane. US-ASCII (and the 
rest of ISO-8859-1) will be stored with the second byte zero, but 
characters beyond that will need to make some use of the second byte.

If you call getText() on a TextCharsAtom, it'll convert it to a string for 
you. You should really be using that, not getting the bytes directly.
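
To illustrate why the zero bytes are not garbage, here is a small JDK-only sketch that decodes a hand-written UTF-16LE byte array. The array is a made-up stand-in for the raw contents of a TextCharsAtom, and the class name is hypothetical. It also shows that the decoded String has half as many chars as the raw data has bytes. Assuming the style run lengths are counted in characters rather than bytes (which is my understanding of the format, not something confirmed in this thread), they should line up with the decoded String, not with the raw byte array.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of POI.
class Utf16Decode {

    /** Decodes raw UTF-16LE bytes into a Java String. */
    static String decode(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        // 'H' and 'i' have a zero second byte; the euro sign (U+20AC)
        // is stored low byte first as AC 20 and uses both bytes.
        byte[] raw = { 'H', 0, 'i', 0, (byte) 0xAC, 0x20 };
        String text = decode(raw);
        System.out.println(raw.length + " bytes -> " + text.length() + " chars");
    }
}
```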

> Is there any way to keep the style information and get the text as a 
> TextBytesAtom, instead of TextCharsAtom?

Why? PowerPoint decided to make it a TextCharsAtom, rather than a 
TextBytesAtom, since your string contained at least one character that 
couldn't be represented in a TextBytesAtom.

HSLF supports upgrading a TextBytesAtom to a TextCharsAtom if you try to 
set text that can't be held in a TextBytesAtom. It doesn't do the other 
way around.

If you really want just the low-order bytes, call getText() on the 
TextCharsAtom, and mangle the string yourself. Not sure why you'd want to 
though....
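
For what it's worth, the "mangling" could look roughly like the sketch below. It is JDK-only, the names are invented for illustration, and `text` is assumed to be the result of getText(). It keeps the low-order byte of each char and discards the rest, which is lossless only for characters up to U+00FF.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of POI.
class LowByteMangle {

    /** Keeps only the low-order byte of every UTF-16 char in the text. */
    static byte[] lowBytes(String text) {
        byte[] out = new byte[text.length()];
        for (int i = 0; i < text.length(); i++) {
            out[i] = (byte) (text.charAt(i) & 0xFF);
        }
        return out;
    }

    public static void main(String[] args) {
        // 'A' survives intact; the euro sign (U+20AC) collapses to 0xAC.
        byte[] mangled = lowBytes("A\u20ac");
        System.out.println(new String(mangled, StandardCharsets.ISO_8859_1));
    }
}
```

The euro sign here ends up as byte 0xAC, which ISO-8859-1 reads as the "not" sign, so anything outside that range silently turns into a different character.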

Nick
