
Posted to user@poi.apache.org by Chris Gioran <hi...@gmail.com> on 2007/02/08 19:48:11 UTC

hslf way of getting slideshow text language

Hi all,

I am searching for a way to extract the language information that
corresponds to specific bits of text in .ppt documents. Googling and
searching through both the scratchpad hslf code and the file format specs
from wotsit gave me no useful information. The one thing I learned is
that it works altogether differently from .doc documents, where I have
successfully carried out this task in the past. Any pointers at all
would be appreciated.
To be more specific: from my understanding, I need to grab the
record with type 4010, which the code in Sheet.java, in method
findTextRuns(), just skips as "Safe to ignore". Is this stored and parsed
somewhere that I missed, or must I do all the parsing myself? In that
case, could you please give me some pointers on where to
begin?

Thanks in advance for any help.

Chris Gioran


---------------------------------------------------------------------
To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
Mailing List:     http://jakarta.apache.org/site/mail2.html#poi
The Apache Jakarta Poi Project:  http://jakarta.apache.org/poi/

Re: hslf way of getting slideshow text language

Posted by Chris Gioran <hi...@gmail.com>.

regards everyone,

It appears that I have some useful results regarding the extraction of
content language information from ppt documents using hslf. Note that all
the information below is derived from reverse engineering, and its
correctness remains open to question until more experiments are done.

The data set I used consisted of many ppt documents, from different versions
of MS PowerPoint, in Greek, French, English and Italian. I have noticed the
following patterns:

4008 atoms are followed by a 4010 atom that stores language information. This
is also true for 4000 atoms that hold Unicode text mixed *with* non-Unicode
text (my guess is that the system's default language is stored as Unicode and
all others as ASCII; more to find out on that). Now, the 4010 atoms that
correspond to 4008 atoms have a fairly consistent appearance, that is:

first 4 bytes as known (record type and code)
next 4 bytes length (also known)

what follows are records that hold information regarding language ID and 
spelling info in the following format:

first, the number of characters this record applies to (4 bytes)

the next bytes are a bit more complicated. So far I have encountered 2 types
of data: either the value 0x00000006 or 0x000k00000007 (the lengths are
correct, that is, they *are* different). These certainly carry spelling
information, which is apparent from the transition from the second to the
first when the "ignore spelling" option is selected for that text. The k above
is a value that varies and that I have so far been unable to attribute to any
property, presumably due to the simplicity of the text I have used.

After that comes the language information (2 bytes, as known for MS formats)
and then a trailer value of 2 bytes that is constantly 0x0000.

The atom ends with bytes whose appearance I have so far been unable to
control. They are few, say 8 or 10 bytes, and mostly zero. They could be
information that applies to the slide as a whole, but I have nothing on it
yet.
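
The layout described above can be sketched in Java roughly as follows. This
is not POI code: SpecInfoSketch and Run are made-up names for illustration.
It assumes all values are little-endian, that the record type sits in bytes
2-3 of the 8-byte header, and that the 6-byte 0x000k00000007 form is a 4-byte
value 7 followed by a 2-byte "k" word. That last reading is only one possible
interpretation of the dumps, so treat it as a guess.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of POI): decodes 4010 atom runs as described in this post. */
final class SpecInfoSketch {

    /** One run: character count, flag word, optional "k" word, language ID. */
    static final class Run {
        final long charCount;
        final int flags;   // 6 or 7 in the files examined so far
        final int extra;   // the "k" word, or -1 when the run has none
        final int langId;  // MS language ID, e.g. 1033

        Run(long charCount, int flags, int extra, int langId) {
            this.charCount = charCount;
            this.flags = flags;
            this.extra = extra;
            this.langId = langId;
        }
    }

    /** Parses a whole atom, including its 8-byte header. */
    static List<Run> parse(byte[] atom) {
        if (atom.length < 8) {
            throw new IllegalArgumentException("too short for a record header");
        }
        ByteBuffer buf = ByteBuffer.wrap(atom).order(ByteOrder.LITTLE_ENDIAN);
        buf.getShort();                          // first 2 header bytes, unused here
        int type = buf.getShort() & 0xFFFF;      // record type
        if (type != 4010) {
            throw new IllegalArgumentException("not a 4010 atom: " + type);
        }
        int length = buf.getInt();               // length of the data after the header
        int end = Math.min(atom.length, 8 + length);

        List<Run> runs = new ArrayList<>();
        // Shortest run: count (4) + flags (4) + language ID (2) + trailer (2) = 12 bytes.
        while (end - buf.position() >= 12) {
            long count = buf.getInt() & 0xFFFFFFFFL;
            int flags = buf.getInt();
            int extra = -1;
            if (flags == 7) {
                // Assumed 6-byte form: 2 more bytes ("k") before the language ID.
                if (end - buf.position() < 6) {
                    break;
                }
                extra = buf.getShort() & 0xFFFF;
            } else if (flags != 6) {
                break;                           // unknown layout, stop rather than guess
            }
            int langId = buf.getShort() & 0xFFFF;
            buf.getShort();                      // 2-byte trailer, 0x0000 so far
            runs.add(new Run(count, flags, extra, langId));
        }
        return runs;                             // unexplained trailing bytes are left unparsed
    }
}
```

Feeding it the raw bytes of a 4010 record (header included) should give one
Run per stretch of text, and anything that does not match the two forms seen
so far simply ends the loop.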

Hope this information is of some value for more generalized results. Time
constraints do not allow me to work on this much, however, so some feedback
is, as always, appreciated. I will post any further findings I come up with.

cheers,
Chris Gioran


Re: Re[2]: hslf way of getting slideshow text language

Posted by himicos <hi...@gmail.com>.

On 2/14/07, himicos <hi...@gmail.com> wrote:
> Atoms 4000 are not necessarily followed by such an atom,
> and if they do it has no value that can be mapped to a language code.....

Correction on that. It *has* language info if it contains English text
(i.e. English mixed with Greek), and then it has language ID 1033 (which
obviously corresponds to the English text).

Moreover, I experimented with French (as another language) and noticed
that it is stored in 4008 atoms, followed by 4010 atoms containing
language info. This is an early result, however; it has to be put into
context.

Chris


Re: Re[2]: hslf way of getting slideshow text language

Posted by himicos <hi...@gmail.com>.

On 2/13/07, Yegor Kozlov <ye...@dinom.ru> wrote:

> Probably the language info is really stored in TextSpecInfoAtom.
> Do you have any idea about internal structure of this record? If you
> do, please share it.
>
> Yegor
>

I do not have any knowledge about the internal structure of 4010 atoms
(or any other, for that matter) that does not come from either the SDK
specs or from reverse engineering. However, I will tell you what I
have noticed.

First, let me state that for the moment all of my efforts are focused
on slide text extraction. I will do notes later. Also, though it does
not harm generality, I work with English and Greek text, so all of my
examples use MS language codes 1032 and 1033. My PowerPoint
version is 2003 SP2 on WinXP Pro SP2.

From my early tests, I notice that text is stored in two types of
records. Atom 4000 holds Unicode text and atom 4008 holds any text
that can be represented as pure ASCII. This follows from the fact that
any attempt to insert Greek characters in the text results in the string
being stored as Unicode, whereas all pure English text is stored as atom
4008. What is interesting is that all 4008 atoms are followed by a
4010 atom that has the value 1033 (0x0409, stored little-endian as the
bytes 09 04) at offset 0x12 or 0x16 (pointed to by the atom header; more
to find out). Atoms 4000 are not necessarily followed by such an atom,
and if they are, it has no value that can be mapped to a language code.
This can be explained by that 4010 atom holding some information about
the text that has nothing to do with language (as per the spec, 4010
atoms also hold non-language info).
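
To make the byte order concrete, here is a small Java sketch of reading a
2-byte language ID from a raw record. LangIdSketch is a made-up name, not a
POI class, and the lookup covers only the two IDs used in my examples. The
offset is whatever the caller has found (0x12 or 0x16 in my files), not
something the sketch works out by itself.

```java
/** Hypothetical helper (not part of POI): reads a little-endian language ID. */
final class LangIdSketch {

    /** Combines two bytes, low byte first, into an unsigned 16-bit value. */
    static int readLangId(byte[] data, int offset) {
        return (data[offset] & 0xFF) | ((data[offset + 1] & 0xFF) << 8);
    }

    /** Names only the two language IDs mentioned in this thread. */
    static String describe(int langId) {
        switch (langId) {
            case 1032: return "Greek";                    // 0x0408
            case 1033: return "English (United States)";  // 0x0409
            default:   return "unknown (" + langId + ")";
        }
    }
}
```

With the bytes 09 04 at the given offset, readLangId returns 1033, which
describe maps to English.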

I believe that my presentation is consistent, since for Unicode text
the notion of language ID does not apply and any ASCII text has
language info attached to it. However, besides the fact that I could
be way off the mark here, I am troubled by the absence of codepages in
non-English text. I have yet to get PowerPoint to store Greek text as
non-Unicode.

I will continue my work with more ppt's and get back with more info. I
would appreciate some feedback though, even if it is simply ideas for
test cases.

cheers,
Chris


Re[2]: hslf way of getting slideshow text language

Posted by Yegor Kozlov <ye...@dinom.ru>.


h> This led me to believe that this atom holds the information I seek. It
h> is also the only place that language information in mentioned in the
h> spec. Even if I am wrong, is there any knowledge as to where language
h> ID for the text runs is held? Right now I am experimenting with
h> StyleTextPropAtom. I will get back with more info and hopefully more
h> precise questions.

Probably the language info is really stored in TextSpecInfoAtom.
Do you have any idea about internal structure of this record? If you
do, please share it.

Yegor



Re: hslf way of getting slideshow text language

Posted by himicos <hi...@gmail.com>.

On 2/9/07, Yegor Kozlov <ye...@dinom.ru> wrote:
>
> 4010 is TextSpecInfoAtom. It stores special format runs that can't be
> described by normal style records. For example,
> if a part of a string is a hyperlink this info is stored in TextSpecInfoAtom.
> AFAIK it has nothing to do with the language information.
>
> Yegor

Quote from ppt spec downloaded from wotsit.org describing the
TextSpecInfoAtom (4010)

"The special info runs contained in this text. "Special infos" are
character properties that don't follow styles, such as background
spelling info or language ID. Special parsing code is needed to parse
content of this atom."

End quote.

This led me to believe that this atom holds the information I seek. It
is also the only place that language information is mentioned in the
spec. Even if I am wrong, is there any knowledge as to where language
ID for the text runs is held? Right now I am experimenting with
StyleTextPropAtom. I will get back with more info and hopefully more
precise questions.

Thanks for your time,
Chris Gioran


Re: hslf way of getting slideshow text language

Posted by Yegor Kozlov <ye...@dinom.ru>.

4010 is TextSpecInfoAtom. It stores special format runs that can't be
described by normal style records. For example,
if a part of a string is a hyperlink this info is stored in TextSpecInfoAtom.
AFAIK it has nothing to do with the language information.

Yegor
