Posted to user@poi.apache.org by teena21 <fr...@yahoo.com> on 2008/04/01 15:44:37 UTC
about HWPF
Hi all,
I am using HWPF to extract data from a Word file, but so far I can
extract only the plain text.
I am not able to extract the properties of a particular word or character,
such as bold, italic, or font name.
When I iterate over the CharacterRuns, I get font properties only at the
points where the properties change.
From this I cannot tell which words are bold and which are not.
Please help me extract the text together with its properties.
--
View this message in context: http://www.nabble.com/about-HWPF-tp16418437p16418437.html
Sent from the POI - User mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org
Re: about HWPF
Posted by teena21 <fr...@yahoo.com>.
Thanks.
Please give me some simple code that solves this problem.
Re: about HWPF
Posted by Rainer Schwarze <rs...@admadic.de>.
Hi,
you can retrieve the formatting information for a specific character or
word only by finding the CharacterRun(s) which contain it and then
reading their properties. Word files contain formatting information
in "layers". A paragraph may be bold by default and the text within it
may have specific formatting which turns "bold" off again. CharacterRun
takes care of these layers and delivers the final formatting.
To retrieve the formatting for specific words, I suggest first
identifying the position of the word in the document's text. For instance,
in document content "abc def", the word "def" is at 4-7 (counting starts
at 0, end is after the last character of the word). Now walk through the
list of CharacterRuns and find all whose range intersects the
interval of the word. If you are lucky, it's only one CharacterRun; it
gets complicated when more than one matches.
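
The lookup described above can be sketched in plain Java without any POI
dependency. The Run class below is a hypothetical stand-in for a character
run, not the HWPF class; with HWPF the offsets would presumably come from
CharacterRun.getStartOffset() and getEndOffset(), and whether those map
directly onto indexes in the extracted text is an assumption you should
check against your POI version.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: find the runs that overlap a word's half-open [start, end) interval. */
class RunLookup {

    /** Hypothetical stand-in for a character run (not the HWPF class). */
    static final class Run {
        final int start;    // offset of the first character
        final int end;      // offset just past the last character
        final boolean bold;

        Run(int start, int end, boolean bold) {
            this.start = start;
            this.end = end;
            this.bold = bold;
        }
    }

    /** Returns the runs sharing at least one character with [wordStart, wordEnd). */
    static List<Run> intersecting(List<Run> runs, int wordStart, int wordEnd) {
        List<Run> hits = new ArrayList<>();
        for (Run r : runs) {
            if (r.start < wordEnd && r.end > wordStart) {
                hits.add(r);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "abc def";
        String word = "def";
        int wordStart = text.indexOf(word);       // 4
        int wordEnd = wordStart + word.length();  // 7

        // One run covering "abc " and one covering "def".
        List<Run> runs = List.of(new Run(0, 4, false), new Run(4, 7, true));
        List<Run> hits = intersecting(runs, wordStart, wordEnd);
        System.out.println(word + " at " + wordStart + "-" + wordEnd
                + ", intersecting runs: " + hits.size());
    }
}
```

The half-open comparison means a run that ends exactly where the word starts
(here the run for "abc ") is not counted as intersecting.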
For instance, "def" could be formatted as bold, with only 'e' also being
italic. Then you get three CharacterRuns intersecting the word interval,
one each for 'd', 'e' and 'f'.
So if every intersecting CharacterRun says isBold()==true, then the
word is completely bold.
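
As a rough illustration of that rule, here is a self-contained sketch using
the "def" example. Run and allRunsMatch are hypothetical names introduced
only for this example; with HWPF the flags would be read from the real
CharacterRun (isBold(), isItalic()) instead of from fields.

```java
import java.util.List;
import java.util.function.Predicate;

/** Sketch: a word has a property only if every run overlapping it has that property. */
class WordFormatting {

    /** Hypothetical stand-in for a character run (not the HWPF class). */
    static final class Run {
        final int start;      // offset of the first character
        final int end;        // offset just past the last character
        final boolean bold;
        final boolean italic;

        Run(int start, int end, boolean bold, boolean italic) {
            this.start = start;
            this.end = end;
            this.bold = bold;
            this.italic = italic;
        }
    }

    /** True if at least one run overlaps [wordStart, wordEnd) and all overlapping runs match. */
    static boolean allRunsMatch(List<Run> runs, int wordStart, int wordEnd,
                                Predicate<Run> property) {
        boolean sawRun = false;
        for (Run r : runs) {
            boolean overlaps = r.start < wordEnd && r.end > wordStart;
            if (!overlaps) {
                continue;
            }
            sawRun = true;
            if (!property.test(r)) {
                return false;
            }
        }
        return sawRun;
    }

    public static void main(String[] args) {
        // Text "abc def": "def" is bold, and only 'e' is also italic.
        List<Run> runs = List.of(
                new Run(0, 4, false, false),  // "abc "
                new Run(4, 5, true, false),   // "d"
                new Run(5, 6, true, true),    // "e"
                new Run(6, 7, true, false));  // "f"
        System.out.println("bold: " + allRunsMatch(runs, 4, 7, r -> r.bold));
        System.out.println("italic: " + allRunsMatch(runs, 4, 7, r -> r.italic));
    }
}
```

For this data the word is completely bold but not completely italic, since
only the middle run carries the italic flag.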
Beware of CharacterRuns which have an interval outside of the text range
and also beware of CharacterRuns with length 0. I've encountered both in
various Word files.
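
One possible way to guard against both cases is to clamp every run to the
text range and drop whatever ends up empty before doing any lookup. This is
only a sketch of that idea; Run and sanitize are hypothetical names, and
clamping (rather than discarding) runs that extend past the text is a design
choice assumed here, not something HWPF does for you.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: clamp runs to [0, textLength) and discard empty ones. */
class RunSanitizer {

    /** Hypothetical stand-in for a character run (not the HWPF class). */
    static final class Run {
        final int start;  // offset of the first character
        final int end;    // offset just past the last character

        Run(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Returns clamped copies of the runs that still cover at least one character. */
    static List<Run> sanitize(List<Run> runs, int textLength) {
        List<Run> clean = new ArrayList<>();
        for (Run r : runs) {
            int start = Math.max(r.start, 0);
            int end = Math.min(r.end, textLength);
            if (start < end) {  // skips zero-length runs and runs fully outside the text
                clean.add(new Run(start, end));
            }
        }
        return clean;
    }

    public static void main(String[] args) {
        int textLength = "abc def".length();  // 7
        List<Run> raw = List.of(
                new Run(0, 4),    // fine
                new Run(4, 4),    // zero length
                new Run(4, 9),    // extends past the end of the text
                new Run(9, 12));  // entirely outside the text
        for (Run r : sanitize(raw, textLength)) {
            System.out.println(r.start + "-" + r.end);
        }
    }
}
```

With the sample data only the runs 0-4 and 4-7 remain, so a later
intersection test never sees an empty or out-of-range interval.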
Let me know if you need more information :-)
Best wishes, Rainer