Posted to user@poi.apache.org by teena21 <fr...@yahoo.com> on 2008/04/01 15:44:37 UTC

about HWPF

Hi all,

I'm using HWPF to extract data from a Word file.
So far I can extract only the plain text.
I'm not able to extract the properties of a particular word or character, such as
bold, italic, or font name.

When I iterate over the CharacterRuns, I get font properties only at the points
where the properties change.
From this I can't tell which word is bold and which is not.

Please help me extract the text together with its properties.

-- 
View this message in context: http://www.nabble.com/about-HWPF-tp16418437p16418437.html
Sent from the POI - User mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: about HWPF

Posted by teena21 <fr...@yahoo.com>.



Thanks.

Please give me some simple code that solves this problem.
-- 
View this message in context: http://www.nabble.com/about-HWPF-tp16418437p16442991.html
Sent from the POI - User mailing list archive at Nabble.com.




Re: about HWPF

Posted by Rainer Schwarze <rs...@admadic.de>.

Hi,

you can only retrieve the formatting information for a specific
character or word by finding the CharacterRun(s) which contain it and
then retrieving their properties. Word files contain formatting information
in "layers". A paragraph may be bold by default and the text within it
may have specific formatting which turns "bold" off again. CharacterRun
takes care of these layers and delivers the final formatting.

To retrieve the formatting for specific words, I would suggest
identifying the position of the word in the document's text. For instance,
in document content "abc def", the word "def" is at 4-7 (counting starts
at 0, end is after the last character of the word). Now walk through the
list of CharacterRuns and find all whose range intersects the
interval of the word. If you are lucky, it's only one CharacterRun; it
gets complicated when more than one matches.
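
A minimal sketch of that lookup, in plain Java so it runs without POI on the classpath. WordRunFinder and RunSpan are hypothetical names used only for illustration; RunSpan stands in for a CharacterRun. With HWPF the offsets would presumably come from getStartOffset() and getEndOffset() on each run returned by Range.getCharacterRun(i), and it is an assumption here that those offsets line up with indexes into the extracted text.

```java
import java.util.ArrayList;
import java.util.List;

final class WordRunFinder {
    // Hypothetical stand-in for a CharacterRun: only its offsets are modelled.
    static final class RunSpan {
        final int start; // inclusive
        final int end;   // exclusive
        RunSpan(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Returns the runs whose range [start, end) overlaps the word interval
    // [wordStart, wordEnd). Two half-open intervals overlap when each one
    // starts before the other ends.
    static List<RunSpan> intersecting(List<RunSpan> runs, int wordStart, int wordEnd) {
        List<RunSpan> hits = new ArrayList<>();
        for (RunSpan r : runs) {
            if (r.start < wordEnd && r.end > wordStart) {
                hits.add(r);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "abc def";
        int wordStart = text.indexOf("def");        // 4
        int wordEnd = wordStart + "def".length();   // 7
        List<RunSpan> runs = List.of(
                new RunSpan(0, 4),   // "abc "
                new RunSpan(4, 5),   // "d"
                new RunSpan(5, 6),   // "e"
                new RunSpan(6, 7));  // "f"
        for (RunSpan r : intersecting(runs, wordStart, wordEnd)) {
            System.out.println(r.start + "-" + r.end); // prints 4-5, 5-6, 6-7
        }
    }
}
```

The run covering "abc " ends exactly where the word starts, so the half-open comparison leaves it out.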

For instance, "def" could be formatted as bold, with only the 'e' in
italic. Then you get three CharacterRuns intersecting the word interval.
If each intersecting CharacterRun says isBold()==true, then the
word is completely bold.
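
The "all intersecting runs are bold" check could be sketched as follows. This is plain Java without POI; WordBoldness, Run and Result are hypothetical names, and the bold flag on each Run is assumed to be what CharacterRun.isBold() would report for the real run.

```java
import java.util.List;

final class WordBoldness {
    // Hypothetical stand-in for a CharacterRun: offsets plus the resolved bold flag.
    static final class Run {
        final int start;    // inclusive
        final int end;      // exclusive
        final boolean bold;
        Run(int start, int end, boolean bold) {
            this.start = start;
            this.end = end;
            this.bold = bold;
        }
    }

    enum Result { ALL_BOLD, PARTLY_BOLD, NOT_BOLD, NO_RUNS }

    // Classifies the word at [wordStart, wordEnd) by looking at every
    // non-empty run that overlaps it.
    static Result classify(List<Run> runs, int wordStart, int wordEnd) {
        int hits = 0;
        int boldHits = 0;
        for (Run r : runs) {
            boolean overlaps = r.start < r.end && r.start < wordEnd && r.end > wordStart;
            if (!overlaps) {
                continue;
            }
            hits++;
            if (r.bold) {
                boldHits++;
            }
        }
        if (hits == 0) {
            return Result.NO_RUNS;
        }
        if (boldHits == hits) {
            return Result.ALL_BOLD;
        }
        return boldHits == 0 ? Result.NOT_BOLD : Result.PARTLY_BOLD;
    }

    public static void main(String[] args) {
        // "abc def": "def" is bold, and 'e' is additionally italic,
        // so the word is split into three runs that are all bold.
        List<Run> runs = List.of(
                new Run(0, 4, false),
                new Run(4, 5, true),
                new Run(5, 6, true),
                new Run(6, 7, true));
        System.out.println(classify(runs, 4, 7)); // prints ALL_BOLD
    }
}
```

The same loop works for italic or font name by swapping the flag that is compared.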

Beware of CharacterRuns which have an interval outside of the text range 
and also beware of CharacterRuns with length 0. I've encountered both in 
various Word files.
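
One way to guard against both cases is to clamp each run's interval to the text range and skip whatever is empty afterwards. This is only a sketch in plain Java; RunSanitizer and clamp are hypothetical helpers, not part of HWPF.

```java
final class RunSanitizer {
    // Clamps [start, end) to [0, textLength). Returns {start, end} of the
    // usable part, or null when the run is empty or lies outside the text.
    static int[] clamp(int start, int end, int textLength) {
        int s = Math.max(0, start);
        int e = Math.min(textLength, end);
        return s < e ? new int[] { s, e } : null;
    }

    public static void main(String[] args) {
        int textLength = "abc def".length(); // 7
        int[][] rawRuns = { { 0, 4 }, { 4, 4 }, { 4, 12 }, { 9, 12 } };
        for (int[] raw : rawRuns) {
            int[] ok = clamp(raw[0], raw[1], textLength);
            if (ok == null) {
                System.out.println(raw[0] + "-" + raw[1] + " skipped");
            } else {
                System.out.println(raw[0] + "-" + raw[1] + " used as " + ok[0] + "-" + ok[1]);
            }
        }
    }
}
```

Here the zero-length run 4-4 and the out-of-range run 9-12 are skipped, while 4-12 is cut back to 4-7.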

Let me know if you need more information :-)

Best wishes, Rainer