Posted to user@poi.apache.org by "Polk, Scott W" <Sc...@Pearson.com> on 2012/10/11 20:30:23 UTC

[HWPF] Get Style Name for Paragraph/Character from Word Doc (.doc)

I have an MS Word document created in Word 2007, but saved in Word
97-2003 format (.doc).  This document contains 2 lines as follows:

Test Test2
Test3

The style of the first line is set to Quote, while the style of the
second line is set to Strong.

Using poi-3.8-20120326.jar and poi-scratchpad-3.8-20120326.jar, I am
looping through each paragraph, retrieving the Style Index
(p.getStyleIndex), and getting the Style Name (style.getName).  This
works great for the first paragraph and returns the name Quote.  For the
second paragraph (Test3), it returns the Style Name Normal.  Even the
Style Index is returned as 0.  To make things more interesting, if I
change the style of the first paragraph to Strong and leave the second
paragraph as Strong, the code returns the Style Name Quote for the first
paragraph and Normal for the second paragraph.

Any help would be appreciated.  I would be happy to provide my code and
test document, if needed.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


RE: [HWPF] Get Style Name for Paragraph/Character from Word Doc (.doc)

Posted by "Polk, Scott W" <Sc...@Pearson.com>.
I retrieved the start and end offset of each paragraph, and the start
and end offset of each character run in each paragraph.  Here are the
results:

Paragraph: "Test Test2
"
  Start: 0
  End: 11
CharacterRun: "Test Test2"
  Start: 0
  End: 10
CharacterRun: "
"
  Start: 10
  End: 11

Paragraph: "Test3
"
  Start: 11
  End: 17
CharacterRun: "Test3
"
  Start: 11
  End: 17
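
As a quick sanity check on the offsets above, here is a minimal sketch
that verifies a paragraph's character runs cover it end to end with no
gaps or overlaps.  RunCoverage and tilesParagraph are hypothetical names
for illustration and are not part of POI; the offsets are copied from the
output above, and end offsets are assumed to be exclusive.

```java
// Hypothetical helper, not part of POI: checks that a paragraph's character
// runs cover it end to end with no gaps or overlaps.
class RunCoverage {

    // Each run is {start, end}; the end offset is assumed to be exclusive.
    static boolean tilesParagraph(int paraStart, int paraEnd, int[][] runs) {
        int cursor = paraStart;
        for (int[] run : runs) {
            if (run[0] != cursor || run[1] <= run[0]) {
                return false; // gap, overlap, or empty run
            }
            cursor = run[1];
        }
        return cursor == paraEnd;
    }

    public static void main(String[] args) {
        // Offsets copied from the output in this message.
        int[][] firstRuns = {{0, 10}, {10, 11}};
        int[][] secondRuns = {{11, 17}};
        System.out.println("Paragraph 1 tiled: " + tilesParagraph(0, 11, firstRuns));
        System.out.println("Paragraph 2 tiled: " + tilesParagraph(11, 17, secondRuns));
    }
}
```

Both calls print true for this document, so the runs are contiguous.  The
second paragraph is covered by a single run that also includes the
paragraph mark.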

How would I get the character style info rather than the paragraph style
info?  I see in Paragraph that you can use getStyleIndex to get the
style name from the StyleSheet, but there is nothing like this for
CharacterRun.

This is the code I am using.  Maybe I am doing something incorrectly?

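import java.io.FileInputStream;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.model.StyleDescription;
import org.apache.poi.hwpf.model.StyleSheet;
import org.apache.poi.hwpf.usermodel.CharacterRun;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

// the statements below run inside a method that throws IOException;
// path is the location of the .doc file
POIFSFileSystem poifs = new POIFSFileSystem(new FileInputStream(path));
HWPFDocument wdDoc = new HWPFDocument(poifs);
StyleSheet styleSheet = wdDoc.getStyleSheet();

// set range for entire document
Range range = wdDoc.getRange();

// loop through all paragraphs in range
for (int i = 0; i < range.numParagraphs(); i++) {
	Paragraph p = range.getParagraph(i);
	System.out.println("Paragraph: \"" + p.text() + "\"");
	System.out.println("  Start: " + p.getStartOffset());
	System.out.println("  End: " + p.getEndOffset());

	// print the text and offsets of each character run in the paragraph
	for (int j = 0; j < p.numCharacterRuns(); j++) {
		CharacterRun cr = p.getCharacterRun(j);
		System.out.println("CharacterRun: \"" + cr.text() + "\"");
		System.out.println("  Start: " + cr.getStartOffset());
		System.out.println("  End: " + cr.getEndOffset());
	}

	// look up the style only if the index is within the number of styles
	if (styleSheet.numStyles() > p.getStyleIndex()) {
		System.out.println("Returned Style Index -> " + p.getStyleIndex());
		StyleDescription style = styleSheet.getStyleDescription(p.getStyleIndex());
		String styleName = style.getName();
		// write style name and associated text
		System.out.println(styleName + " -> " + p.text());
	} else {
		// index out of range: print the style count and the offending index
		System.out.println("\n" + styleSheet.numStyles() + " ----> " + p.getStyleIndex());
	}
}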

Scott


-----Original Message-----
From: Nick Burch [mailto:nick@apache.org] 
Sent: Friday, October 12, 2012 4:47 AM
To: user@poi.apache.org
Subject: Re: [HWPF] Get Style Name for Paragraph/Character from Word Doc
(.doc)

On 11/10/12 19:30, Polk, Scott W wrote:
> The style of the first line is set to Quote, while the style of the
> second line is set to Strong.

Is that the style of the paragraph, or just of some text? IIRC, you can 
style either a paragraph or some text in it (possibly all of it!), and 
they end up differently in the file.

It might be worth checking the start and end of the character runs 
within the paragraphs, to check what's happening. You might find you 
need to get character style info rather than paragraph style info.

Nick



Re: [HWPF] Get Style Name for Paragraph/Character from Word Doc (.doc)

Posted by Nick Burch <ni...@apache.org>.
On 11/10/12 19:30, Polk, Scott W wrote:
> The style of the first line is set to Quote, while the style of the
> second line is set to Strong.

Is that the style of the paragraph, or just of some text? IIRC, you can 
style either a paragraph or some text in it (possibly all of it!), and 
they end up differently in the file.

It might be worth checking the start and end of the character runs 
within the paragraphs, to check what's happening. You might find you 
need to get character style info rather than paragraph style info.
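
To make the distinction concrete, here is a minimal sketch using a
hypothetical in-memory style table rather than HWPF classes. It assumes
Strong was applied to the text as a character style (the thread does not
confirm this), in which case the paragraph itself would keep style index 0
(Normal), matching what was reported. The indices for Quote and Strong
are made up for illustration.

```java
// Hypothetical in-memory model, not HWPF classes: illustrates how a paragraph
// can report "Normal" while its text carries a separate character style.
class StyleLookupSketch {

    // Style names by index; only index 0 = Normal comes from the thread,
    // the other indices are invented for this sketch.
    static final String[] STYLE_NAMES = {"Normal", "Quote", "Strong"};

    // Returns the style name for an index, or null if the index is out of range.
    static String styleName(int index) {
        return (index >= 0 && index < STYLE_NAMES.length) ? STYLE_NAMES[index] : null;
    }

    public static void main(String[] args) {
        // Paragraph 1: Quote applied to the paragraph itself.
        int para1Style = 1;
        // Paragraph 2: Strong applied to the text only (assumption), so the
        // paragraph keeps index 0 and the run carries its own style index.
        int para2Style = 0;
        int para2RunStyle = 2;

        System.out.println("Paragraph 1 style: " + styleName(para1Style));
        System.out.println("Paragraph 2 style: " + styleName(para2Style));
        System.out.println("Paragraph 2 run style: " + styleName(para2RunStyle));
    }
}
```

Under that assumption, a lookup by paragraph style index alone would never
see Strong; the name would have to come from the run's own style index.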

Nick
