Posted to user@poi.apache.org by Branden Visser <mr...@gmail.com> on 2016/07/15 02:27:56 UTC

Finding and navigating headings

Hello,

I'm using Apache POI to extract heading (not header/footer) content
from a document. Essentially, I need to:

1. Determine if it has headings
2. Get the headings in order as they appear in the document (i.e.,
reading order)
3. Get the "heading levels" if possible

The point of this is to inspect if the document is set up in such a
way that it can be navigated using accessibility tools.

I've read a bit about how to do this on StackOverflow, but I was
wondering if there are more direct ways to get it than inspecting the
style names?

I've seen that, at least in the OLE2 document format, the paragraph
model has a getLvl method that gives its outline level. Is there an
"outline" model available that can be navigated to find the headings
(if any) in the document? Is that what the "Bookmarks" are, or am I
going down the wrong path?
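
To make the outline-level idea concrete, here is a minimal sketch of how
a raw outline level might be turned into a heading level. It assumes the
binary format's convention that levels 0 to 8 correspond to Heading 1 to
Heading 9 and that 9 means body text; that assumption should be checked
against the format spec. OutlineLevels and headingLevel are hypothetical
names, not POI APIs; the input would be the value returned by getLvl.

```java
import java.util.OptionalInt;

/** Hypothetical helper, not part of POI. */
final class OutlineLevels {
    private OutlineLevels() {}

    /**
     * Maps a raw outline level to a 1-based heading level.
     * Assumption: 0..8 mean Heading 1..9, and 9 means body text.
     * Anything outside 0..8 is treated as "not a heading".
     */
    static OptionalInt headingLevel(int outlineLevel) {
        if (outlineLevel >= 0 && outlineLevel <= 8) {
            return OptionalInt.of(outlineLevel + 1);
        }
        return OptionalInt.empty();
    }
}
```

A paragraph whose mapped level is empty would simply be skipped when
collecting headings.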

Note that I'll be writing an interface that can do this for both the
XML and OLE2 variations of this content so information for both would
be extra helpful.
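
As a rough sketch of what I mean by such an interface: the types below
are hypothetical and not part of POI. Each format (XML or OLE2) would
supply its own HeadingSource implementation, and everything else would
work against the shared contract.

```java
import java.util.List;

/** One heading in reading order; level 1 is the top level. Hypothetical type. */
final class Heading {
    final int level;
    final String text;

    Heading(int level, String text) {
        this.level = level;
        this.text = text;
    }
}

/** Hypothetical format-independent contract; one implementation per file format. */
interface HeadingSource {
    /** Headings in reading order; an empty list if the document has none. */
    List<Heading> headings();
}

final class HeadingOutline {
    private HeadingOutline() {}

    /** Requirement 1: does the document have any headings at all? */
    static boolean hasHeadings(HeadingSource source) {
        return !source.headings().isEmpty();
    }

    /** Requirements 2 and 3: one line per heading, indented two spaces per level. */
    static String toOutline(HeadingSource source) {
        StringBuilder out = new StringBuilder();
        for (Heading h : source.headings()) {
            for (int i = 1; i < h.level; i++) {
                out.append("  ");
            }
            out.append(h.text).append('\n');
        }
        return out.toString();
    }
}
```

Since HeadingSource has a single method, a test can stand in for a real
document with a lambda that returns a fixed list.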

If there is anything that is built into the underlying document
formats but just not exposed in the POI APIs, I'd certainly consider
having a look and contributing some additions if existing ways are
very unreliable.

Thanks,
Branden

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: Finding and navigating headings

Posted by Mark Murphy <jm...@gmail.com>.
In XWPF, nothing about a style marks it as a heading style. To understand
how a TOC decides what counts as a heading, see section 2.16.5.75 of the
spec (Part 4: Markup Language Reference, December 2006). You will see that
it generally uses the style names HeadingX, where X is a number from 1 to 9.
These are built-in style names. POI does not directly support them yet, but
you can still use them, because Word understands them even if they are not
included in the document.
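
A minimal sketch of the style-name check, assuming you already have the
paragraph's style as a string. HeadingStyles and levelOf are hypothetical
names, not POI APIs. Accepting a space and ignoring case (so "heading 1"
also matches) is an assumption about how the name may be stored, and
localized or custom style names will not match.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of POI. */
final class HeadingStyles {
    // "Heading" followed by a single digit 1-9, optional whitespace between,
    // case-insensitive. The whole string must match, so "Heading10" is rejected.
    private static final Pattern HEADING = Pattern.compile("(?i)heading\\s*([1-9])");

    private HeadingStyles() {}

    /** Returns the heading level 1..9, or empty if the style is not HeadingX. */
    static OptionalInt levelOf(String style) {
        if (style == null) {
            return OptionalInt.empty();
        }
        Matcher m = HEADING.matcher(style.trim());
        return m.matches()
                ? OptionalInt.of(Integer.parseInt(m.group(1)))
                : OptionalInt.empty();
    }
}
```

Walking the paragraphs in document order and keeping those for which
levelOf returns a value gives the headings in reading order.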

On Sat, Jul 16, 2016 at 3:36 PM, Dominik Stadler <do...@gmx.at>
wrote:

> Hi,
>
> It seems nobody can provide much information for this task. So your best
> option is likely to take a look at the spec documents listed at
> http://poi.apache.org/guidelines.html#GetInvolved and check if you can
> find anything useful.
>
> If you find something that could be used for this, we can guide you on
> how to make this information available to you via some POI low-level
> interfaces.
>
> Dominik.

Re: Finding and navigating headings

Posted by Dominik Stadler <do...@gmx.at>.
Hi,

It seems nobody can provide much information for this task. So your best
option is likely to take a look at the spec documents listed at
http://poi.apache.org/guidelines.html#GetInvolved and check if you can
find anything useful.

If you find something that could be used for this, we can guide you on
how to make this information available to you via some POI low-level
interfaces.

Dominik.
