Posted to user@poi.apache.org by Yury Batrakov <ba...@gmail.com> on 2008/04/08 11:58:46 UTC

XWPFDocument and relationships

Hi there!

I'm working on extending XWPFWordExtractor's functionality to support
extraction of hyperlinks. I can already extract the hyperlinks' text, but
I am unable to get their URLs. Exploring the OOXML document structure, I
found that URLs are stored as relationships in
/word/_rels/document.xml.rels, but I can't find the POI code that loads
them.
Does POI support such relationships?

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: XWPFDocument and relationships

Posted by Yury Batrakov <ba...@gmail.com>.
I only process docx for extraction, but it seems that openxml4j itself
is as far from an ideal text extraction tool as POI+openxml is :)

Thanks for your reply, I'll think about how best to solve my problem!

On 4/9/08, Nick Burch <ni...@torchbox.com> wrote:
> On Wed, 9 Apr 2008, Yury Batrakov wrote:
>  > I've just tried to wrap hyperlinks and comments in XWPF classes but
>  > stumbled on the same problem: in XWPF we have to deal with paragraphs,
>  > records, etc. to fetch (for example) hyperlink text, instead of using
>  > pretty methods such as CTWorksheet.getHyperlinks(). Could you give me a
>  > more convenient-to-read explanation (rather than build.xml) of how to
>  > implement such methods via xmlbeans?
>
>
> If you're doing serious processing of .docx files, and not just text
>  extraction, you might find docx4j a better fit for you. For now, with poi,
>  we're just concentrating on text extraction for .docx. It seems silly to
>  put lots of work into a full .docx implementation, when there's already
>  one in openxml4j, and another in docx4j!
>
>  The xmlbeans built jar is just a compiled version of the ooxml xsd schema
>  files. You'll need to read the ooxml specifications to figure out how it
>  all fits together :/
>
>
>  Nick


Re: XWPFDocument and relationships

Posted by Nick Burch <ni...@torchbox.com>.
On Wed, 9 Apr 2008, Yury Batrakov wrote:
> I've just tried to wrap hyperlinks and comments in XWPF classes but
> stumbled on the same problem: in XWPF we have to deal with paragraphs,
> records, etc. to fetch (for example) hyperlink text, instead of using
> pretty methods such as CTWorksheet.getHyperlinks(). Could you give me a
> more convenient-to-read explanation (rather than build.xml) of how to
> implement such methods via xmlbeans?

If you're doing serious processing of .docx files, and not just text
extraction, you might find docx4j a better fit for you. For now, with poi,
we're just concentrating on text extraction for .docx. It seems silly to
put lots of work into a full .docx implementation, when there's already
one in openxml4j, and another in docx4j!

The xmlbeans built jar is just a compiled version of the ooxml xsd schema
files. You'll need to read the ooxml specifications to figure out how it
all fits together :/

Nick



Re: XWPFDocument and relationships

Posted by Yury Batrakov <ba...@gmail.com>.
Hi Nick!

On 4/9/08, Nick Burch <ni...@torchbox.com> wrote:
> Thanks, I've committed something like that to the svn branch. There's an
>  issue with getting the hyperlinks to display at the right point in the
>  text, but that'll need some xmlbeans digging...

I've just tried to wrap hyperlinks and comments in XWPF classes but
stumbled on the same problem: in XWPF we have to deal with paragraphs,
records, etc. to fetch (for example) hyperlink text, instead of using
pretty methods such as CTWorksheet.getHyperlinks(). Could you give me a
more convenient-to-read explanation (rather than build.xml) of how to
implement such methods via xmlbeans?

> You need a newer version of openxml4j - try the one that ant should've
>  downloaded for you handily when you did "compile-ooxml" on a svn checkout
>  from yesterday or today!

I've just fixed this locally :)



Re: XWPFDocument and relationships

Posted by Yury Batrakov <ba...@gmail.com>.
Hi Nick,

I submitted a patch as issue 44821:
https://issues.apache.org/bugzilla/show_bug.cgi?id=44821
Could you give some kind of estimate of when it will (or will not)
appear in the ooxml branch?


On 4/11/08, Yury Batrakov <ba...@gmail.com> wrote:
>  BTW, I also wrapped paragraphs, hyperlinks, tables and comments in
>  classes and implemented paragraph parsing as follows:
>
>  Iterator<XWPFParagraph> i = document.getParagraphsIterator();
>  while (i.hasNext()) {
>      // Decorate each paragraph before reading its text
>      XMLParagraph par = new XWPFCommentsDecorator(
>              new XWPFHyperlinkDecorator(i.next()));
>      text.append(par.getText() + "\n");
>  }
>
>  I hope to submit patches on Monday
>



Re: XWPFDocument and relationships

Posted by Yury Batrakov <ba...@gmail.com>.
> There's an issue with getting the hyperlinks to display at the right
> point in the text

There is another issue as well: when a hyperlink is assigned to multiple
lines of text, Word stores it as a paragraph with an extra text run
containing the tag <w:instrText...>HYPERLINK "url"</w:instrText>. I'll try
to implement that too.
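
As a rough illustration of the field-code case above, the URL can be pulled
out of the instruction text with a regular expression. This is only a sketch
using the plain JDK: the class name FieldHyperlink and the method urlOf are
made up for this example and are not part of POI. It assumes the whole
instruction arrives as one string with the URL in double quotes right after
the HYPERLINK keyword; real instructions may carry switches or be split
across several runs, and neither case is handled here.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a POI class.
class FieldHyperlink {
    // Optional leading whitespace, the HYPERLINK keyword, whitespace,
    // then a double-quoted target captured as group 1.
    private static final Pattern HYPERLINK_FIELD =
            Pattern.compile("^\\s*HYPERLINK\\s+\"([^\"]*)\"");

    /** Returns the quoted URL of a HYPERLINK field instruction, if any. */
    static Optional<String> urlOf(String instrText) {
        Matcher m = HYPERLINK_FIELD.matcher(instrText);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Text as it might appear inside <w:instrText>...</w:instrText>
        System.out.println(urlOf(" HYPERLINK \"http://poi.apache.org/\" "));
        // A different field type yields an empty Optional
        System.out.println(urlOf(" PAGEREF _Toc1 \\h "));
    }
}
```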

BTW, I also wrapped paragraphs, hyperlinks, tables and comments in
classes and implemented paragraph parsing as follows:

Iterator<XWPFParagraph> i = document.getParagraphsIterator();
while (i.hasNext()) {
    // Decorate each paragraph before reading its text
    XMLParagraph par = new XWPFCommentsDecorator(
            new XWPFHyperlinkDecorator(i.next()));
    text.append(par.getText() + "\n");
}

I hope to submit patches on Monday
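
For readers who have not seen the patch, the nesting above follows the
decorator pattern: each wrapper delegates to the paragraph it wraps and then
appends its own extra text. The following is a minimal, self-contained sketch
of that idea only. All names in it (ParagraphDecorators, ParagraphText,
Plain, Hyperlinks, Comments) are hypothetical, and the output format is an
assumption; it does not reproduce the patch's classes.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the decorator pattern; not the POI classes.
class ParagraphDecorators {
    interface ParagraphText {
        String getText();
    }

    /** The undecorated paragraph: just its own text. */
    static class Plain implements ParagraphText {
        private final String text;
        Plain(String text) { this.text = text; }
        public String getText() { return text; }
    }

    /** Appends each hyperlink URL after the wrapped paragraph's text. */
    static class Hyperlinks implements ParagraphText {
        private final ParagraphText inner;
        private final List<String> urls;
        Hyperlinks(ParagraphText inner, List<String> urls) {
            this.inner = inner;
            this.urls = urls;
        }
        public String getText() {
            StringBuilder sb = new StringBuilder(inner.getText());
            for (String url : urls) {
                sb.append(" <").append(url).append(">");
            }
            return sb.toString();
        }
    }

    /** Appends each comment after the wrapped paragraph's text. */
    static class Comments implements ParagraphText {
        private final ParagraphText inner;
        private final List<String> comments;
        Comments(ParagraphText inner, List<String> comments) {
            this.inner = inner;
            this.comments = comments;
        }
        public String getText() {
            StringBuilder sb = new StringBuilder(inner.getText());
            for (String comment : comments) {
                sb.append(" [").append(comment).append("]");
            }
            return sb.toString();
        }
    }

    public static void main(String[] args) {
        // Same nesting order as in the loop above: comments wrap hyperlinks.
        ParagraphText par = new Comments(
                new Hyperlinks(new Plain("See the POI site"),
                        Arrays.asList("http://poi.apache.org/")),
                Arrays.asList("check link"));
        System.out.println(par.getText());
    }
}
```

Because every wrapper implements the same interface, the extractor loop only
ever calls getText() and does not need to know which decorators are applied.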



Re: XWPFDocument and relationships

Posted by Nick Burch <ni...@torchbox.com>.
On Tue, 8 Apr 2008, Yury Batrakov wrote:
> Looks like I am done with that. I'm not sure if my code is useful and
> good enough to submit to the POI source tree :). It still uses low-level
> stuff and needs to be wrapped in something like XWPFHyperlink. I attach
> my patch here for those who may be interested in it.

Thanks, I've committed something like that to the svn branch. There's an
issue with getting the hyperlinks to display at the right point in the
text, but that'll need some xmlbeans digging...

> Also, there is a bug in the latest openxml4j snapshot that prevents my
> code from working :). I reported it on the openxml4j dev forum here:
> https://sourceforge.net/forum/forum.php?thread_id=2000813&forum_id=603903

You need a newer version of openxml4j - try the one that ant should've
downloaded for you handily when you did "compile-ooxml" on a svn checkout
from yesterday or today!

Nick



Re: XWPFDocument and relationships

Posted by Yury Batrakov <ba...@gmail.com>.
On 4/8/08, Yury Batrakov <ba...@gmail.com> wrote:
> Thanks for your reply, I'll look at the code.
>

Looks like I am done with that. I'm not sure if my code is useful and
good enough to submit to the POI source tree :). It still uses
low-level stuff and needs to be wrapped in something like
XWPFHyperlink. I attach my patch here for those who may be interested
in it.

Also, there is a bug in the latest openxml4j snapshot that prevents my
code from working :). I reported it on the openxml4j dev forum here:
https://sourceforge.net/forum/forum.php?thread_id=2000813&forum_id=603903

Re: XWPFDocument and relationships

Posted by Nick Burch <ni...@torchbox.com>.
On Tue, 8 Apr 2008, Yury Batrakov wrote:
> On 4/8/08, Nick Burch <ni...@torchbox.com> wrote:
> >  You'll probably want to take a look at how we do it for excel, as there's
> >  now support for ooxml excel hyperlinks. See:
> >   src/ooxml/java/org/apache/poi/xssf/usermodel/XSSFHyperlink.java
> >   src/ooxml/java/org/apache/poi/xssf/usermodel/XSSFSheet.java
>
> Thanks for your reply, I'll look at the code.
>
> BTW: XSSFHyperlink.java can't be compiled because of the missing class
> org.apache.poi.ss.usermodel.Hyperlink

If you use ant, it should build it for you. Otherwise, make sure you have
src/ooxml/interfaces-jdk15 in your buildpath

Nick



Re: XWPFDocument and relationships

Posted by Yury Batrakov <ba...@gmail.com>.
On 4/8/08, Nick Burch <ni...@torchbox.com> wrote:
>  You'll probably want to take a look at how we do it for excel, as there's
>  now support for ooxml excel hyperlinks. See:
>   src/ooxml/java/org/apache/poi/xssf/usermodel/XSSFHyperlink.java
>   src/ooxml/java/org/apache/poi/xssf/usermodel/XSSFSheet.java

Thanks for your reply, I'll look at the code.

BTW: XSSFHyperlink.java can't be compiled because of the missing class
org.apache.poi.ss.usermodel.Hyperlink. How can I get it without
installing the telepathy4j library? :)



Re: XWPFDocument and relationships

Posted by Nick Burch <ni...@torchbox.com>.
On Tue, 8 Apr 2008, Yury Batrakov wrote:
> I'm working on extending XWPFWordExtractor's functionality to support
> extraction of hyperlinks.

Great news :) Please upload the diff to bugzilla once you're done, so
everyone can benefit from your improvements

> I am already done with hyperlinks' text, but unable to get their URLs.
> Exploring OOXML document structure, I found that URLs are stored as
> relationships in /word/_rels/document.xml.rels but I can't find POI code
> that loads them.

It's all handled via openxml4j - once you have the PackagePart for the
word document, you want to fetch all the relationships with the
relationship type of "hyperlink for word document" (check in the _rels
file to see what it actually is). From that PackageRelationshipCollection,
you can fetch out individual relationships by their id (it'll be r:id in
the word doc). The hyperlink's target is the target on the package
relationship.
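
To make the lookup described above concrete, here is a minimal sketch of the
same idea using only the plain JDK XML parser. It deliberately does not use
the openxml4j API. The class name HyperlinkRels and the method
hyperlinkTargets are made up for this example. It reads the XML of a .rels
part, keeps only the relationships whose Type is the hyperlink relationship
type, and maps each relationship Id (the r:id used in the document) to its
Target.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper built on the JDK only; not part of POI or openxml4j.
class HyperlinkRels {
    // Namespace of the elements inside a .rels part.
    static final String RELS_NS =
            "http://schemas.openxmlformats.org/package/2006/relationships";
    // Relationship type used for hyperlinks.
    static final String HYPERLINK_TYPE =
            "http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink";

    /** Maps relationship Id to Target for every hyperlink relationship. */
    static Map<String, String> hyperlinkTargets(InputStream relsXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(relsXml);
            NodeList nodes = doc.getElementsByTagNameNS(RELS_NS, "Relationship");
            Map<String, String> targets = new LinkedHashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element rel = (Element) nodes.item(i);
                // Skip styles, fonts, images and other relationship types.
                if (HYPERLINK_TYPE.equals(rel.getAttribute("Type"))) {
                    targets.put(rel.getAttribute("Id"), rel.getAttribute("Target"));
                }
            }
            return targets;
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a readable .rels part", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample content in the shape of word/_rels/document.xml.rels
        String xml = "<Relationships xmlns=\"" + RELS_NS + "\">"
                + "<Relationship Id=\"rId1\" Type=\"http://schemas.openxmlformats.org/"
                + "officeDocument/2006/relationships/styles\" Target=\"styles.xml\"/>"
                + "<Relationship Id=\"rId2\" Type=\"" + HYPERLINK_TYPE + "\""
                + " Target=\"http://poi.apache.org/\" TargetMode=\"External\"/>"
                + "</Relationships>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(hyperlinkTargets(in));
    }
}
```

For a real file, the stream could come from java.util.zip.ZipFile, by opening
the entry word/_rels/document.xml.rels inside the .docx.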

You'll probably want to take a look at how we do it for excel, as there's
now support for ooxml excel hyperlinks. See:
  src/ooxml/java/org/apache/poi/xssf/usermodel/XSSFHyperlink.java
  src/ooxml/java/org/apache/poi/xssf/usermodel/XSSFSheet.java

Nick
