Posted to user@tika.apache.org by Mike Dalrymple <mi...@mousedown.com> on 2020/01/03 17:04:33 UTC

Setting PDF2XHTML img src

Hello,

I've just started using Tika to process PDFs with embedded images.  I'm
getting fantastic results but I'm having to post-process the generated
XHTML to correct the value of the src attribute on the img elements.  The
generated XHTML has elements like:

<img src="embedded:image1.jpg" alt="image1.jpg" />

My EmbeddedDocumentExtractor is saving image1.jpg in the same directory as
the generated XHTML.  Looking in PDF2XHTML.java, it appears that the img
element is written with a hard-coded src of:  "embedded:" + fileName

My questions are:

   1. Is there any significance to the word "embedded"?   I can't find any
   reference to "embedded" in XHTML img elements.  I was thinking that it
   might indicate there's a base64 encoded object in the page but that does
   not appear to be the case.
   2. Is there a pattern for overriding the embedded img src value?  I see
   that "parseEmbedded" is called with outputHtml=false.   Would there be a
   way to have parseEmbedded return the img element if that were set to true?

Any direction would be greatly appreciated.  I'm currently just passing the
generated XHTML through a regex that converts the src attributes, and that
works fine; it just feels like there may be a more idiomatic way that I'm
not seeing.

Cheers,
Mike

Re: Setting PDF2XHTML img src

Posted by Mike Dalrymple <mi...@mousedown.com>.

This makes sense and I think that ContentHandlerDecorator in your code may
actually help me improve my processing elsewhere.

Thank you for the detailed reply, it's appreciated.

Mike

On Fri, Jan 3, 2020 at 9:17 AM Nick Burch <ap...@gagravarr.org> wrote:

> On Fri, 3 Jan 2020, Mike Dalrymple wrote:
> > I've just started using Tika to process PDFs with embedded images.  I'm
> > getting fantastic results but I'm having to post-process the generated
> > XHTML to correct the value of the src attribute on the img elements.
>
> That is expected. A simple SAX handler should let you do that, re-writing
> the src to point at where you're saving the images.
>
> > The generated XHTML has elements like:
> >
> > <img src="embedded:image1.jpg" alt="image1.jpg" />
>
> The embedded prefix is Tika's way of letting you know there was an
> embedded image there, and what name it would have if you extracted it
> (which you may not have done).
>
> The idea is that, for the extract+display case, you re-write it to match
> where you stored the image. For other cases, you know it was an embedded
> image rather than an external reference
>
> > Any direction would be greatly appreciated.  I'm currently just passing
> > the generated XHTML through a regex that converts the src attributes and
> > that works fine, it just feels like there may be a more idiomatic way
> > that I'm not seeing.
>
> Several jobs ago, I wrote some code to do this for Alfresco:
>
> https://github.com/alfresco-mirror/alfresco-mirror/blob/b3d815063d3634d4bde83b4a214db62215a490fd/root/projects/repository/source/java/org/alfresco/repo/rendition/executer/HTMLRenderingEngine.java#L490
>
> The idea is it re-writes just the embedded image links to point to a
> specific folder path or prefix where the embedded images were written,
> while leaving all other (external) images alone
>
> Nick
>

Re: Setting PDF2XHTML img src

Posted by Nick Burch <ap...@gagravarr.org>.

On Fri, 3 Jan 2020, Mike Dalrymple wrote:
> I've just started using Tika to process PDFs with embedded images.  I'm
> getting fantastic results but I'm having to post-process the generated
> XHTML to correct the value of the src attribute on the img elements.

That is expected. A simple SAX handler should let you do that, re-writing
the src to point at where you're saving the images.

> The generated XHTML has elements like:
>
> <img src="embedded:image1.jpg" alt="image1.jpg" />

The embedded prefix is Tika's way of letting you know there was an
embedded image there, and what name it would have if you extracted it
(which you may not have done).

The idea is that, for the extract+display case, you re-write it to match
where you stored the image. For other cases, you know it was an embedded
image rather than an external reference.

> Any direction would be greatly appreciated.  I'm currently just passing 
> the generated XHTML through a regex that converts the src attributes and 
> that works fine, it just feels like there may be a more idiomatic way 
> that I'm not seeing.

Several jobs ago, I wrote some code to do this for Alfresco:
https://github.com/alfresco-mirror/alfresco-mirror/blob/b3d815063d3634d4bde83b4a214db62215a490fd/root/projects/repository/source/java/org/alfresco/repo/rendition/executer/HTMLRenderingEngine.java#L490

The idea is that it re-writes just the embedded image links to point to a
specific folder path or prefix where the embedded images were written,
while leaving all other (external) images alone.

Nick
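
For readers who want something concrete, here is a minimal sketch of the
SAX re-writing approach described in this thread. It is not the Alfresco
code linked above. The class name EmbeddedImageSrcRewriter, the helper
rewriteSrc and the imagePathPrefix parameter are all hypothetical names
chosen for illustration. To keep it runnable without Tika on the
classpath it extends the JDK's XMLFilterImpl; in a Tika project the same
startElement override would presumably sit in a subclass of Tika's
ContentHandlerDecorator instead.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical class: forwards every SAX event to the downstream handler,
// changing only the src attribute of img elements that use "embedded:".
class EmbeddedImageSrcRewriter extends XMLFilterImpl {
    private static final String EMBEDDED_PREFIX = "embedded:";

    // Assumption: the extracted images were saved under this path or URL
    // prefix, for example "images/" (or "" for the same directory).
    private final String imagePathPrefix;

    EmbeddedImageSrcRewriter(ContentHandler downstream, String imagePathPrefix) {
        setContentHandler(downstream);
        this.imagePathPrefix = imagePathPrefix;
    }

    // Returns the re-written src, or the input unchanged when it is not an
    // embedded reference, so external image links are left alone.
    static String rewriteSrc(String src, String imagePathPrefix) {
        if (src == null || !src.startsWith(EMBEDDED_PREFIX)) {
            return src;
        }
        return imagePathPrefix + src.substring(EMBEDDED_PREFIX.length());
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if ("img".equals(localName) || "img".equals(qName)) {
            int srcIndex = atts.getIndex("src"); // lookup by qualified name
            if (srcIndex >= 0) {
                String oldSrc = atts.getValue(srcIndex);
                String newSrc = rewriteSrc(oldSrc, imagePathPrefix);
                if (newSrc != null && !newSrc.equals(oldSrc)) {
                    // Attributes is read-only, so copy before changing the value.
                    AttributesImpl copy = new AttributesImpl(atts);
                    copy.setValue(srcIndex, newSrc);
                    atts = copy;
                }
            }
        }
        // Pass the (possibly modified) event on to the wrapped handler.
        super.startElement(uri, localName, qName, atts);
    }
}
```

To use it, wrap whichever ContentHandler currently receives the XHTML
events and hand the wrapper to the parser in its place. Unlike a regex
over the serialized output, this only ever touches the src attribute of
img elements, and only when the value starts with "embedded:".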