Posted to dev@tika.apache.org by sgraessle <st...@gmail.com> on 2011/05/31 18:01:10 UTC

Image Extraction

1. Can anyone point me in the direction of where I should look within Tika
to modify/create code to not only extract the metadata for an image but also
extract its relative position in a document (for example, between word A
and word B) and then save this information?

2. I need to be able to extract the images within the parsed documents and
save them as well. Would the best place to do this be to create my own
ImageParser and add a few lines in the parse method?

Thank you for your time,

- Stephen

--
View this message in context: http://lucene.472066.n3.nabble.com/Image-Extraction-tp3006668p3006668.html
Sent from the Apache Tika - Development mailing list archive at Nabble.com.

Re: Image Extraction

Posted by sgraessle <st...@gmail.com>.
Max,

Thanks for the information. I seem to be unable to locate information on the
usage of the '-z' switch. Would you have a link to a page I can read up on?
I am having a hard time finding whether and where the content is being
extracted to.

Thanks!

Stephen

Re: Image Extraction

Posted by Maxim Valyanskiy <ma...@jet.msk.su>.
Hello!

On 31.05.2011 20:01, sgraessle wrote:
> 2. I need to be able to extract the images within the parsed documents and
> save them as well. Would the best place to do this be to create my own
> ImageParser and add a few lines in the parse method?
>
The Tika command-line application, run with the '-z' switch, can extract
images (and other attachments) to separate files. Take a look at its
implementation.

Best wishes, Max

Re: Image Extraction

Posted by sgraessle <st...@gmail.com>.
Jukka,

I am having trouble following the flow of the streams to the
ImageSavingParser class... Sorry to be such a nuisance, but my boss wants
each image to be extracted from Word documents and its relative position
saved (so you can go back later and see roughly where it fits in). Is there
an existing property that I can simply use, or something else that might
have this information?

Also, how does the image actually get written to a file? I have not been
able to trace that down either.

Sorry and thanks again!

- Stephen

Re: Image Extraction

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Thu, Jun 2, 2011 at 5:56 PM, sgraessle <st...@gmail.com> wrote:
> Thanks for the quick response... Apparently I made a few mistakes in testing
> it. One quick question for you: I was asked to store a general location for
> where an image was in the file (i.e. line #5). Could you point me in the
> direction of the code that is used to extract the images so I can include a
> small routine of my own?

See the getHtmlHandler method in the TikaGUI class. You might also
want to check out the ImageDocumentSelector and ImageSavingParser
classes inside TikaGUI for some other related code.

BR,

Jukka Zitting

Re: Image Extraction

Posted by sgraessle <st...@gmail.com>.
Jukka, 

Thanks for the quick response... Apparently I made a few mistakes in testing
it. One quick question for you: I was asked to store a general location for
where an image was in the file (i.e. line #5). Could you point me in the
direction of the code that is used to extract the images so I can include a
small routine of my own?

Thank you so much!

Re: Image Extraction

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Thu, Jun 2, 2011 at 5:00 PM, sgraessle <st...@gmail.com> wrote:
> I went ahead and tried to piece together what I needed to do to test Tika
> with the code provided above.
> [...]
> All I really need is to use an ImageParser that will save the embedded images
> to some arbitrary directory in addition to parsing the files... is there
> some other package that I should use to perform this extraction before I
> parse the files with Tika?

It looks like you're going down a much more complicated path than you need to.

As Maxim noted, see the TikaCLI class and the
FileEmbeddedDocumentExtractor one inside it for an example of how the
"--extract" option of the CLI works under the hood. That should be
pretty much similar to what you're trying to achieve. No need to
implement your own parser classes, etc.
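
A minimal sketch of the file-writing step, using only the JDK. This is
not the code of FileEmbeddedDocumentExtractor; the class name
EmbeddedFileSaver and its methods save and safeName are hypothetical,
chosen for illustration. The idea is that whatever component hands over
the embedded document's stream copies it into a target directory under a
sanitized name:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: writes embedded resources (e.g. images) into one directory. */
class EmbeddedFileSaver {
    private final Path outputDir;
    private int counter = 0; // number of streams saved so far

    EmbeddedFileSaver(Path outputDir) {
        this.outputDir = outputDir;
    }

    /** Keeps only the last path segment so "../x.png" cannot leave the output directory. */
    static String safeName(String suggested, int index) {
        String name = suggested == null ? "" : suggested.trim();
        int slash = Math.max(name.lastIndexOf('/'), name.lastIndexOf('\\'));
        name = name.substring(slash + 1);
        if (name.isEmpty() || name.equals(".") || name.equals("..")) {
            name = "file" + index; // fallback when no usable name is known
        }
        return name;
    }

    /** Copies the stream into outputDir and returns the path of the file written. */
    Path save(InputStream stream, String suggestedName) {
        counter++;
        try {
            Files.createDirectories(outputDir);
            Path target = outputDir.resolve(safeName(suggestedName, counter));
            Files.copy(stream, target, StandardCopyOption.REPLACE_EXISTING);
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dir = Paths.get(System.getProperty("java.io.tmpdir"), "embedded-demo");
        EmbeddedFileSaver saver = new EmbeddedFileSaver(dir);
        Path written = saver.save(new ByteArrayInputStream(new byte[] {1, 2, 3}), "image1.png");
        System.out.println("Wrote " + written);
    }
}
```

In a Tika integration, a helper like this would presumably be called from
the embedded document extractor with the stream of each embedded resource,
but that wiring is an assumption and is not shown here.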

BR,

Jukka Zitting

Re: Image Extraction

Posted by sgraessle <st...@gmail.com>.
I went ahead and tried to piece together what I needed to do to test Tika
with the code provided above. 


tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
One of the linked references suggested modifying the file above. Do I need
to? I don't believe it is necessary, because the file only contains
definitions of MIME types and the image types are already defined. Does
that seem correct?

tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
I went ahead and copied the HTMLRenderingEngine.java file into
tika-parsers/src/main/java/org/apache/tika/parser/image/ with the same name,
HTMLRenderingEngine.java. Then I opened the services file listed above and
added a line with the following contents:
org.apache.tika.image.TikaImageExtractingParser
I rebuilt the project, packaged it, and ran it to see whether the new
functionality worked. It ran but did nothing new. I am sorry for all the
basic questions, but what am I missing?

All I really need is to use an ImageParser that will save the embedded images
to some arbitrary directory in addition to parsing the files... is there
some other package that I should use to perform this extraction before I
parse the files with Tika?


Re: Image Extraction

Posted by Nick Burch <ni...@alfresco.com>.
On Wed, 1 Jun 2011, sgraessle wrote:
> If I wanted to go about making the code you provided the default parser 
> for all images, where in the Tika framework would I need to change 
> things?

Your best bet is probably to take a look at the comments on these two 
bugs:
   https://issues.apache.org/jira/browse/TIKA-527
   https://issues.apache.org/jira/browse/TIKA-288

And this recent discussion:
   http://comments.gmane.org/gmane.comp.apache.tika.user/684

Nick

Re: Image Extraction

Posted by sgraessle <st...@gmail.com>.
Sorry, ignore that last question Nick. I answered my own question.

If I wanted to go about making the code you provided the default parser for
all images, where in the Tika framework would I need to change things?

Thanks!


Re: Image Extraction

Posted by Nick Burch <ni...@alfresco.com>.
On Tue, 31 May 2011, sgraessle wrote:
> Could you provide more insight as to where the temporary html files are 
> located?

Which temporary html files?

Nick

Re: Image Extraction

Posted by sgraessle <st...@gmail.com>.
Nick,

Thanks for the quick response. I am extremely new to the Tika project, and
my boss is asking me to integrate it into his project. Could you provide
more insight as to where the temporary HTML files are located?

I am working through the code you provided.

Thank you for your time!

- Stephen


Re: Image Extraction

Posted by Nick Burch <ni...@alfresco.com>.
On Thu, 9 Jun 2011, sgraessle wrote:
> I am looking at integrating the Alfresco code in order to handle the images.
> What would be the most efficient way to do so? I went ahead and downloaded
> the entire Alfresco project, but I don't need all of it, only the HTML
> rendering capability, and I would like to eliminate the extra code.

Just use the code from the class I pointed you at; it's largely standalone.
Also be aware of the license: Alfresco is not under the same license as
Tika (it's LGPL instead of ASL).

You may also want to just crib off/use the Tika app code for embedded
document extraction. That's already built into Tika and does everything
you need.

Nick

Re: Image Extraction

Posted by sgraessle <st...@gmail.com>.
Nick,

I am looking at integrating the Alfresco code in order to handle the images.
What would be the most efficient way to do so? I went ahead and downloaded
the entire Alfresco project, but I don't need all of it, only the HTML
rendering capability, and I would like to eliminate the extra code.
Thoughts? Suggestions?

Thanks!



Re: Image Extraction

Posted by Nick Burch <ni...@alfresco.com>.
On Tue, 31 May 2011, sgraessle wrote:
> 1. Can anyone point me in the direction of where I should look within
> Tika to modify/create code to not only extract the metadata for an image
> but also extract its relative position in a document (for example,
> between word A and word B) and then save this information?

You'll need to look at the HTML version of the parent file, and watch for
the img tags.
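
A rough sketch of what "watching the img tags" could look like, using only
the JDK's SAX classes. It assumes the document's HTML (XHTML) output
arrives as SAX events and that embedded images show up as img elements,
which may depend on the Tika version and document type. The class name
ImgPositionHandler and the sample markup are hypothetical. For each img
tag it records how much text came before it, how many paragraphs had
already ended, and the last word seen:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: notes where each img element sits relative to the text. */
class ImgPositionHandler extends DefaultHandler {

    /** One image reference and its approximate location. */
    static final class ImgPosition {
        final String src;        // value of the src attribute, may be null
        final int charOffset;    // characters of text collected before the tag
        final int paragraph;     // number of p elements closed before the tag
        final String wordBefore; // last word before the tag, "" if none

        ImgPosition(String src, int charOffset, int paragraph, String wordBefore) {
            this.src = src;
            this.charOffset = charOffset;
            this.paragraph = paragraph;
            this.wordBefore = wordBefore;
        }
    }

    private final StringBuilder text = new StringBuilder();
    private final List<ImgPosition> positions = new ArrayList<>();
    private int paragraphsClosed = 0;

    // localName is empty when the parser is not namespace-aware, so fall back to qName
    private static String name(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("img".equalsIgnoreCase(name(localName, qName))) {
            positions.add(new ImgPosition(
                    atts.getValue("src"), text.length(), paragraphsClosed, lastWord(text)));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("p".equalsIgnoreCase(name(localName, qName))) {
            text.append('\n'); // keep the words of adjacent paragraphs apart
            paragraphsClosed++;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    private static String lastWord(CharSequence s) {
        String[] words = s.toString().trim().split("\\s+");
        return words[words.length - 1]; // "" when no text has been seen yet
    }

    List<ImgPosition> positions() {
        return positions;
    }

    /** Parses a well-formed XHTML string and returns the image positions found. */
    static List<ImgPosition> scan(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            ImgPositionHandler handler = new ImgPositionHandler();
            factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.positions();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
    }

    public static void main(String[] args) {
        String sample = "<html><body><p>alpha beta <img src=\"image1.png\"/> gamma</p></body></html>";
        for (ImgPosition p : scan(sample)) {
            System.out.println(p.src + " after '" + p.wordBefore + "' at offset " + p.charOffset);
        }
    }
}
```

Because the handler only consumes SAX events, the same class could in
principle be passed as the content handler of a parse call instead of
being fed a string; the scan method exists only to keep the sketch
self-contained.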

> 2. I need to be able to extract the images within the parsed documents
> and save them as well. Would the best place to do this be to create my
> own ImageParser and add a few lines in the parse method?

You'll want your own parser, registered for the image types, and then add
that to the parse context.

You may find this class from Alfresco worth a look:
    http://svn.alfresco.com/repos/alfresco-open-mirror/alfresco/HEAD/root/projects/repository/source/java/org/alfresco/repo/rendition/executer/HTMLRenderingEngine.java
It handles saving embedded images out, and tweaking the <img> tags for 
them

Nick