Posted to user@tika.apache.org by Markus Jelsma <ma...@openindex.io> on 2012/02/07 16:58:48 UTC

Using BP ImageExtractor

Hi,

For Apache Nutch we'd like to see if we can use Boilerpipe to extract a 
meaningful image for a given document. The BP API provides a method to return 
a set of images for a given TextDocument object and an extractor.

Tika does not return a TextDocument object after parsing, so it seems I 
cannot use the API with Tika as-is.

Right now Nutch is about to use the TeeContentHandler to retrieve the 
hyperlinks of the whole document plus the content parsed by Boilerpipe (this 
will be committed when we upgrade to Tika 1.1). Is there an easy way to use 
that ImageExtractor with Tika? If so, how? If not, what can we do?
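
To make the tee idea concrete, here is a minimal sketch of one parse pass 
feeding two collectors, one for hyperlinks and one for image sources. It uses 
only JDK SAX classes: the Tee and AttrCollector classes are hypothetical 
stand-ins written for this illustration, not Tika's TeeContentHandler or 
Boilerpipe's ImageExtractor. It also assumes well-formed XHTML input and 
collects every img src, with none of Boilerpipe's content filtering.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class TeeSketch {

    /** Hypothetical helper: records one attribute of every element with a given name. */
    static class AttrCollector extends DefaultHandler {
        private final String element;
        private final String attribute;
        final List<String> values = new ArrayList<>();

        AttrCollector(String element, String attribute) {
            this.element = element;
            this.attribute = attribute;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // The default JDK parser is not namespace-aware, so match on qName.
            if (element.equalsIgnoreCase(qName)) {
                String value = atts.getValue(attribute);
                if (value != null) {
                    values.add(value);
                }
            }
        }
    }

    /** Hypothetical stand-in for a tee: forwards startElement events to every target. */
    static class Tee extends DefaultHandler {
        private final DefaultHandler[] targets;

        Tee(DefaultHandler... targets) {
            this.targets = targets;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            for (DefaultHandler target : targets) {
                target.startElement(uri, localName, qName, atts);
            }
        }
    }

    /** Parses the document once and returns two lists: hyperlinks, then image sources. */
    public static List<List<String>> linksAndImages(String xhtml) throws Exception {
        AttrCollector links = new AttrCollector("a", "href");
        AttrCollector images = new AttrCollector("img", "src");
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), new Tee(links, images));
        return Arrays.asList(links.values, images.values);
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><a href=\"/about\">About</a>"
                + "<img src=\"/logo.png\"/><img src=\"/photo.jpg\"/></body></html>";
        List<List<String>> result = linksAndImages(page);
        System.out.println("links:  " + result.get(0));
        System.out.println("images: " + result.get(1));
    }
}
```

A collector like this returns all images in the document, so picking a 
meaningful one would still need the kind of content-block information that 
Boilerpipe keeps in its TextDocument.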

Thanks

Re: Using BP ImageExtractor

Posted by Markus Jelsma <ma...@openindex.io>.

Hi,

We build both BP and Tika from trunk for use in Nutch. However, I am unsure 
how to use BP's ImageExtractor with Tika's APIs. BP's API asks for a 
TextDocument object, which we don't have. Is there another API I am unaware of 
that we can use with Tika's TeeContentHandler? Or do you happen to have an 
example of this?
We can successfully extract images with BP standalone, but we need to do it 
with Tika.

Thanks

> I've used BP for this purpose.
> You need to build from trunk.
> 
> --
> Dotan, @jondot <http://twitter.com/jondot>

Re: Using BP ImageExtractor

Posted by "Dotan N." <di...@gmail.com>.

I've used BP for this purpose.
You need to build from trunk.

--
Dotan, @jondot <http://twitter.com/jondot>


