Posted to user@tika.apache.org by Markus Jelsma <ma...@openindex.io> on 2012/02/07 16:58:48 UTC
Using BP ImageExtractor
Hi,
For Apache Nutch we'd like to see if we can use Boilerpipe to extract a
meaningful image for a given document. The BP API provides a method to return
a set of images for a given TextDocument object and an extractor.
Tika does not return a TextDocument object after parsing, so it seems I
cannot use the API with Tika as-is.
Right now Nutch is about to use the TeeContentHandler to retrieve the
hyperlinks of the whole document plus the content parsed by Boilerpipe (this
will be committed when we upgrade to Tika 1.1). Is there an easy way to use
the ImageExtractor with Tika? If so, how? If not, what can we do?
Thanks
Re: Using BP ImageExtractor
Posted by Markus Jelsma <ma...@openindex.io>.
Hi,
We build both BP and Tika from trunk for use in Nutch. However, I am unsure
how to use BP's ImageExtractor with Tika's APIs. BP's API asks for a
TextDocument object, which we don't have. Is there another API I am unaware of
that we can use with Tika's TeeContentHandler? Or do you happen to have an
example of this?
We can successfully extract images with BP standalone, but we need to do it
with Tika.
Thanks
Re: Using BP ImageExtractor
Posted by "Dotan N." <di...@gmail.com>.
I've used BP for this purpose.
You need to build from trunk.
--
Dotan, @jondot <http://twitter.com/jondot>
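The thread asks for an example of collecting images alongside other handlers in a tee arrangement. The sketch below is only a stand-in under stated assumptions: it does not use Boilerpipe's ImageExtractor and needs no TextDocument. It uses only the JDK's SAX classes to show one handler gathering img src values while a second handler receives the same events. The names ImageTeeSketch, ImageCollector, TextCollector and Tee are hypothetical. In a Tika setup the collector could presumably be passed as one of the handlers given to TeeContentHandler, assuming Tika's HTML parser passes img elements through, which should be verified. Note that it returns every image, not a single "meaningful" one.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: JDK-only stand-in for a tee of SAX handlers.
class ImageTeeSketch {

    /** Collects the src attribute of every img element it sees. */
    static class ImageCollector extends DefaultHandler {
        final List<String> sources = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) {
            // Works for both namespace-aware and non-namespace-aware parsers.
            String name = localName.isEmpty() ? qName : localName;
            if ("img".equalsIgnoreCase(name)) {
                String src = atts.getValue("src");
                if (src != null && !src.isEmpty()) {
                    sources.add(src);
                }
            }
        }
    }

    /** Collects character data, standing in for the text-extraction branch. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Minimal tee: forwards the three event types used here to each delegate. */
    static class Tee extends DefaultHandler {
        private final DefaultHandler[] delegates;

        Tee(DefaultHandler... delegates) {
            this.delegates = delegates;
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) throws SAXException {
            for (DefaultHandler d : delegates) {
                d.startElement(uri, localName, qName, atts);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName)
                throws SAXException {
            for (DefaultHandler d : delegates) {
                d.endElement(uri, localName, qName);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            for (DefaultHandler d : delegates) {
                d.characters(ch, start, length);
            }
        }
    }

    /** Parses well-formed XHTML once, feeding both collectors through the tee. */
    static ImageCollector parse(String xhtml, TextCollector text) throws Exception {
        ImageCollector images = new ImageCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), new Tee(images, text));
        return images;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><p>Hello</p>"
                + "<img src=\"a.png\"/><img src=\"b.jpg\"/></body></html>";
        TextCollector text = new TextCollector();
        ImageCollector images = parse(xhtml, text);
        System.out.println(images.sources);
        System.out.println(text.text);
    }
}
```

Choosing one meaningful image from the collected list (for example by declared size or position) would still need a heuristic of its own; this sketch only shows that image URLs can be gathered in the same parsing pass as the other handlers.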