Posted to user@nutch.apache.org by Haya AL-Tuwaijri <ha...@hotmail.com> on 2012/02/18 06:49:43 UTC

Tika with nutch

Hi all,

I'm developing a Nutch plug-in that implements HtmlParseFilter, and I want to use the Tika toolkit to convert the web page to plain text for further processing.
I understand that Tika has been integrated with Nutch since version 1.1, so I didn't download anything and just started coding.

I found that BodyContentHandler might help, so I used this code:

//=======
//import packages:

import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.Parser;
import org.apache.tika.io.TikaInputStream;

//=====


public ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)
{
    Metadata metadata = new Metadata();
    BodyContentHandler texthandler = new BodyContentHandler();
    Parser parser = new AutoDetectParser();
    InputStream in = TikaInputStream.get(content.getContent());
    parser.parse(in, texthandler, metadata, new ParseContext());
    LOG.info("Content: " + texthandler.toString());
    LOG.info("is Empty? " + texthandler.toString().isEmpty());
}

Now the content is always empty, and isEmpty() returns true every time!

I don't know why. I've searched a lot, but resources are scarce, so I'm asking here on the mailing list.

Thanks in advance, I appreciate it :)

Re: Tika with nutch

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Hmm... this is really a Tika question, which probably explains why you have
received so little response from the community, unfortunately.

So the problem is that isEmpty() always returns true, indicating that
_nothing_ is being produced as output from your parser.

I would add a try/catch, like we do in TikaParser, to either feed content to
the output stream or catch the case where there is no content to be fed.

Maybe you should have a look at
http://tika.apache.org/1.0/parser.html

There is content there on the BodyContentHandler, as well as the various
readers and writers you need to get your implementation up and running.
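
To make the two suggestions above concrete (a content handler that feeds a writer, and a try/catch that separates "parsing failed" from "parsing produced nothing"), here is a minimal sketch. It is deliberately NOT Tika code: it uses only the JDK's SAX parser so it runs without Tika on the classpath, and the names BodyTextSketch, BodyTextHandler and extractBodyText are hypothetical, made up for illustration. It assumes well-formed XHTML input, which real web pages often are not. The shape is meant to mirror the BodyContentHandler(writer) pattern from the follow-up message, not to reproduce Tika's implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class BodyTextSketch {

    /** Hypothetical stand-in handler: copies character data inside <body> to a writer. */
    static class BodyTextHandler extends DefaultHandler {
        private final Writer out;
        private int bodyDepth = 0; // greater than 0 while we are inside <body>

        BodyTextHandler(Writer out) {
            this.out = out;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // Start counting at <body>, then track nesting of its descendants.
            if (bodyDepth > 0 || "body".equalsIgnoreCase(qName)) {
                bodyDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (bodyDepth > 0) {
                bodyDepth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (bodyDepth == 0) {
                return; // text outside <body>, e.g. the <title>, is skipped
            }
            try {
                out.write(ch, start, length);
            } catch (IOException e) {
                throw new SAXException(e);
            }
        }
    }

    /** Returns the body text, "" if the body held no text, or null if parsing failed. */
    static String extractBodyText(byte[] xhtml) {
        StringWriter writer = new StringWriter();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(xhtml), new BodyTextHandler(writer));
            return writer.toString();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // A failure is reported separately from "parsed fine, but no text".
            System.err.println("Parse failed: " + e);
            return null;
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><title>T</title></head><body><p>Hello</p></body></html>";
        System.out.println(extractBodyText(page.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The point of returning null on failure and "" on empty output is that the two cases need different debugging: an exception points at the input or the parser, while an empty result without an exception suggests the parser ran but never emitted any text to the handler.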

2012/2/19 HaYa aziz <ha...@hotmail.com>

>
>
> I tried using a writer as well, without any luck!
>
> StringWriter writer = new StringWriter();
> Metadata metadata = new Metadata();
> ContentHandler texthandler = new BodyContentHandler(writer);
> Parser parser = new AutoDetectParser();
> InputStream in = TikaInputStream.get(content.getContent());
> parser.parse(in, texthandler, metadata, new ParseContext());
> LOG.info("Content: " + writer.toString());
> LOG.info("is Empty? " + writer.toString().isEmpty());
>
>
> Where is the problem?



-- 
*Lewis*

RE: Tika with nutch

Posted by HaYa aziz <ha...@hotmail.com>.


I tried using a writer as well, without any luck!

StringWriter writer = new StringWriter();
Metadata metadata = new Metadata();
ContentHandler texthandler = new BodyContentHandler(writer);
Parser parser = new AutoDetectParser();
InputStream in = TikaInputStream.get(content.getContent());
parser.parse(in, texthandler, metadata, new ParseContext());
LOG.info("Content: " + writer.toString());
LOG.info("is Empty? " + writer.toString().isEmpty());


Where is the problem?
 
