Posted to dev@nutch.apache.org by Markus Jelsma <ma...@openindex.io> on 2012/01/03 18:18:26 UTC

What to do with items for which there is no parser?

Hi,

Right now the state of the crawldb is set to success for items that have no 
parser and throw: 

Exception in thread "main" org.apache.nutch.parse.ParseException: parser not 
found for contentType=video/x-flv url=
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:78)
        at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:101)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:138)

Should we do that at all? It doesn't seem right. I, for instance, am not 
interested in retrying such a URL again for a very long time.

Thoughts?
Thanks

Re: What to do with items for which there is no parser?

Posted by Markus Jelsma <ma...@openindex.io>.
> It's a good point Markus. I would imagine that we would wish to do one
> of two things
> 
> 1) Create a parser to fetch the contentType in question (not the aim
> of Nutch but geared more towards Tika contribution...)
> 2) As you mention, use a parser implementation which stores this
> contentType as false for parsing e.g. skip this contentType when it is
> encountered again. However, are we not able to achieve this through use
> of a urlfilter which denies the .x-flv suffix?

Indeed, the question is more about the state of the CrawlDB. I think the type 
should still be stored because it is valuable information if one decides to 
parse that type later.

I wonder if a db_gone status would be more appropriate in such a case. We 
cannot filter all URLs by using the suffix filter because sometimes URLs 
just don't have an extension at all but can be of any format.
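
To illustrate the point about extensions, here is a minimal, self-contained sketch. It does not use Nutch's urlfilter plugins or parser configuration; the class name, the deny list and the set of parseable types are all hypothetical. It only shows that a suffix check can miss a URL whose content type is known only after the fetch.

```java
import java.util.Locale;
import java.util.Set;

public class SuffixVsContentType {

    // Hypothetical deny list for a suffix-based URL filter.
    static final Set<String> DENIED_SUFFIXES = Set.of(".flv", ".avi");

    // Hypothetical set of content types for which a parser is configured.
    static final Set<String> PARSEABLE_TYPES = Set.of("text/html", "application/pdf");

    /** True if the URL path ends with one of the denied suffixes. */
    static boolean deniedBySuffix(String url) {
        // Drop the query string and fragment before looking at the suffix.
        String path = url.split("[?#]", 2)[0].toLowerCase(Locale.ROOT);
        return DENIED_SUFFIXES.stream().anyMatch(path::endsWith);
    }

    /** True if a parser is configured for the content type seen after the fetch. */
    static boolean hasParser(String contentType) {
        return PARSEABLE_TYPES.contains(contentType.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String withSuffix = "http://example.com/clip.flv";
        String withoutSuffix = "http://example.com/watch?id=42";

        // The first URL is caught by the suffix filter, the second is not.
        System.out.println(deniedBySuffix(withSuffix));
        System.out.println(deniedBySuffix(withoutSuffix));

        // Both may still be served as video/x-flv, for which no parser exists here.
        System.out.println(hasParser("video/x-flv"));
    }
}
```

In this sketch the second URL passes the filter and is fetched, so the missing parser only shows up afterwards. That is why the resulting CrawlDB state still matters.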

Also, what would the signature of an unparsed file be (sorry, can't check 
right now)? It must not change, nor let the fetch scheduler think the item 
must be fetched sooner than its interval.
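
A rough sketch of the two ideas above, under stated assumptions: the status names are borrowed from this thread but the enum, the class and the methods are hypothetical and are not Nutch's CrawlDatum, Signature or FetchSchedule API. The signature here is simply an MD5 digest of the raw fetched bytes, so it does not depend on a parse and stays the same as long as the bytes do.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class UnparsedItemSketch {

    // Hypothetical statuses, named after the ones discussed in this thread.
    enum Status { DB_FETCHED, DB_GONE }

    /** Digest of the raw fetched bytes; no parse output is involved. */
    static byte[] signature(byte[] rawContent) {
        try {
            return MessageDigest.getInstance("MD5").digest(rawContent);
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    /** One possible policy: a fetched item with no parser is marked gone. */
    static Status statusFor(boolean parserFound) {
        return parserFound ? Status.DB_FETCHED : Status.DB_GONE;
    }

    public static void main(String[] args) {
        byte[] firstFetch = "raw flv bytes".getBytes(StandardCharsets.UTF_8);
        byte[] secondFetch = "raw flv bytes".getBytes(StandardCharsets.UTF_8);

        // Same bytes give the same signature, so a scheduler comparing
        // signatures would have no reason to shorten the fetch interval.
        boolean unchanged = Arrays.equals(signature(firstFetch), signature(secondFetch));
        System.out.println("unchanged = " + unchanged);
        System.out.println("status = " + statusFor(false));
    }
}
```

Whether db_gone is the right status, and how the real fetch scheduler reacts to it, is exactly the open question here; the sketch only makes the proposal concrete.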


Re: What to do with items for which there is no parser?

Posted by Lewis John Mcgibbney <le...@gmail.com>.
It's a good point Markus. I would imagine that we would wish to do one
of two things

1) Create a parser to fetch the contentType in question (not the aim
of Nutch but geared more towards Tika contribution...)
2) As you mention, use a parser implementation which stores this
contentType as false for parsing e.g. skip this contentType when it is
encountered again. However, are we not able to achieve this through use
of a urlfilter which denies the .x-flv suffix?




-- 
Lewis