Posted to user@nutch.apache.org by Markus Jelsma <ma...@buyways.nl> on 2010/09/07 11:43:15 UTC

Nutch 1.2 parser fails on application-zip

Hi all,

I've got the nutch-2010-07-07_04-49-04 nightly build in which the parser fails
but keeps the process running forever! I've tried with different segments and
the common warning in the hadoop.log with which it fails is:

2010-09-07 10:48:15,633 WARN  parse.ParserFactory - ParserFactory: Plugin: 
org.apache.nutch.parse.zip.ZipParser mapped to contentType application/zip via 
parse-plugins.xml, but not enabled via plugin.includes in nutch-default.xml

The terminal output is:

2010-09-07 10:48:15,633 WARN  parse.ParserFactory - ParserFactory: Plugin: 
org.apache.nutch.parse.zip.ZipParser mapped to contentType application/zip via 
parse-plugins.xml, but not enabled via plugin.includes in nutch-default.xml

After that, it will keep running and doing nothing but eating CPU for some 
reason and needs CTRL+C to regain the terminal.

I don't think this is supposed to happen, despite the warning. Should I create
a new ticket? I couldn't find a corresponding issue yet.
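
For context on the warning itself: as far as I understand it, a plugin is only
loaded when its id matches the regular expression in plugin.includes, so a
parser can be mapped in parse-plugins.xml and still be disabled. A minimal
Python sketch of that check, assuming full-match semantics and using a made-up
plugin.includes value (the real one in nutch-default.xml or nutch-site.xml
will differ):

```python
import re

# Hypothetical plugin.includes value, for illustration only.
PLUGIN_INCLUDES = r"protocol-http|urlfilter-regex|parse-(text|html|tika)|index-basic"


def is_enabled(plugin_id, includes=PLUGIN_INCLUDES):
    """Return True if the whole plugin id matches the includes regex."""
    return re.fullmatch(includes, plugin_id) is not None


if __name__ == "__main__":
    for plugin_id in ("parse-html", "parse-zip"):
        state = "enabled" if is_enabled(plugin_id) else "not enabled"
        print(f"{plugin_id}: {state}")
```

With this value, parse-zip does not match, which is the situation the warning
describes. Adding "|parse-zip" to the expression would enable it.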

Cheers,

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Nutch 1.2 parser fails on application-zip

Posted by Markus Jelsma <ma...@buyways.nl>.
On Tuesday 07 September 2010 13:08:21 Julien Nioche wrote:
> Tika 0.7, which we currently use, does not have a parser for feeds. I added
> one recently, so Tika 0.8 will be able to handle this. The default settings
> in conf/parse-plugins.xml declare that parse-rss and feed must be used for
> that type.
> 

Indeed, I read it in the issue comments.

> 
> Since you opened it you might as well close it yourself :-)
> 

Closed. There are no more issues filed under 1.2!

Thanks!

> 
> Thanks
> 
> Julien
> 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Nutch 1.2 parser fails on application-zip

Posted by Julien Nioche <li...@gmail.com>.
Hi Markus,


> Anyway, I've got the branch fetching and parsing like a madman without the
> previous issue.


great


> I do get the following error though:
>
> Error parsing: <URL>: failed(2,0): Can't retrieve Tika parser for mime-type
> application/rss+xml
>
> I assume this is a known issue? Looking at NUTCH-887 and NUTCH-888.
>

Tika 0.7, which we currently use, does not have a parser for feeds. I added
one recently, so Tika 0.8 will be able to handle this. The default settings
in conf/parse-plugins.xml declare that parse-rss and feed must be used for
that type.
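
To illustrate the mapping, here is a rough Python sketch that looks up the
plugins declared for a MIME type. The embedded XML is a hand-written fragment
modelled on conf/parse-plugins.xml, not a copy of the shipped file, and
plugins_for is a hypothetical helper:

```python
import xml.etree.ElementTree as ET

# Hand-written fragment in the style of conf/parse-plugins.xml (an assumption,
# not the real file contents).
PARSE_PLUGINS_XML = """
<parse-plugins>
  <mimeType name="application/rss+xml">
    <plugin id="parse-rss" />
    <plugin id="feed" />
  </mimeType>
  <mimeType name="application/zip">
    <plugin id="parse-zip" />
  </mimeType>
</parse-plugins>
"""


def plugins_for(mime_type, xml_text=PARSE_PLUGINS_XML):
    """Return the plugin ids mapped to mime_type, in declaration order."""
    root = ET.fromstring(xml_text)
    for node in root.findall("mimeType"):
        if node.get("name") == mime_type:
            return [plugin.get("id") for plugin in node.findall("plugin")]
    return []  # no explicit mapping for this type


if __name__ == "__main__":
    print(plugins_for("application/rss+xml"))
```

A type with no entry returns an empty list here. What Nutch itself does in
that case is not covered by this sketch.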


> Also, the issue I opened yesterday, NUTCH-898, can be closed. I just confirmed
> that the current branch-1.2 handles the multi-valued subcollection field as
> expected.
>

Since you opened it you might as well close it yourself :-)


Thanks

Julien


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com



Re: Nutch 1.2 parser fails on application-zip

Posted by Markus Jelsma <ma...@buyways.nl>.
Sounds great!

Anyway, I've got the branch fetching and parsing like a madman without the
previous issue. I do get the following error though:

Error parsing: <URL>: failed(2,0): Can't retrieve Tika parser for mime-type 
application/rss+xml

I assume this is a known issue? Looking at NUTCH-887 and NUTCH-888.
 
Also, the issue I opened yesterday, NUTCH-898, can be closed. I just confirmed
that the current branch-1.2 handles the multi-valued subcollection field as
expected.

Thanks!





Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Nutch 1.2 parser fails on application-zip

Posted by Julien Nioche <li...@gmail.com>.
Hi Markus,

> I tried an SVN export of trunk yesterday to check if the subcollection problem
> (other thread) was fixed in a later stage. The problem was, I couldn't build
> it with ant and got a dependency error, complaining about some Gora thing I
> know nothing about. Also, it took ages to resolve and build.
>

The trunk (Nutch 2.0) uses Ivy to manage the dependencies and requires them
to be downloaded the first time, which is why it takes time. As for Gora, the
code needs to be downloaded locally and then built, as Nutch expects it to be
available in the local Ivy cache. This is temporary, and when Gora matures it
will be managed like any other dependency.


> Anyway, I'm building the branch-1.2 now and will try to parse the segment in
> a while.
>

OK. I've committed the subcollection patch on the 1.2 branch so you wouldn't
have seen it on the nightly build that you were using.

Julien

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com



Re: Nutch 1.2 parser fails on application-zip

Posted by Markus Jelsma <ma...@buyways.nl>.
Hi Julien,

I tried an SVN export of trunk yesterday to check if the subcollection problem
(other thread) was fixed in a later stage. The problem was, I couldn't build
it with ant and got a dependency error, complaining about some Gora thing I
know nothing about. Also, it took ages to resolve and build.

Anyway, I'm building the branch-1.2 now and will try to parse the segment in
a while.

Cheers,


Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Nutch 1.2 parser fails on application-zip

Posted by Julien Nioche <li...@gmail.com>.
Hi Markus,

Could you please try with SVN branch-1.2? The nightly build you are using
dates from early July (the nightly build mechanism has not been fixed since)
and NUTCH-696 was committed after that.
There is no relation between the warning message and the fact that the
parser runs forever. Please file a JIRA if the problem persists with the SVN
version. If possible, set the log level to debug in order to find out the
URL which is causing the problem, so that we can reproduce the issue. It could
be another case of a file trimmed to the max size allowed during the
fetching, which puts the parser in trouble. We'll see.
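
As a rough illustration of the truncation case, the Python sketch below builds
a zip archive in memory, cuts it off at a content limit, and shows that the
trimmed copy is no longer a readable archive. The 65536-byte limit, the file
name and the helper names are assumptions made for the example. It says
nothing about how the Nutch zip parser itself reacts to such input.

```python
import io
import zipfile

CONTENT_LIMIT = 65536  # assumed fetch-time size limit, in bytes


def build_zip(payload):
    """Return the bytes of a zip archive holding one uncompressed entry."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr("data.bin", payload)
    return buf.getvalue()


def is_complete_zip(data):
    """Return True if data opens as a zip and every entry passes its CRC check."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            return zf.testzip() is None
    except zipfile.BadZipFile:
        # The end-of-archive record is missing or damaged.
        return False


full = build_zip(b"x" * 100_000)
truncated = full[:CONTENT_LIMIT]  # what a size-limited fetch would keep

if __name__ == "__main__":
    print("full archive readable:", is_complete_zip(full))
    print("truncated archive readable:", is_complete_zip(truncated))
```

A check like this on the stored content could help decide whether a
problematic URL is a truncation case before digging into the parser.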

Best,

Julien

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
