Posted to user@nutch.apache.org by "Bouchard Mathieu (DGTT)" <Ma...@revenuquebec.ca> on 2014/08/19 14:14:34 UTC

RE: bin/crawl : incorrect handling of nutch errors?

Hi,

We are using Solr with Nutch to provide a complete search engine for our website.

I created a cron job that uses Nutch to crawl and update the Solr index each night. The cron job tries to automatically correct some errors that could result in a corrupt crawldb. However, it seems that the bin/crawl command doesn't correctly propagate errors coming from bin/nutch.

Here is an example from the bin/crawl script:
    $bin/nutch inject $CRAWL_PATH/crawldb $SEEDDIR

    if [ $? -ne 0 ]
      then exit $?
    fi

Even if the nutch inject command fails, the crawl script always returns 0. The way I understand it, by the time exit $? runs, $? holds the result of the shell test and not the result of the nutch inject command.
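
To make the behaviour concrete, here is a minimal sketch that reproduces it outside of Nutch. The function fail_with_3 is a hypothetical stand-in for a failing bin/nutch inject call, and buggy is just a name chosen for this illustration; neither exists in bin/crawl.

```shell
#!/bin/sh
# Hypothetical stand-in for a failing "bin/nutch inject" call.
fail_with_3() { return 3; }

# Same pattern as in bin/crawl, run in a subshell so that "exit" does not
# terminate this demo. Inside the "then" branch, $? is the status of the
# "[ ... ]" test (0, because the test succeeded), not the status of
# fail_with_3.
buggy() (
  fail_with_3
  if [ $? -ne 0 ]
    then exit $?
  fi
)

buggy
echo "buggy pattern exit status: $?"   # prints: buggy pattern exit status: 0
```

The original failure status (3) is lost, which is why a caller such as a cron job sees success.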

To correct this, we would need to modify the script with something like:
    $bin/nutch inject $CRAWL_PATH/crawldb $SEEDDIR
    RETCODE=$?

    if [ $RETCODE -ne 0 ]
      then exit $RETCODE
    fi
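
Since the same check is repeated after each step, the saved-status pattern above could also be wrapped in a small helper. This is only a sketch: run_step is a hypothetical name, not something that exists in bin/crawl, and the error messages are made up for illustration.

```shell
#!/bin/sh
# run_step is a hypothetical helper (not part of bin/crawl): it runs the
# given command, saves its exit status right away, and exits the script
# with that same status if the command failed.
run_step() {
  "$@"
  RETCODE=$?
  if [ $RETCODE -ne 0 ]; then
    echo "Error running: $*" >&2
    echo "Failed with exit value $RETCODE." >&2
    exit $RETCODE
  fi
}

# Demo in a subshell so the "exit" does not end this script.
( run_step sh -c 'exit 3' )
echo "helper propagated status: $?"   # prints: helper propagated status: 3
```

In the crawl script this would be used as: run_step "$bin/nutch" inject "$CRAWL_PATH/crawldb" "$SEEDDIR"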

I also have a problem with the bin/nutch generate command. It returns the same error code whether there is an actual error or simply no new segment to process, so there is no way to tell if the error is real or not.

I'm thinking of opening a ticket for these issues, but I'm wondering if there was a reason the script was written this way?
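
As an illustration of what a caller would need, here is a sketch that assumes a purely hypothetical convention in which the generate step returns 0 when a segment was created, 1 when there is nothing new to generate, and any other value for a real error. This is not the behaviour described above; handle_generate_status and the three outcomes are invented for the example.

```shell
#!/bin/sh
# ASSUMPTION (hypothetical convention, not the current behaviour):
#   0     = a new segment was generated
#   1     = nothing to generate, the crawl can stop cleanly
#   other = a real error
handle_generate_status() {
  case "$1" in
    0) echo "continue" ;;
    1) echo "stop" ;;
    *) echo "fail" ;;
  esac
}

handle_generate_status 0     # prints: continue
handle_generate_status 1     # prints: stop
handle_generate_status 255   # prints: fail
```

With distinct codes like these, a cron job could treat "nothing to do" as a normal end of the crawl and only raise an alert on real failures.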

Thanks,

The information contained in this message may be confidential.

If you are not the intended recipient or a person authorized to deliver this email to them, you are hereby notified that it is strictly prohibited to use, copy, or distribute this email, to disclose the content of this message, or to take any action based on the information it contains. You are therefore asked to notify the sender immediately of this error and to destroy this message without keeping a copy.

Re: bin/crawl : incorrect handling of nutch errors?

Posted by Julien Nioche <li...@gmail.com>.
Hi Mathieu,

It is a bug indeed. As Feng suggested, please open an issue on
https://issues.apache.org/jira/browse/NUTCH and attach a patch if you can.

Thanks

Julien


On 20 August 2014 02:59, feng lu <am...@gmail.com> wrote:

> Yes, I think this is a bug in the bin/crawl script. It needs to store the exit
> status of the previously executed command.
>
> I think you can open an issue and attach your patch.



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: bin/crawl : incorrect handling of nutch errors?

Posted by feng lu <am...@gmail.com>.
Yes, I think this is a bug in the bin/crawl script. It needs to store the exit
status of the previously executed command.

I think you can open an issue and attach your patch.







-- 
Don't Grow Old, Grow Up... :-)