Posted to user@nutch.apache.org by Charlie Williams <cw...@gmail.com> on 2007/02/12 19:21:39 UTC

Problem stepping through Inject code, as opposed to crawl

I have been trying to learn the Nutch code base by stepping through the code
in Eclipse's debug mode. However, I am unable to understand a piece of code
in the Injector.

When I run the crawl command used for intranet crawling, it successfully
injects urls into the database. When I run the standalone Injector on the
same set of urls, it injects nothing: every call to
PrefixURLFilter.filter(url) returns null.

I saw in an archive that the crawl command uses crawl-tool.xml for its
config, whereas otherwise nutch-site.xml is used. So I made the
nutch-site.xml file exactly the same, but this seemed to have no
effect. Does anyone know why?

I apologize for the newb question, but any help would be greatly
appreciated.

-Charlie

Re: Problem stepping through Inject code, as opposed to crawl

Posted by Charlie Williams <cw...@gmail.com>.
I thought I would follow up on this for anyone else who has had the same
problem. The root cause is that conf/prefix-url.txt is not included in the
nutch-0.8.1 download on the site, so the file cannot be loaded when running
the inject/generate/etc. calls.

I'm not sure why the crawl command still worked properly, but adding the
file and filling it with 'http' solved my problem.
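
To see why the missing file has exactly this effect, here is a minimal
sketch of the prefix-filter contract (illustrative names; the real
PrefixURLFilter builds a TrieStringMatcher from the prefix file, but the
observable behavior is the same): with no prefixes loaded, every url
filters to null, and adding the single line "http" lets urls through.

```java
import java.util.List;

// Illustrative sketch only: the real PrefixURLFilter matches against a
// TrieStringMatcher loaded from the prefix file; the contract is the
// same -- a url with no matching prefix filters to null.
public class PrefixFilterSketch {

    // Returns the url unchanged if it starts with any known prefix, else null.
    static String filter(String url, List<String> prefixes) {
        for (String prefix : prefixes) {
            if (url.startsWith(prefix)) {
                return url;
            }
        }
        return null; // no prefix matched, so the Injector drops this url
    }

    public static void main(String[] args) {
        // Missing or empty prefix file: every url is rejected.
        System.out.println(filter("http://example.com/", List.of()));
        // Prefix file containing the single line "http": urls pass through.
        System.out.println(filter("http://example.com/", List.of("http")));
    }
}
```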

-Charlie

On 2/12/07, Charlie Williams <cw...@gmail.com> wrote:
>
> Yes, I have been debugging; everything looks fine as it goes into the
> mapper code.
>
> From Injector.java, around line 69:
>
> try {
>   url = urlNormalizer.normalize(url);
>   url = filters.filter(url);   // <-- this is what returns null
> } catch (Exception e) {
>   ...
> }
>
> if (url != null) {   // <-- this check always fails because of that
>   ...
> }
>
> I trace the call into PrefixURLFilter.filter(url) and always get null
> returned from here:
>
> if (trie.shortestMatch(url) == null)
>   return null;
> else
>   return url;
>
> Does this clarify the root of the problem?
>
> -Charlie
>
>
> On 2/12/07, Renaud Richardet <re...@apache.org> wrote:
> >
> > Hey Charlie,
> >
> > What do the logs say in logs/hadoop.log?
> >
> > You can also try to set a breakpoint in Eclipse in the map method of
> > InjectMapper and reduce method of InjectReducer. When you get there in
> > debug mode, inspect your variables and check if everything looks good.
> > You can also check if your urls make it through: url =
> > filters.filter(url);  in InjectMapper
> >
> > HTH,
> > Renaud
> >
> >
> > Charlie Williams wrote:
> > > I have been trying to learn the Nutch code base by stepping through
> > > the code
> > > in debug mode of Eclipse. However I am unable to understand a piece of
> > > code
> > > in the Injector.
> > >
> > > When I run the crawl command used for intranet crawling, it
> > successfully
> > > injects urls into the database. When I run standalone Injector, on the
> > > same
> > > set of urls it injects nothing, returning null from each pass of
> > > PrefixURLFilter.filter( url )
> > >
> > > I saw in an archive that the crawl command uses crawl-tool.xml for its
> > > config, whereas otherwise nutch-site.xml is used. So I made the
> > > nutch-site.xml file exactly the same, but this seemed to have no
> > > effect. Does anyone know why?
> > >
> > > I apologize for the newb question, but any help would be greatly
> > > appreciated.
> > >
> > > -Charlie
> > >
> >
> >
> > --
> > Renaud Richardet                                      +1 617 230 9112
> > my email is my first name at apache.org      http://www.oslutions.com
> >
> >
>

Re: Problem stepping through Inject code, as opposed to crawl

Posted by Charlie Williams <cw...@gmail.com>.
Yes, I have been debugging; everything looks fine as it goes into the mapper
code.

From Injector.java, around line 69:

try {
  url = urlNormalizer.normalize(url);
  url = filters.filter(url);   // <-- this is what returns null
} catch (Exception e) {
  ...
}

if (url != null) {   // <-- this check always fails because of that
  ...
}

I trace the call into PrefixURLFilter.filter(url) and always get null
returned from here:

if (trie.shortestMatch(url) == null)
  return null;
else
  return url;

Does this clarify the root of the problem?
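
To make the short-circuit explicit: filters.filter(url) runs every enabled
url filter in sequence, and a single null rejects the url outright, which
is why the null from PrefixURLFilter kills the whole inject. A sketch of
that chaining (class and variable names here are illustrative, modeled on
Nutch's URLFilters rather than copied from it):

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Sketch of url-filter chaining: filters run in order, and the first
// null short-circuits the chain, rejecting the url.
public class FilterChainSketch {

    static String applyAll(String url, List<UnaryOperator<String>> filters) {
        for (UnaryOperator<String> f : filters) {
            url = f.apply(url);
            if (url == null) {
                return null; // one rejecting filter rejects the url for good
            }
        }
        return url;
    }

    public static void main(String[] args) {
        UnaryOperator<String> passThrough = u -> u;
        // e.g. a prefix filter whose prefix file is missing or empty
        UnaryOperator<String> rejectAll = u -> null;
        // Even though passThrough accepts the url, the chain returns null.
        System.out.println(applyAll("http://example.com/",
                List.of(passThrough, rejectAll)));
    }
}
```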

-Charlie


On 2/12/07, Renaud Richardet <re...@apache.org> wrote:
>
> Hey Charlie,
>
> What do the logs say in logs/hadoop.log?
>
> You can also try to set a breakpoint in Eclipse in the map method of
> InjectMapper and reduce method of InjectReducer. When you get there in
> debug mode, inspect your variables and check if everything looks good.
> You can also check if your urls make it through: url =
> filters.filter(url);  in InjectMapper
>
> HTH,
> Renaud
>
>
> Charlie Williams wrote:
> > I have been trying to learn the Nutch code base by stepping through
> > the code
> > in debug mode of Eclipse. However I am unable to understand a piece of
> > code
> > in the Injector.
> >
> > When I run the crawl command used for intranet crawling, it successfully
> > injects urls into the database. When I run standalone Injector, on the
> > same
> > set of urls it injects nothing, returning null from each pass of
> > PrefixURLFilter.filter( url )
> >
> > I saw in an archive that the crawl command uses crawl-tool.xml for its
> > config, whereas otherwise nutch-site.xml is used. So I made the
> > nutch-site.xml file exactly the same, but this seemed to have no
> > effect. Does anyone know why?
> >
> > I apologize for the newb question, but any help would be greatly
> > appreciated.
> >
> > -Charlie
> >
>
>
> --
> Renaud Richardet                                      +1 617 230 9112
> my email is my first name at apache.org      http://www.oslutions.com
>
>

Re: Problem stepping through Inject code, as opposed to crawl

Posted by Renaud Richardet <re...@apache.org>.
Hey Charlie,

What do the logs say in logs/hadoop.log?

You can also try to set a breakpoint in Eclipse in the map method of 
InjectMapper and the reduce method of InjectReducer. When you get there in 
debug mode, inspect your variables and check if everything looks good. 
You can also check if your urls make it through: url = 
filters.filter(url);  in InjectMapper

HTH,
Renaud


Charlie Williams wrote:
> I have been trying to learn the Nutch code base by stepping through 
> the code
> in debug mode of Eclipse. However I am unable to understand a piece of 
> code
> in the Injector.
>
> When I run the crawl command used for intranet crawling, it successfully
> injects urls into the database. When I run standalone Injector, on the 
> same
> set of urls it injects nothing, returning null from each pass of
> PrefixURLFilter.filter( url )
>
> I saw in an archive that the crawl command uses crawl-tool.xml for its
> config, whereas otherwise nutch-site.xml is used. So I made the
> nutch-site.xml file exactly the same, but this seemed to have no
> effect. Does anyone know why?
>
> I apologize for the newb question, but any help would be greatly
> appreciated.
>
> -Charlie
>


-- 
Renaud Richardet                                      +1 617 230 9112
my email is my first name at apache.org      http://www.oslutions.com