Posted to dev@nutch.apache.org by Michael Ji <fj...@yahoo.com> on 2005/08/20 05:07:29 UTC

MD5 in fetchlist / fetcher

hi there,

I dumped the contents of segment/fetchlist and
segment/fetcher.

My question is: why isn't the MD5 signature of the
page content saved in the fetchlist?

I think it would save CPU time if we could tell that a
page is unchanged, because we could then skip the parsing
process. If the MD5 were in the fetchlist, we could do
the comparison directly in memory. With the MD5 only in
the fetcher output, we need to look it up in a local file
in order to compare it with the MD5 of the newly fetched
page content.

Did I miss some important point, or is my dump
wrong?

thanks,

Michael Ji 

----------------fetchlist--------------------
fetch: true
page: Version: 4
URL: http://www.sina.com/
ID: d6a83e9c17e05d5602709a63c241bf68
Next fetch: Sun Aug 21 20:15:06 CDT 2005
Retries since fetch: 0
Retry interval: 30 days
Num outlinks: 0
Score: 1.0
NextScore: 1.0

anchors: 0

----------------fetcher--------------------
fetch: true
page: Version: 4
URL: http://www.sina.com/
ID: d6a83e9c17e05d5602709a63c241bf68
Next fetch: Sun Aug 21 20:15:06 CDT 2005
Retries since fetch: 0
Retry interval: 30 days
Num outlinks: 0
Score: 1.0
NextScore: 1.0

anchors: 0
Fetch Result:
MD5Hash: 56eae3c2556cb10a00e7346738dcb318
ProtocolStatus: success(1), lastModified=0
FetchDate: Sun Aug 14 20:15:13 CDT 2005
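
The idea in the question above (skip parsing when the content MD5 is
unchanged since the previous fetch) could be sketched roughly as follows.
This is only an illustration using the standard java.security.MessageDigest
class; ContentChangeDetector and its in-memory map are hypothetical names,
not part of Nutch, and the sketch says nothing about how Nutch itself
stores or compares the MD5Hash shown in the fetcher dump.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not part of Nutch: remembers the last content MD5 per URL. */
class ContentChangeDetector {
    // URL -> hex MD5 of the content seen at the previous fetch (kept in memory).
    private final Map<String, String> lastHashByUrl = new HashMap<>();

    /** Returns the lowercase hex MD5 digest of the given bytes. */
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(content)) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    /**
     * Records the hash of the newly fetched content and reports whether it
     * differs from the previous fetch (also true for a URL never seen before).
     */
    boolean hasChanged(String url, byte[] content) {
        String newHash = md5Hex(content);
        String oldHash = lastHashByUrl.put(url, newHash);
        return !newHash.equals(oldHash);
    }

    public static void main(String[] args) {
        ContentChangeDetector detector = new ContentChangeDetector();
        byte[] page = "<html>hello</html>".getBytes(StandardCharsets.UTF_8);
        // First fetch: nothing to compare against, so the page must be parsed.
        System.out.println(detector.hasChanged("http://www.sina.com/", page));
        // Same bytes again: the hash matches, so parsing could be skipped.
        System.out.println(detector.hasChanged("http://www.sina.com/", page));
    }
}
```

A caller would parse the page only when hasChanged returns true. Note that
the page still has to be downloaded before its MD5 can be computed, so a
check like this can save parsing time but not fetch time.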




__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

Re: crawl-urlfilter.txt mechanics

Posted by Piotr Kosiorowski <pk...@gmail.com>.
crawl-urlfilter.txt is specific to "bin/nutch crawl". If you want to run
each step separately, you are in fact doing the "Whole Web crawling"
from the tutorial, so you need to modify regex-urlfilter.txt instead.
Regards
Piotr
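
For readers unfamiliar with how such a filter file is applied, here is a
rough, self-contained sketch. It assumes the usual convention for these
files: each rule is a regular expression prefixed with "+" (accept) or "-"
(reject), the first rule that matches a URL decides, and a URL matching no
rule is rejected. SimpleRegexUrlFilter is a hypothetical stand-in written
for illustration, not Nutch's own filter class, and example.com is a
placeholder domain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Simplified stand-in for a regex URL filter; not Nutch's own class. */
class SimpleRegexUrlFilter {
    /** One parsed rule: accept or reject URLs in which the pattern is found. */
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Parses lines of the form "+regex" or "-regex"; blank lines and "#" comments are skipped. */
    SimpleRegexUrlFilter(List<String> lines) {
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
        }
    }

    /** The first rule whose pattern is found in the URL decides; no match means reject. */
    boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        SimpleRegexUrlFilter filter = new SimpleRegexUrlFilter(Arrays.asList(
            "# skip common image suffixes",
            "-\\.(gif|jpg|png)$",
            "# accept hosts in the placeholder domain example.com",
            "+^http://([a-z0-9]*\\.)*example\\.com/"));
        System.out.println(filter.accepts("http://www.example.com/index.html"));
        System.out.println(filter.accepts("http://www.example.com/logo.gif"));
        System.out.println(filter.accepts("http://other.org/"));
    }
}
```

Because the first matching rule wins, the order of the lines matters: the
image rule above has to come before the domain rule, otherwise the GIF on
example.com would be accepted.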

On 8/22/05, Michael Ji <fj...@yahoo.com> wrote:
> 
> Hi,
> 
> When I use intranet crawling, i.e. call
> "bin/nutch crawl ...", crawl-urlfilter.txt works: it
> filters out the URLs that do not match the domain I
> included.
> 
> In fact, when I look at crawltool.java, I see that the
> config files are read into Java Properties by
> 'NutchConf.get().addConfResource("crawl-tool.xml")'.
> 
> But when I call each step explicitly myself, such as:
> 
> Loop
>    generate segment
>    fetch
>    updateDB
> 
> crawl-urlfilter.txt doesn't work.
> 
> My questions are:
> 
> 1) If I want to control the crawler's behavior in the
> second case, should I call 'NutchConf.get()...'
> myself?
> 
> 2) Where exactly is the URL filter applied? In the
> fetcher? And after being loaded from the .xml and .txt
> files, is all the configuration data kept in Properties
> for the lifetime of the Nutch run?
> 
> thanks,
> 
> Michael Ji
> 

crawl-urlfilter.txt mechanics

Posted by Michael Ji <fj...@yahoo.com>.
Hi,

When I use intranet crawling, i.e. call
"bin/nutch crawl ...", crawl-urlfilter.txt works: it
filters out the URLs that do not match the domain I
included.

In fact, when I look at crawltool.java, I see that the
config files are read into Java Properties by
'NutchConf.get().addConfResource("crawl-tool.xml")'.

But when I call each step explicitly myself, such as:

Loop
   generate segment
   fetch
   updateDB

crawl-urlfilter.txt doesn't work.

My questions are:

1) If I want to control the crawler's behavior in the
second case, should I call 'NutchConf.get()...'
myself?

2) Where exactly is the URL filter applied? In the
fetcher? And after being loaded from the .xml and .txt
files, is all the configuration data kept in Properties
for the lifetime of the Nutch run?

thanks,

Michael Ji



Re: dump nutch index

Posted by Jack Tang <hi...@gmail.com>.
Hi Michael

On 8/22/05, Michael Ji <fj...@yahoo.com> wrote:
> hi Jack:
> 
> I guess segread can dump the content of a fetched
> segment, but I want to see inside the index
> created by running "bin/nutch index", etc.

Try searching for the protocol keywords ("http", "https", "ftp", "file")
using NutchBean. I guess that will dump the whole index ;), right?

> thanks,
> 
> Michael Ji
> 

Regards
/Jack



-- 
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: dump nutch index

Posted by Michael Ji <fj...@yahoo.com>.
hi Jack:

I guess segread can dump the content of a fetched
segment, but I want to see inside the index
created by running "bin/nutch index", etc.

thanks,

Michael Ji

--- Jack Tang <hi...@gmail.com> wrote:

> Hi Michael
> 
> Is the "segread" nutch command what you want?
> The corresponding class is
> org.apache.nutch.segment.SegmentReader
> 
> Regards
> /Jack



		
____________________________________________________
Start your day with Yahoo! - make it your home page 
http://www.yahoo.com/r/hs 
 

Re: dump nutch index

Posted by Jack Tang <hi...@gmail.com>.
Hi Michael

Is the "segread" nutch command what you want?
The corresponding class is org.apache.nutch.segment.SegmentReader

Regards
/Jack

On 8/22/05, Michael Ji <fj...@yahoo.com> wrote:
> hi Jack:
> 
> I am using Lukeall now and can browse the index
> files; it is a very powerful tool.
> 
> But I wonder if I can output the content of the
> individual files in the index dir in a text format,
> meaning I can see the text saved in the index files
> without it being interpreted by Lukeall.
> 
> thanks,
> 
> Michael Ji


-- 
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: dump nutch index

Posted by Michael Ji <fj...@yahoo.com>.
hi Jack:

I am using Lukeall now and can browse the index
files; it is a very powerful tool.

But I wonder if I can output the content of the
individual files in the index dir in a text format,
meaning I can see the text saved in the index files
without it being interpreted by Lukeall.

thanks,

Michael Ji

--- Jack Tang <hi...@gmail.com> wrote:

> Hi Michael
> 
> Hope luke helps you.
> http://www.getopt.org/luke/
> 
> Regards
> /Jack



Re: dump nutch index

Posted by Jack Tang <hi...@gmail.com>.
Hi Michael

Hope luke helps you.
http://www.getopt.org/luke/

Regards
/Jack

On 8/22/05, Michael Ji <fj...@yahoo.com> wrote:
> hi there,
> 
> Is there an easy way to dump the nutch index to a
> human-readable format?
> 
> thanks,
> 
> Michael Ji


-- 
Keep Discovering ... ...
http://www.jroller.com/page/jmars

dump nutch index

Posted by Michael Ji <fj...@yahoo.com>.
hi there,

Is there an easy way to dump the nutch index to a
human-readable format?

thanks,

Michael Ji


		