Posted to dev@nutch.apache.org by Michael Ji <fj...@yahoo.com> on 2005/08/22 02:22:55 UTC

dump nutch index

hi there,

Is there an easy way to dump the Nutch index to a
human-readable format?

thanks,

Michael Ji


		
____________________________________________________
Start your day with Yahoo! - make it your home page 
http://www.yahoo.com/r/hs 
 

Re: crawl-urlfilter.txt mechanics

Posted by Piotr Kosiorowski <pk...@gmail.com>.
crawl-urlfilter.txt is specific to "bin/nutch crawl". If you want to run
each step separately, you are in fact doing "Whole Web crawling"
from the tutorial, so you need to modify regex-urlfilter.txt instead.
Regards
Piotr
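
For readers unfamiliar with the filter files: crawl-urlfilter.txt and regex-urlfilter.txt hold one rule per line, a "+" (accept) or "-" (reject) followed by a regular expression, and the first rule that matches a URL decides its fate. The following is a minimal, self-contained sketch of that behavior, not Nutch's own filter class. The class name UrlFilterSketch, the example.org domain, and the sample rules are hypothetical, and the sketch assumes a pattern may match anywhere in the URL and that a URL matching no rule is rejected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative sketch only: mimics the "+regex / -regex, first match wins"
// rule format of regex-urlfilter.txt. This is not Nutch's own filter class.
class UrlFilterSketch {

    // One parsed rule: accept (+) or reject (-) URLs containing a regex match.
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    // Parse rule lines; blank lines and lines starting with '#' are skipped.
    static List<Rule> parse(String[] lines) {
        List<Rule> rules = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
        }
        return rules;
    }

    // The first rule whose pattern is found in the URL decides;
    // a URL that matches no rule is rejected (an assumption of this sketch).
    static boolean accepts(List<Rule> rules, String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical rules and domain, for illustration only.
        List<Rule> rules = parse(new String[] {
            "# skip image files",
            "-\\.(gif|jpg|png)$",
            "# accept hosts under example.org",
            "+^http://([a-z0-9]*\\.)*example\\.org/",
            "# reject everything else",
            "-."
        });
        System.out.println(accepts(rules, "http://www.example.org/index.html"));
        System.out.println(accepts(rules, "http://www.example.org/logo.png"));
        System.out.println(accepts(rules, "http://other.example.com/"));
    }
}
```

Because the first matching rule wins, rule order matters: in the sample rules the image exclusion has to come before the domain rule, otherwise images under the accepted domain would be let through.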


crawl-urlfilter.txt mechanics

Posted by Michael Ji <fj...@yahoo.com>.
Hi,

When I use intranet crawling, that is, call
"bin/nutch crawl ...", crawl-urlfilter.txt works: it
filters out the URLs that do not match the domain I
included.

In fact, when I look at crawltool.java, the
config files are read into Java Properties by
'NutchConf.get().addConfResource("crawl-tool.xml")'.

But:

When I call each step explicitly myself, such
as,
Loop
   generate segment
   fetch
   updateDB

crawl-urlfilter.txt doesn't work.

My question is:

1) If I want to control the crawler's behavior in
the second case, should I call 'NutchConf.get()...'
myself?

2) Where exactly is the URL filter applied? In the fetcher? And,
once loaded from the .xml and .txt files, is all the configuration
data kept in Properties for the lifetime of the Nutch
run?

thanks,

Michael Ji


__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

Re: dump nutch index

Posted by Jack Tang <hi...@gmail.com>.
Hi Michael

On 8/22/05, Michael Ji <fj...@yahoo.com> wrote:
> hi Jack:
> 
> I guess segread can dump the content of a fetched
> segment, but I want to see inside the index
> created by running "bin/nutch index", etc.

Try searching for the protocol keywords ("http", "https", "ftp", "file") using
NutchBean; I guess that will dump the whole index ;), right?

> thanks,
> 
> Michael Ji
> 

Regards
/Jack



-- 
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: dump nutch index

Posted by Michael Ji <fj...@yahoo.com>.
hi Jack:

I guess segread can dump the content of a fetched
segment, but I want to see inside the index
created by running "bin/nutch index", etc.

thanks,

Michael Ji




		
 

Re: dump nutch index

Posted by Jack Tang <hi...@gmail.com>.
Hi Michael

Is the "segread" Nutch command what you want?
The corresponding class is org.apache.nutch.segment.SegmentReader

Regards
/Jack




Re: dump nutch index

Posted by Michael Ji <fj...@yahoo.com>.
hi Jack:

I am using Lukeall now and can browse the index
files; it is a very powerful tool.

But I wonder if I can output the content of the
individual files in the index dir to a text format, that is,
see the text saved in the index files without
going through Lukeall.

thanks,

Michael Ji




Re: dump nutch index

Posted by Jack Tang <hi...@gmail.com>.
Hi Michael

Hope luke helps you.
http://www.getopt.org/luke/

Regards
/Jack


