Posted to dev@nutch.apache.org by Michael Ji <fj...@yahoo.com> on 2005/08/22 02:22:55 UTC
dump nutch index
hi there,
Is there an easy way to dump the Nutch index to a
human-readable format?
thanks,
Michael Ji
____________________________________________________
Start your day with Yahoo! - make it your home page
http://www.yahoo.com/r/hs
Re: crawl-urlfilter.txt mechanics
Posted by Piotr Kosiorowski <pk...@gmail.com>.
crawl-urlfilter.txt is specific to "bin/nutch crawl". If you run
each step separately, you are in fact doing the "Whole Web crawling"
from the tutorial, so you need to modify regex-urlfilter.txt instead.
Regards
Piotr
On 8/22/05, Michael Ji <fj...@yahoo.com> wrote:
>
> Hi,
>
> When I use intranet crawling, that is, call
> "bin/nutch crawl ...", crawl-urlfilter.txt works: it
> filters out the URLs that do not match the domain I
> included.
>
> Actually, when I look at crawltool.java, the
> config files are read into Java Properties by
> 'NutchConf.get().addConfResource("crawl-tool.xml")'.
>
> But:
>
> When I call each step explicitly myself, such
> as,
> Loop
>   generate segment
>   fetch
>   updateDB
>
> crawl-urlfilter.txt doesn't work.
>
> My questions are:
>
> 1) If I want to control the crawler's behavior in
> the second case, should I call 'NutchConf.get()...'
> myself?
>
> 2) Where exactly does the URL filter work? In the fetcher? And,
> after being loaded from .xml and .txt, is all the configuration
> data kept in Properties for the lifetime of the Nutch
> run?
>
> thanks,
>
> Michael Ji
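
To illustrate the filter files Piotr mentions: as far as I know, both crawl-urlfilter.txt and regex-urlfilter.txt hold one rule per line, a '+' (accept) or '-' (reject) followed by a regular expression, and the first rule that matches a URL decides its fate. The Java sketch below imitates that behaviour with java.util.regex. It is not Nutch's own code: the class name UrlFilterSketch and the example.org domain are made up for illustration, and the assumption that a URL matching no rule is rejected should be checked against your Nutch version.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical stand-in for a +/- regex URL filter; not part of Nutch. */
public class UrlFilterSketch {

    /** One rule: accept ('+') or reject ('-') URLs containing a match. */
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Parses rule lines; blank lines and lines starting with '#' are skipped. */
    public UrlFilterSketch(List<String> lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            char sign = trimmed.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(trimmed.substring(1))));
        }
    }

    /** The first matching rule wins; assumption: no match means the URL is rejected. */
    public boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        UrlFilterSketch filter = new UrlFilterSketch(List.of(
            "# skip file:, ftp:, and mailto: urls",
            "-^(file|ftp|mailto):",
            "# accept hosts in the (hypothetical) domain example.org",
            "+^http://([a-z0-9]*\\.)*example\\.org/",
            "# skip everything else",
            "-."));
        System.out.println(filter.accepts("http://www.example.org/index.html")); // true
        System.out.println(filter.accepts("http://other.example.com/"));         // false
        System.out.println(filter.accepts("ftp://example.org/file"));            // false
    }
}
```

For a step-by-step crawl (generate, fetch, updatedb), rules of this shape would go into regex-urlfilter.txt rather than crawl-urlfilter.txt, as Piotr describes.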
crawl-urlfilter.txt mechanics
Posted by Michael Ji <fj...@yahoo.com>.
Hi,
When I use intranet crawling, that is, call
"bin/nutch crawl ...", crawl-urlfilter.txt works: it
filters out the URLs that do not match the domain I
included.
Actually, when I look at crawltool.java, the
config files are read into Java Properties by
'NutchConf.get().addConfResource("crawl-tool.xml")'.
But:
When I call each step explicitly myself, such
as,
Loop
  generate segment
  fetch
  updateDB
crawl-urlfilter.txt doesn't work.
My questions are:
1) If I want to control the crawler's behavior in
the second case, should I call 'NutchConf.get()...'
myself?
2) Where exactly does the URL filter work? In the fetcher? And,
after being loaded from .xml and .txt, is all the configuration
data kept in Properties for the lifetime of the Nutch
run?
thanks,
Michael Ji
__________________________________________________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.com
Re: dump nutch index
Posted by Jack Tang <hi...@gmail.com>.
Hi Michael
On 8/22/05, Michael Ji <fj...@yahoo.com> wrote:
> hi Jack:
>
> I guess segread can dump the content of a fetched
> segment, but I want to see inside the index
> created by running "bin/nutch index", etc.
Try searching for the protocol keywords ("http", "https", "ftp", "file")
using NutchBean; I guess that will dump the whole index ;), right?
> thanks,
>
> Michael Ji
>
Regards
/Jack
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
Re: dump nutch index
Posted by Michael Ji <fj...@yahoo.com>.
hi Jack:
I guess segread can dump the content of a fetched
segment, but I want to see inside the index
created by running "bin/nutch index", etc.
thanks,
Michael Ji
--- Jack Tang <hi...@gmail.com> wrote:
> Hi Michael
>
> Is the "segread" Nutch command what you want?
> The corresponding class is
> org.apache.nutch.segment.SegmentReader
>
> Regards
> /Jack
Re: dump nutch index
Posted by Jack Tang <hi...@gmail.com>.
Hi Michael
Is the "segread" Nutch command what you want?
The corresponding class is org.apache.nutch.segment.SegmentReader
Regards
/Jack
On 8/22/05, Michael Ji <fj...@yahoo.com> wrote:
> hi Jack:
>
> I am using Lukeall now and can browse the index
> files; it is a very powerful tool.
>
> But I wonder if I can output the content of the
> individual files in the index dir as text, meaning
> I could see the text saved in the index files without
> having Lukeall interpret them.
>
> thanks,
>
> Michael Ji
Re: dump nutch index
Posted by Michael Ji <fj...@yahoo.com>.
hi Jack:
I am using Lukeall now and can browse the index
files; it is a very powerful tool.
But I wonder if I can output the content of the
individual files in the index dir as text, meaning
I could see the text saved in the index files without
having Lukeall interpret them.
thanks,
Michael Ji
--- Jack Tang <hi...@gmail.com> wrote:
> Hi Michael
>
> Hope luke helps you.
> http://www.getopt.org/luke/
>
> Regards
> /Jack
Re: dump nutch index
Posted by Jack Tang <hi...@gmail.com>.
Hi Michael
Hope luke helps you.
http://www.getopt.org/luke/
Regards
/Jack
On 8/22/05, Michael Ji <fj...@yahoo.com> wrote:
> hi there,
>
> Is there an easy way to dump the Nutch index to a
> human-readable format?
>
> thanks,
>
> Michael Ji