Posted to dev@nutch.apache.org by Aaron Tang <ga...@gmail.com> on 2006/07/25 18:36:55 UTC
How can I get a page's content or parse data by its URL
Hi all,
How can I get a page's content or parse data from the page's URL?
Just like the command:
$ bin/nutch segread crawl/segments/20060725213636/ -dump
will dump pages in the segment.
I'm using Nutch 0.7.2 with Cygwin under Windows XP.
Thanks!
Aaron
Re: How can I get a page's content or parse data by its URL
Posted by Lourival Júnior <ju...@gmail.com>.
I think not, but you could write one; just take a look at the Nutch API.
On 7/25/06, Aaron Tang <ga...@gmail.com> wrote:
>
> Is there any Nutch API that can do this?
>
--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: junior_ufpa@hotmail.com
RE: How can I get a page's content or parse data by its URL
Posted by Aaron Tang <ga...@gmail.com>.
Is there any Nutch API that can do this?
Re: How can I get a page's content or parse data by its URL
Posted by Lourival Júnior <ju...@gmail.com>.
If I'm not wrong, you can't do this. The segread command only accepts these
arguments:
SegmentReader [-fix] [-dump] [-dumpsort] [-list] [-nocontent] [-noparsedata]
[-noparsetext] (-dir segments | seg1 seg2 ...)
NOTE: at least one segment dir name is required, or the '-dir' option.
-fix           automatically fix corrupted segments
-dump          dump segment data in human-readable format
-dumpsort      dump segment data in human-readable format, sorted by URL
-list          print useful information about segments
-nocontent     ignore content data
-noparsedata   ignore parse_data data
-noparsetext   ignore parse_text data
-dir segments  directory containing multiple segments
seg1 seg2 ...  segment directories
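Since segread can only dump a whole segment, one workaround (my own sketch, not something the thread proposes) is to dump the segment to a file and then pull out the single record you want by URL. The script below assumes each record in the dump begins with a `Recno::` line and carries a `URL::` line; check what your 0.7.2 dump actually emits and adjust the markers accordingly.

```python
import re

def records(dump_text):
    """Split a segread -dump file into per-page records.

    Assumes each record begins with a line like 'Recno:: 0'; change
    the marker if your Nutch version formats the dump differently.
    """
    parts = re.split(r"(?m)^Recno::", dump_text)
    return ["Recno::" + p for p in parts if p.strip()]

def record_for_url(dump_text, url):
    """Return the dump record whose 'URL::' line matches url, or None."""
    for rec in records(dump_text):
        m = re.search(r"(?m)^URL::\s*(\S+)", rec)
        if m and m.group(1) == url:
            return rec
    return None

# Tiny fabricated dump excerpt, just to show the mechanics.
sample = """Recno:: 0
URL:: http://example.com/a.html
Content:: hello a
Recno:: 1
URL:: http://example.com/b.html
Content:: hello b
"""

rec = record_for_url(sample, "http://example.com/b.html")
print(rec.splitlines()[1])  # the matching URL:: line
```

On a real crawl you would first generate the dump with `bin/nutch segread crawl/segments/<segment> -dump > dump.txt` and feed that file to the script.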
--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: junior_ufpa@hotmail.com