Posted to user@nutch.apache.org by Renato Marroquín Mogrovejo <re...@gmail.com> on 2013/05/12 09:40:05 UTC

Fetching a specific number of urls

Hi all,

I have been trying to fetch urls with a query string similar to:

http://www.xyz.com/?page=1

The page number can vary from 1 to 100, and inside the first page
there are links to the next ones. So I updated the
conf/regex-urlfilter.txt file and added:

^[0-9]{1,45}$

When I do this, the generate job fails with an "Invalid
first character" error. I have also tried generating with topN 5 and depth 5
and then fetching more urls, but that does not work either.

Could anyone advise me on how to accomplish this? I am running Nutch 2.x.
Thanks in advance!


Renato M.

Re: Fetching a specific number of urls

Posted by Tejas Patil <te...@gmail.com>.
On Thu, May 16, 2013 at 11:53 AM, Renato Marroquín Mogrovejo <
renatoj.marroquin@gmail.com> wrote:

> Well I have managed to get the same results as you have (I think). Now
> on my crawldb there are the links with the following structure:
>
> +http://www.xyz.com/\?page=*
>
> But there are also many other links, how would I do to only get the
> links in the above format? I mean ignoring all the others and only
> getting the ones with the same structure.
>

If you *just* want urls of the type
http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=1

then add an accept rule for that and reject the rest by using this:

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

+http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=*

# reject all other urls
-.
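
One quick way to sanity-check rules like these before re-crawling (a sketch, assuming it is run from the Nutch runtime directory; the checker commands are also described later in this thread) is to feed a url on stdin and look at the verdict:

bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter

The checker echoes each url back prefixed with "+" if the current rules accept it and "-" if they reject it.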


> I have also noticed something interesting, that if I use:
>
> ./bin/nutch generate -topN 10 -numFetchers 1 -depth 10  -noFilter -adddays
> 0
>
> I only get the same seed url but no others, is this caused by the
> depth parameter?
>

Weird. Depth has nothing to do with this.
The topN parameter could be set to a bigger value to see if this still happens. I
vaguely remember (2-3 years back) that there was a jira about this, and it
was closed as won't-fix because people won't use low topN values in a typical
prod setup.
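
For example, something like this (the topN value below is just an arbitrary larger number for illustration, not a recommendation from this thread):

bin/nutch generate -topN 1000 -numFetchers 1 -noFilter -adddays 0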


> Thanks again!
>
>
> Renato M.
>
>
> 2013/5/16 Renato Marroquín Mogrovejo <re...@gmail.com>:
> > Hi Tejas,
> >
> > Thank you very much for your help again.
> > But I'm sorry to inform that I am still not able to get the next link
> > into my crawldb. I am thinking that my conf/regex-urlfilter.txt file
> > is not properly set up. I am sending the content of this file, could
> > you help me determining what is wrong with it?
> > Thanks a ton in advanced!
> >
> >
> > Renato M.
> >
> >
> > # skip file: ftp: and mailto: urls
> > -^(file|ftp|mailto):
> >
> > # skip image and other suffixes we can't yet parse
> > # for a more extensive coverage use the urlfilter-suffix plugin
> >
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
> >
> > #+http://www.xyz.com/\?page=*
> > +
> http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=*
> >
> > # skip URLs containing certain characters as probable queries, etc.
> > #-[?*!@=]
> > +.
> >
> > # skip URLs with slash-delimited segment that repeats 3+ times, to break
> loops
> > #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
> > +.
> >
> > # accept anything else
> > +.
> >
> > 2013/5/13 Tejas Patil <te...@gmail.com>:
> >> Hi Renato,
> >>
> >> The default content limit for http protocol is 65536 while the webpage
> is
> >> much bigger than that. The relevant config needs to be updated.
> >> Add this to the conf/nutch-site.xml:
> >>
> >> *<property>*
> >> *  <name>http.content.limit</name>*
> >> *  <value>240000</value>*
> >> *  <description>The length limit for downloaded content using the http*
> >> *  protocol, in bytes. If this value is nonnegative (>=0), content
> longer*
> >> *  than it will be truncated; otherwise, no truncation at all. Do not*
> >> *  confuse this setting with the file.content.limit setting.*
> >> *  </description>*
> >> *</property>*
> >>
> >> I got a connection timed out error post this config change above (it
> makes
> >> sense as the content to be downloaded is more).
> >> So I added this to the conf/nutch-site.xml:
> >>
> >> *<property>*
> >> *  <name>http.timeout</name>*
> >> *  <value>1000000</value>*
> >> *  <description>The default network timeout, in
> milliseconds.</description>*
> >> *</property>*
> >>
> >> After running a fresh crawl, I could see the link to the next page in
> the
> >> crawldb:
> >>
> >> *
> >>
> http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2
> >> key:
> >>
>  net.telelistas.www:http/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2
> >> *
> >> *baseUrl:        null*
> >> *status: 1 (status_unfetched)*
> >> *fetchTime:      1368424541731*
> >> *prevFetchTime:  0*
> >> *fetchInterval:  2592000*
> >> *retriesSinceFetch:      0*
> >> *modifiedTime:   0*
> >> *prevModifiedTime:       0*
> >> *protocolStatus: (null)*
> >> *parseStatus:    (null)*
> >> *title:  null*
> >> *score:  0.0042918455*
> >> *markers:        {dist=1}*
> >> *reprUrl:        null*
> >> *metadata _csh_ :        ;���*
> >>
> >> HTH
> >>
> >>
> >> On Sun, May 12, 2013 at 10:21 PM, Renato Marroquín Mogrovejo <
> >> renatoj.marroquin@gmail.com> wrote:
> >>
> >>> Hi Tejas,
> >>>
> >>> So I started fresh. I deleted the webpage keyspace as I am using
> >>> Cassandra as a backend. But I did get the same output. I mean I get a
> >>> bunch of urls after I do a readdb -dump but not the ones I want. I get
> >>> only one fetched site, and many links parsed (to be parsed in the next
> >>> cycle?). Maybe it has to do something with the urls I am trying to
> >>> get?
> >>> I am trying to get this url and similar ones:
> >>>
> >>>
> >>>
> http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=1
> >>>
> >>> But I have noticed that the links pointing to the next ones are
> >>> something like this:
> >>>
> >>> <a class="resultado_roda"
> >>> href="/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2">2 </a>
> >>>
> >>> So I decided to try commenting this url rule:
> >>> # skip URLs with slash-delimited segment that repeats 3+ times, to
> break
> >>> loops
> >>> #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
> >>>
> >>> But I got the same results. A single site fetched, some urls parsed
> >>> but not the ones I want using the regex-urlfilter.txt. Any Ideas?
> >>> Thanks a ton for your help Tejas!
> >>>
> >>>
> >>> Renato M.
> >>>
> >>>
> >>> 2013/5/12 Tejas Patil <te...@gmail.com>:
> >>> > Hi Renato,
> >>> >
> >>> > Thats weird. I ran a crawl over similar urls having a query in the
> end (
> >>> > http://ngs.ics.uci.edu/blog/?p=338) and it worked fine for me with
> 2.x.
> >>> > My guess is that there is something wrong while parsing due to which
> >>> > outlinks are not getting into the crawldb.
> >>> >
> >>> > Start from fresh. Clear everything from previous attempts.
> (including the
> >>> > backend table named as the value of 'storage.schema.webpage').
> >>> > Run these :
> >>> > bin/nutch inject *<urldir>*
> >>> > bin/nutch generate -topN 50 -numFetchers 1 -noFilter -adddays 0
> >>> > bin/nutch fetch *<batchID>* -threads 2
> >>> > bin/nutch parse *<batchID> *
> >>> > bin/nutch updatedb
> >>> > bin/nutch readdb -dump <*output dir*>
> >>> >
> >>> > The readdb output will shown if the outlinks were extracted
> correctly.
> >>> >
> >>> > The commands for checking urlfilter rules accept one input url at a
> time
> >>> > from console (you need to type/paste the url and hit enter).
> >>> > It shows "+" if the url is accepted by the current rules. ("-" for
> >>> > rejection).
> >>> >
> >>> > Thanks,
> >>> > Tejas
> >>> >
> >>> > On Sun, May 12, 2013 at 10:30 AM, Renato Marroquín Mogrovejo <
> >>> > renatoj.marroquin@gmail.com> wrote:
> >>> >
> >>> >> And I did try the commands you told me but I am not sure how they
> >>> >> work. They do wait for an url to be input, but then it prints the
> url
> >>> >> with a '+' at the beginning, what does that mean?
> >>> >>
> >>> >> http://www.xyz.com/lanchon
> >>> >> +http://www.xyz.com/lanchon
> >>> >>
> >>> >> 2013/5/12 Renato Marroquín Mogrovejo <re...@gmail.com>:
> >>> >> > Hi Tejas,
> >>> >> >
> >>> >> > Thanks for your help. I have tried the expression you suggested,
> and
> >>> >> > now my url-filter file is like this:
> >>> >> > +http://www.xyz.com/\?page=*
> >>> >> >
> >>> >> > # skip URLs containing certain characters as probable queries,
> etc.
> >>> >> > #-[?*!@=]
> >>> >> > +.
> >>> >> >
> >>> >> > # skip URLs with slash-delimited segment that repeats 3+ times, to
> >>> break
> >>> >> loops
> >>> >> > #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
> >>> >> > +.
> >>> >> >
> >>> >> > # accept anything else
> >>> >> > +.
> >>> >> >
> >>> >> > So after this, I run a generate command -topN 5 -depth 5, and
> then a
> >>> >> > fetch all, but I keep on getting a single page fetched. What am I
> >>> >> > doing wrong? Thanks again for your help.
> >>> >> >
> >>> >> >
> >>> >> > Renato M.
> >>> >> >
> >>> >> > 2013/5/12 Tejas Patil <te...@gmail.com>:
> >>> >> >> FYI: You can use anyone of these commands to run the
> regex-urlfilter
> >>> >> rules
> >>> >> >> against any given url:
> >>> >> >>
> >>> >> >> bin/nutch plugin urlfilter-regex
> >>> >> >> org.apache.nutch.urlfilter.regex.RegexURLFilter
> >>> >> >> OR
> >>> >> >> bin/nutch org.apache.nutch.net.URLFilterChecker -filterName
> >>> >> >> org.apache.nutch.urlfilter.regex.RegexURLFilter
> >>> >> >>
> >>> >> >> Both of them accept input url one at a time from stdin.
> >>> >> >> The later one has a param which can enable you to test a given
> url
> >>> >> against
> >>> >> >> several url filters at once. See its usage for more details.
> >>> >> >>
> >>> >> >>
> >>> >> >>
> >>> >> >> On Sun, May 12, 2013 at 2:02 AM, Tejas Patil <
> >>> tejas.patil.cs@gmail.com
> >>> >> >wrote:
> >>> >> >>
> >>> >> >>> If there is no restriction on the number at the end of the url,
> you
> >>> >> might
> >>> >> >>> just use this:
> >>> >> >>> (note that the rule must be above the one which filters urls
> with a
> >>> "?"
> >>> >> >>> character)
> >>> >> >>>
> >>> >> >>> *+http://www.xyz.com/\?page=*
> >>> >> >>> *
> >>> >> >>> *
> >>> >> >>> *# skip URLs containing certain characters as probable queries,
> >>> etc.*
> >>> >> >>> *-[?*!@=]*
> >>> >> >>>
> >>> >> >>>
> >>> >> >>>
> >>> >> >>>
> >>> >> >>> On Sun, May 12, 2013 at 12:40 AM, Renato Marroquín Mogrovejo <
> >>> >> >>> renatoj.marroquin@gmail.com> wrote:
> >>> >> >>>
> >>> >> >>>> Hi all,
> >>> >> >>>>
> >>> >> >>>> I have been trying to fetch a query similar to:
> >>> >> >>>>
> >>> >> >>>> http://www.xyz.com/?page=1
> >>> >> >>>>
> >>> >> >>>> But where the number can vary from 1 to 100. Inside the first
> page
> >>> >> >>>> there are links to the next ones. So I updated the
> >>> >> >>>> conf/regex-urlfilter file and added:
> >>> >> >>>>
> >>> >> >>>> ^[0-9]{1,45}$
> >>> >> >>>>
> >>> >> >>>> When I do this, the generate job fails saying that it is
> "Invalid
> >>> >> >>>> first character". I have tried generating with topN 5 and
> depth 5
> >>> and
> >>> >> >>>> trying to fetch more urls but that does not work.
> >>> >> >>>>
> >>> >> >>>> Could anyone advise me on how to accomplish this? I am running
> >>> Nutch
> >>> >> 2.x.
> >>> >> >>>> Thanks in advance!
> >>> >> >>>>
> >>> >> >>>>
> >>> >> >>>> Renato M.
> >>> >> >>>>
> >>> >> >>>
> >>> >> >>>
> >>> >>
> >>>
>

Re: Fetching a specific number of urls

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.
Well, I have managed to get the same results as you (I think). Now
my crawldb contains links with the following structure:

+http://www.xyz.com/\?page=*

But there are also many other links. How can I get only the links
in the above format? I mean ignoring all the others and keeping only
the ones with the same structure.
I have also noticed something interesting: if I use

./bin/nutch generate -topN 10 -numFetchers 1 -depth 10  -noFilter -adddays 0

I only get the same seed url and no others. Is this caused by the
depth parameter?
Thanks again!


Renato M.


2013/5/16 Renato Marroquín Mogrovejo <re...@gmail.com>:
> Hi Tejas,
>
> Thank you very much for your help again.
> But I'm sorry to inform that I am still not able to get the next link
> into my crawldb. I am thinking that my conf/regex-urlfilter.txt file
> is not properly set up. I am sending the content of this file, could
> you help me determining what is wrong with it?
> Thanks a ton in advanced!
>
>
> Renato M.
>
>
> # skip file: ftp: and mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> # for a more extensive coverage use the urlfilter-suffix plugin
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
>
> #+http://www.xyz.com/\?page=*
> +http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=*
>
> # skip URLs containing certain characters as probable queries, etc.
> #-[?*!@=]
> +.
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
> +.
>
> # accept anything else
> +.
>
> 2013/5/13 Tejas Patil <te...@gmail.com>:
>> Hi Renato,
>>
>> The default content limit for http protocol is 65536 while the webpage is
>> much bigger than that. The relevant config needs to be updated.
>> Add this to the conf/nutch-site.xml:
>>
>> *<property>*
>> *  <name>http.content.limit</name>*
>> *  <value>240000</value>*
>> *  <description>The length limit for downloaded content using the http*
>> *  protocol, in bytes. If this value is nonnegative (>=0), content longer*
>> *  than it will be truncated; otherwise, no truncation at all. Do not*
>> *  confuse this setting with the file.content.limit setting.*
>> *  </description>*
>> *</property>*
>>
>> I got a connection timed out error post this config change above (it makes
>> sense as the content to be downloaded is more).
>> So I added this to the conf/nutch-site.xml:
>>
>> *<property>*
>> *  <name>http.timeout</name>*
>> *  <value>1000000</value>*
>> *  <description>The default network timeout, in milliseconds.</description>*
>> *</property>*
>>
>> After running a fresh crawl, I could see the link to the next page in the
>> crawldb:
>>
>> *
>> http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2
>> key:
>>  net.telelistas.www:http/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2
>> *
>> *baseUrl:        null*
>> *status: 1 (status_unfetched)*
>> *fetchTime:      1368424541731*
>> *prevFetchTime:  0*
>> *fetchInterval:  2592000*
>> *retriesSinceFetch:      0*
>> *modifiedTime:   0*
>> *prevModifiedTime:       0*
>> *protocolStatus: (null)*
>> *parseStatus:    (null)*
>> *title:  null*
>> *score:  0.0042918455*
>> *markers:        {dist=1}*
>> *reprUrl:        null*
>> *metadata _csh_ :        ;���*
>>
>> HTH
>>
>>
>> On Sun, May 12, 2013 at 10:21 PM, Renato Marroquín Mogrovejo <
>> renatoj.marroquin@gmail.com> wrote:
>>
>>> Hi Tejas,
>>>
>>> So I started fresh. I deleted the webpage keyspace as I am using
>>> Cassandra as a backend. But I did get the same output. I mean I get a
>>> bunch of urls after I do a readdb -dump but not the ones I want. I get
>>> only one fetched site, and many links parsed (to be parsed in the next
>>> cycle?). Maybe it has to do something with the urls I am trying to
>>> get?
>>> I am trying to get this url and similar ones:
>>>
>>>
>>> http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=1
>>>
>>> But I have noticed that the links pointing to the next ones are
>>> something like this:
>>>
>>> <a class="resultado_roda"
>>> href="/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2">2 </a>
>>>
>>> So I decided to try commenting this url rule:
>>> # skip URLs with slash-delimited segment that repeats 3+ times, to break
>>> loops
>>> #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
>>>
>>> But I got the same results. A single site fetched, some urls parsed
>>> but not the ones I want using the regex-urlfilter.txt. Any Ideas?
>>> Thanks a ton for your help Tejas!
>>>
>>>
>>> Renato M.
>>>
>>>
>>> 2013/5/12 Tejas Patil <te...@gmail.com>:
>>> > Hi Renato,
>>> >
>>> > Thats weird. I ran a crawl over similar urls having a query in the end (
>>> > http://ngs.ics.uci.edu/blog/?p=338) and it worked fine for me with 2.x.
>>> > My guess is that there is something wrong while parsing due to which
>>> > outlinks are not getting into the crawldb.
>>> >
>>> > Start from fresh. Clear everything from previous attempts. (including the
>>> > backend table named as the value of 'storage.schema.webpage').
>>> > Run these :
>>> > bin/nutch inject *<urldir>*
>>> > bin/nutch generate -topN 50 -numFetchers 1 -noFilter -adddays 0
>>> > bin/nutch fetch *<batchID>* -threads 2
>>> > bin/nutch parse *<batchID> *
>>> > bin/nutch updatedb
>>> > bin/nutch readdb -dump <*output dir*>
>>> >
>>> > The readdb output will shown if the outlinks were extracted correctly.
>>> >
>>> > The commands for checking urlfilter rules accept one input url at a time
>>> > from console (you need to type/paste the url and hit enter).
>>> > It shows "+" if the url is accepted by the current rules. ("-" for
>>> > rejection).
>>> >
>>> > Thanks,
>>> > Tejas
>>> >
>>> > On Sun, May 12, 2013 at 10:30 AM, Renato Marroquín Mogrovejo <
>>> > renatoj.marroquin@gmail.com> wrote:
>>> >
>>> >> And I did try the commands you told me but I am not sure how they
>>> >> work. They do wait for an url to be input, but then it prints the url
>>> >> with a '+' at the beginning, what does that mean?
>>> >>
>>> >> http://www.xyz.com/lanchon
>>> >> +http://www.xyz.com/lanchon
>>> >>
>>> >> 2013/5/12 Renato Marroquín Mogrovejo <re...@gmail.com>:
>>> >> > Hi Tejas,
>>> >> >
>>> >> > Thanks for your help. I have tried the expression you suggested, and
>>> >> > now my url-filter file is like this:
>>> >> > +http://www.xyz.com/\?page=*
>>> >> >
>>> >> > # skip URLs containing certain characters as probable queries, etc.
>>> >> > #-[?*!@=]
>>> >> > +.
>>> >> >
>>> >> > # skip URLs with slash-delimited segment that repeats 3+ times, to
>>> break
>>> >> loops
>>> >> > #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
>>> >> > +.
>>> >> >
>>> >> > # accept anything else
>>> >> > +.
>>> >> >
>>> >> > So after this, I run a generate command -topN 5 -depth 5, and then a
>>> >> > fetch all, but I keep on getting a single page fetched. What am I
>>> >> > doing wrong? Thanks again for your help.
>>> >> >
>>> >> >
>>> >> > Renato M.
>>> >> >
>>> >> > 2013/5/12 Tejas Patil <te...@gmail.com>:
>>> >> >> FYI: You can use anyone of these commands to run the regex-urlfilter
>>> >> rules
>>> >> >> against any given url:
>>> >> >>
>>> >> >> bin/nutch plugin urlfilter-regex
>>> >> >> org.apache.nutch.urlfilter.regex.RegexURLFilter
>>> >> >> OR
>>> >> >> bin/nutch org.apache.nutch.net.URLFilterChecker -filterName
>>> >> >> org.apache.nutch.urlfilter.regex.RegexURLFilter
>>> >> >>
>>> >> >> Both of them accept input url one at a time from stdin.
>>> >> >> The later one has a param which can enable you to test a given url
>>> >> against
>>> >> >> several url filters at once. See its usage for more details.
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> On Sun, May 12, 2013 at 2:02 AM, Tejas Patil <
>>> tejas.patil.cs@gmail.com
>>> >> >wrote:
>>> >> >>
>>> >> >>> If there is no restriction on the number at the end of the url, you
>>> >> might
>>> >> >>> just use this:
>>> >> >>> (note that the rule must be above the one which filters urls with a
>>> "?"
>>> >> >>> character)
>>> >> >>>
>>> >> >>> *+http://www.xyz.com/\?page=*
>>> >> >>> *
>>> >> >>> *
>>> >> >>> *# skip URLs containing certain characters as probable queries,
>>> etc.*
>>> >> >>> *-[?*!@=]*
>>> >> >>>
>>> >> >>>
>>> >> >>>
>>> >> >>>
>>> >> >>> On Sun, May 12, 2013 at 12:40 AM, Renato Marroquín Mogrovejo <
>>> >> >>> renatoj.marroquin@gmail.com> wrote:
>>> >> >>>
>>> >> >>>> Hi all,
>>> >> >>>>
>>> >> >>>> I have been trying to fetch a query similar to:
>>> >> >>>>
>>> >> >>>> http://www.xyz.com/?page=1
>>> >> >>>>
>>> >> >>>> But where the number can vary from 1 to 100. Inside the first page
>>> >> >>>> there are links to the next ones. So I updated the
>>> >> >>>> conf/regex-urlfilter file and added:
>>> >> >>>>
>>> >> >>>> ^[0-9]{1,45}$
>>> >> >>>>
>>> >> >>>> When I do this, the generate job fails saying that it is "Invalid
>>> >> >>>> first character". I have tried generating with topN 5 and depth 5
>>> and
>>> >> >>>> trying to fetch more urls but that does not work.
>>> >> >>>>
>>> >> >>>> Could anyone advise me on how to accomplish this? I am running
>>> Nutch
>>> >> 2.x.
>>> >> >>>> Thanks in advance!
>>> >> >>>>
>>> >> >>>>
>>> >> >>>> Renato M.
>>> >> >>>>
>>> >> >>>
>>> >> >>>
>>> >>
>>>

Re: Fetching a specific number of urls

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.
Hi Tejas,

Thank you very much for your help again.
But I'm sorry to report that I am still not able to get the next link
into my crawldb. I think my conf/regex-urlfilter.txt file
is not properly set up. I am sending the contents of this file; could
you help me determine what is wrong with it?
Thanks a ton in advance!


Renato M.


# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

#+http://www.xyz.com/\?page=*
+http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=*

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]
+.

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/
+.

# accept anything else
+.
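
As a side check, a way to see whether this exact file accepts the pagina urls (a sketch using the checker command Tejas mentioned earlier in the thread, run from the Nutch runtime directory, pasting the url on stdin):

bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter
http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2

A "+" prefix on the echoed url means the rules accept it; a "-" prefix means they reject it.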

2013/5/13 Tejas Patil <te...@gmail.com>:
> Hi Renato,
>
> The default content limit for http protocol is 65536 while the webpage is
> much bigger than that. The relevant config needs to be updated.
> Add this to the conf/nutch-site.xml:
>
> *<property>*
> *  <name>http.content.limit</name>*
> *  <value>240000</value>*
> *  <description>The length limit for downloaded content using the http*
> *  protocol, in bytes. If this value is nonnegative (>=0), content longer*
> *  than it will be truncated; otherwise, no truncation at all. Do not*
> *  confuse this setting with the file.content.limit setting.*
> *  </description>*
> *</property>*
>
> I got a connection timed out error post this config change above (it makes
> sense as the content to be downloaded is more).
> So I added this to the conf/nutch-site.xml:
>
> *<property>*
> *  <name>http.timeout</name>*
> *  <value>1000000</value>*
> *  <description>The default network timeout, in milliseconds.</description>*
> *</property>*
>
> After running a fresh crawl, I could see the link to the next page in the
> crawldb:
>
> *
> http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2
> key:
>  net.telelistas.www:http/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2
> *
> *baseUrl:        null*
> *status: 1 (status_unfetched)*
> *fetchTime:      1368424541731*
> *prevFetchTime:  0*
> *fetchInterval:  2592000*
> *retriesSinceFetch:      0*
> *modifiedTime:   0*
> *prevModifiedTime:       0*
> *protocolStatus: (null)*
> *parseStatus:    (null)*
> *title:  null*
> *score:  0.0042918455*
> *markers:        {dist=1}*
> *reprUrl:        null*
> *metadata _csh_ :        ;���*
>
> HTH
>
>
> On Sun, May 12, 2013 at 10:21 PM, Renato Marroquín Mogrovejo <
> renatoj.marroquin@gmail.com> wrote:
>
>> Hi Tejas,
>>
>> So I started fresh. I deleted the webpage keyspace as I am using
>> Cassandra as a backend. But I did get the same output. I mean I get a
>> bunch of urls after I do a readdb -dump but not the ones I want. I get
>> only one fetched site, and many links parsed (to be parsed in the next
>> cycle?). Maybe it has to do something with the urls I am trying to
>> get?
>> I am trying to get this url and similar ones:
>>
>>
>> http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=1
>>
>> But I have noticed that the links pointing to the next ones are
>> something like this:
>>
>> <a class="resultado_roda"
>> href="/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2">2 </a>
>>
>> So I decided to try commenting this url rule:
>> # skip URLs with slash-delimited segment that repeats 3+ times, to break
>> loops
>> #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
>>
>> But I got the same results. A single site fetched, some urls parsed
>> but not the ones I want using the regex-urlfilter.txt. Any Ideas?
>> Thanks a ton for your help Tejas!
>>
>>
>> Renato M.
>>
>>
>> 2013/5/12 Tejas Patil <te...@gmail.com>:
>> > Hi Renato,
>> >
>> > Thats weird. I ran a crawl over similar urls having a query in the end (
>> > http://ngs.ics.uci.edu/blog/?p=338) and it worked fine for me with 2.x.
>> > My guess is that there is something wrong while parsing due to which
>> > outlinks are not getting into the crawldb.
>> >
>> > Start from fresh. Clear everything from previous attempts. (including the
>> > backend table named as the value of 'storage.schema.webpage').
>> > Run these :
>> > bin/nutch inject *<urldir>*
>> > bin/nutch generate -topN 50 -numFetchers 1 -noFilter -adddays 0
>> > bin/nutch fetch *<batchID>* -threads 2
>> > bin/nutch parse *<batchID> *
>> > bin/nutch updatedb
>> > bin/nutch readdb -dump <*output dir*>
>> >
>> > The readdb output will shown if the outlinks were extracted correctly.
>> >
>> > The commands for checking urlfilter rules accept one input url at a time
>> > from console (you need to type/paste the url and hit enter).
>> > It shows "+" if the url is accepted by the current rules. ("-" for
>> > rejection).
>> >
>> > Thanks,
>> > Tejas
>> >
>> > On Sun, May 12, 2013 at 10:30 AM, Renato Marroquín Mogrovejo <
>> > renatoj.marroquin@gmail.com> wrote:
>> >
>> >> And I did try the commands you told me but I am not sure how they
>> >> work. They do wait for an url to be input, but then it prints the url
>> >> with a '+' at the beginning, what does that mean?
>> >>
>> >> http://www.xyz.com/lanchon
>> >> +http://www.xyz.com/lanchon
>> >>
>> >> 2013/5/12 Renato Marroquín Mogrovejo <re...@gmail.com>:
>> >> > Hi Tejas,
>> >> >
>> >> > Thanks for your help. I have tried the expression you suggested, and
>> >> > now my url-filter file is like this:
>> >> > +http://www.xyz.com/\?page=*
>> >> >
>> >> > # skip URLs containing certain characters as probable queries, etc.
>> >> > #-[?*!@=]
>> >> > +.
>> >> >
>> >> > # skip URLs with slash-delimited segment that repeats 3+ times, to
>> break
>> >> loops
>> >> > #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
>> >> > +.
>> >> >
>> >> > # accept anything else
>> >> > +.
>> >> >
>> >> > So after this, I run a generate command -topN 5 -depth 5, and then a
>> >> > fetch all, but I keep on getting a single page fetched. What am I
>> >> > doing wrong? Thanks again for your help.
>> >> >
>> >> >
>> >> > Renato M.
>> >> >
>> >> > 2013/5/12 Tejas Patil <te...@gmail.com>:
>> >> >> FYI: You can use anyone of these commands to run the regex-urlfilter
>> >> rules
>> >> >> against any given url:
>> >> >>
>> >> >> bin/nutch plugin urlfilter-regex
>> >> >> org.apache.nutch.urlfilter.regex.RegexURLFilter
>> >> >> OR
>> >> >> bin/nutch org.apache.nutch.net.URLFilterChecker -filterName
>> >> >> org.apache.nutch.urlfilter.regex.RegexURLFilter
>> >> >>
>> >> >> Both of them accept input url one at a time from stdin.
>> >> >> The later one has a param which can enable you to test a given url
>> >> against
>> >> >> several url filters at once. See its usage for more details.
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Sun, May 12, 2013 at 2:02 AM, Tejas Patil <
>> tejas.patil.cs@gmail.com
>> >> >wrote:
>> >> >>
>> >> >>> If there is no restriction on the number at the end of the url, you
>> >> might
>> >> >>> just use this:
>> >> >>> (note that the rule must be above the one which filters urls with a
>> "?"
>> >> >>> character)
>> >> >>>
>> >> >>> *+http://www.xyz.com/\?page=*
>> >> >>> *
>> >> >>> *
>> >> >>> *# skip URLs containing certain characters as probable queries,
>> etc.*
>> >> >>> *-[?*!@=]*
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> On Sun, May 12, 2013 at 12:40 AM, Renato Marroquín Mogrovejo <
>> >> >>> renatoj.marroquin@gmail.com> wrote:
>> >> >>>
>> >> >>>> Hi all,
>> >> >>>>
>> >> >>>> I have been trying to fetch a query similar to:
>> >> >>>>
>> >> >>>> http://www.xyz.com/?page=1
>> >> >>>>
>> >> >>>> But where the number can vary from 1 to 100. Inside the first page
>> >> >>>> there are links to the next ones. So I updated the
>> >> >>>> conf/regex-urlfilter file and added:
>> >> >>>>
>> >> >>>> ^[0-9]{1,45}$
>> >> >>>>
>> >> >>>> When I do this, the generate job fails saying that it is "Invalid
>> >> >>>> first character". I have tried generating with topN 5 and depth 5
>> and
>> >> >>>> trying to fetch more urls but that does not work.
>> >> >>>>
>> >> >>>> Could anyone advise me on how to accomplish this? I am running
>> Nutch
>> >> 2.x.
>> >> >>>> Thanks in advance!
>> >> >>>>
>> >> >>>>
>> >> >>>> Renato M.
>> >> >>>>
>> >> >>>
>> >> >>>
>> >>
>>

Re: Fetching a specific number of urls

Posted by Tejas Patil <te...@gmail.com>.
Hi Renato,

The default content limit for the http protocol is 65536 bytes, while the webpage
is much bigger than that. The relevant config needs to be updated.
Add this to conf/nutch-site.xml:

<property>
  <name>http.content.limit</name>
  <value>240000</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>

I got a connection timed out error after this config change (which makes
sense, as there is more content to download).
So I added this to conf/nutch-site.xml:

<property>
  <name>http.timeout</name>
  <value>1000000</value>
  <description>The default network timeout, in milliseconds.</description>
</property>
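
Taken together, a minimal conf/nutch-site.xml carrying both overrides might look like this (just a sketch combining the two snippets above; the <configuration> element is the standard Hadoop-style wrapper for this file):

<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.content.limit</name>
    <value>240000</value>
  </property>
  <property>
    <name>http.timeout</name>
    <value>1000000</value>
  </property>
</configuration>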

After running a fresh crawl, I could see the link to the next page in the
crawldb:

http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2
key:    net.telelistas.www:http/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2
baseUrl:        null
status: 1 (status_unfetched)
fetchTime:      1368424541731
prevFetchTime:  0
fetchInterval:  2592000
retriesSinceFetch:      0
modifiedTime:   0
prevModifiedTime:       0
protocolStatus: (null)
parseStatus:    (null)
title:  null
score:  0.0042918455
markers:        {dist=1}
reprUrl:        null
metadata _csh_ :        ;���
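
A quick way to confirm that the pagina urls actually landed in the dump (a sketch; crawl_dump is a hypothetical output directory passed to readdb -dump, not a path from this thread):

bin/nutch readdb -dump crawl_dump
grep -r "pagina=" crawl_dump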

HTH


On Sun, May 12, 2013 at 10:21 PM, Renato Marroquín Mogrovejo <
renatoj.marroquin@gmail.com> wrote:

> Hi Tejas,
>
> So I started fresh. I deleted the webpage keyspace as I am using
> Cassandra as a backend. But I did get the same output. I mean I get a
> bunch of urls after I do a readdb -dump but not the ones I want. I get
> only one fetched site, and many links parsed (to be parsed in the next
> cycle?). Maybe it has to do something with the urls I am trying to
> get?
> I am trying to get this url and similar ones:
>
>
> http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=1
>
> But I have noticed that the links pointing to the next ones are
> something like this:
>
> <a class="resultado_roda"
> href="/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2">2 </a>
>
> So I decided to try commenting this url rule:
> # skip URLs with slash-delimited segment that repeats 3+ times, to break
> loops
> #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
>
> But I got the same results. A single site fetched, some urls parsed
> but not the ones I want using the regex-urlfilter.txt. Any Ideas?
> Thanks a ton for your help Tejas!
>
>
> Renato M.
>
>
> 2013/5/12 Tejas Patil <te...@gmail.com>:
> > Hi Renato,
> >
> > Thats weird. I ran a crawl over similar urls having a query in the end (
> > http://ngs.ics.uci.edu/blog/?p=338) and it worked fine for me with 2.x.
> > My guess is that there is something wrong while parsing due to which
> > outlinks are not getting into the crawldb.
> >
> > Start from fresh. Clear everything from previous attempts. (including the
> > backend table named as the value of 'storage.schema.webpage').
> > Run these :
> > bin/nutch inject *<urldir>*
> > bin/nutch generate -topN 50 -numFetchers 1 -noFilter -adddays 0
> > bin/nutch fetch *<batchID>* -threads 2
> > bin/nutch parse *<batchID> *
> > bin/nutch updatedb
> > bin/nutch readdb -dump <*output dir*>
> >
> > The readdb output will shown if the outlinks were extracted correctly.
> >
> > The commands for checking urlfilter rules accept one input url at a time
> > from console (you need to type/paste the url and hit enter).
> > It shows "+" if the url is accepted by the current rules. ("-" for
> > rejection).
> >
> > Thanks,
> > Tejas
> >
> > On Sun, May 12, 2013 at 10:30 AM, Renato Marroquín Mogrovejo <
> > renatoj.marroquin@gmail.com> wrote:
> >
> >> And I did try the commands you told me but I am not sure how they
> >> work. They do wait for an url to be input, but then it prints the url
> >> with a '+' at the beginning, what does that mean?
> >>
> >> http://www.xyz.com/lanchon
> >> +http://www.xyz.com/lanchon
> >>
> >> 2013/5/12 Renato Marroquín Mogrovejo <re...@gmail.com>:
> >> > Hi Tejas,
> >> >
> >> > Thanks for your help. I have tried the expression you suggested, and
> >> > now my url-filter file is like this:
> >> > +http://www.xyz.com/\?page=*
> >> >
> >> > # skip URLs containing certain characters as probable queries, etc.
> >> > #-[?*!@=]
> >> > +.
> >> >
> >> > # skip URLs with slash-delimited segment that repeats 3+ times, to
> break
> >> loops
> >> > #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
> >> > +.
> >> >
> >> > # accept anything else
> >> > +.
> >> >
> >> > So after this, I run a generate command -topN 5 -depth 5, and then a
> >> > fetch all, but I keep on getting a single page fetched. What am I
> >> > doing wrong? Thanks again for your help.
> >> >
> >> >
> >> > Renato M.
> >> >
> >> > 2013/5/12 Tejas Patil <te...@gmail.com>:
> >> >> FYI: You can use anyone of these commands to run the regex-urlfilter
> >> rules
> >> >> against any given url:
> >> >>
> >> >> bin/nutch plugin urlfilter-regex
> >> >> org.apache.nutch.urlfilter.regex.RegexURLFilter
> >> >> OR
> >> >> bin/nutch org.apache.nutch.net.URLFilterChecker -filterName
> >> >> org.apache.nutch.urlfilter.regex.RegexURLFilter
> >> >>
> >> >> Both of them accept input url one at a time from stdin.
> >> >> The later one has a param which can enable you to test a given url
> >> against
> >> >> several url filters at once. See its usage for more details.
> >> >>
> >> >>
> >> >>
> >> >> On Sun, May 12, 2013 at 2:02 AM, Tejas Patil <
> tejas.patil.cs@gmail.com
> >> >wrote:
> >> >>
> >> >>> If there is no restriction on the number at the end of the url, you
> >> might
> >> >>> just use this:
> >> >>> (note that the rule must be above the one which filters urls with a
> "?"
> >> >>> character)
> >> >>>
> >> >>> *+http://www.xyz.com/\?page=*
> >> >>> *
> >> >>> *
> >> >>> *# skip URLs containing certain characters as probable queries,
> etc.*
> >> >>> *-[?*!@=]*
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>> On Sun, May 12, 2013 at 12:40 AM, Renato Marroquín Mogrovejo <
> >> >>> renatoj.marroquin@gmail.com> wrote:
> >> >>>
> >> >>>> Hi all,
> >> >>>>
> >> >>>> I have been trying to fetch a query similar to:
> >> >>>>
> >> >>>> http://www.xyz.com/?page=1
> >> >>>>
> >> >>>> But where the number can vary from 1 to 100. Inside the first page
> >> >>>> there are links to the next ones. So I updated the
> >> >>>> conf/regex-urlfilter file and added:
> >> >>>>
> >> >>>> ^[0-9]{1,45}$
> >> >>>>
> >> >>>> When I do this, the generate job fails saying that it is "Invalid
> >> >>>> first character". I have tried generating with topN 5 and depth 5
> and
> >> >>>> trying to fetch more urls but that does not work.
> >> >>>>
> >> >>>> Could anyone advise me on how to accomplish this? I am running
> Nutch
> >> 2.x.
> >> >>>> Thanks in advance!
> >> >>>>
> >> >>>>
> >> >>>> Renato M.
> >> >>>>
> >> >>>
> >> >>>
> >>
>

Re: Fetching a specific number of urls

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.
Hi Tejas,

So I started fresh. I deleted the webpage keyspace, as I am using
Cassandra as the backend. But I got the same output: I get a
bunch of urls after I do a readdb -dump, but not the ones I want. I get
only one fetched site and many parsed-out links (to be fetched in the next
cycle?). Maybe it has something to do with the urls I am trying to
get?
I am trying to get this url and similar ones:

http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=1

But I have noticed that the links pointing to the next pages look
something like this:

<a class="resultado_roda"
href="/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2">2 </a>

So I decided to try commenting out this url rule:
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/

But I got the same results: a single site fetched and some urls parsed,
but not the ones I want to allow via regex-urlfilter.txt. Any ideas?
Thanks a ton for your help, Tejas!


Renato M.


2013/5/12 Tejas Patil <te...@gmail.com>:
> Hi Renato,
>
> Thats weird. I ran a crawl over similar urls having a query in the end (
> http://ngs.ics.uci.edu/blog/?p=338) and it worked fine for me with 2.x.
> My guess is that there is something wrong while parsing due to which
> outlinks are not getting into the crawldb.
>
> Start from fresh. Clear everything from previous attempts. (including the
> backend table named as the value of 'storage.schema.webpage').
> Run these :
> bin/nutch inject *<urldir>*
> bin/nutch generate -topN 50 -numFetchers 1 -noFilter -adddays 0
> bin/nutch fetch *<batchID>* -threads 2
> bin/nutch parse *<batchID> *
> bin/nutch updatedb
> bin/nutch readdb -dump <*output dir*>
>
> The readdb output will shown if the outlinks were extracted correctly.
>
> The commands for checking urlfilter rules accept one input url at a time
> from console (you need to type/paste the url and hit enter).
> It shows "+" if the url is accepted by the current rules. ("-" for
> rejection).
>
> Thanks,
> Tejas
>
> On Sun, May 12, 2013 at 10:30 AM, Renato Marroquín Mogrovejo <
> renatoj.marroquin@gmail.com> wrote:
>
>> And I did try the commands you told me but I am not sure how they
>> work. They do wait for an url to be input, but then it prints the url
>> with a '+' at the beginning, what does that mean?
>>
>> http://www.xyz.com/lanchon
>> +http://www.xyz.com/lanchon
>>
>> 2013/5/12 Renato Marroquín Mogrovejo <re...@gmail.com>:
>> > Hi Tejas,
>> >
>> > Thanks for your help. I have tried the expression you suggested, and
>> > now my url-filter file is like this:
>> > +http://www.xyz.com/\?page=*
>> >
>> > # skip URLs containing certain characters as probable queries, etc.
>> > #-[?*!@=]
>> > +.
>> >
>> > # skip URLs with slash-delimited segment that repeats 3+ times, to break
>> loops
>> > #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
>> > +.
>> >
>> > # accept anything else
>> > +.
>> >
>> > So after this, I run a generate command -topN 5 -depth 5, and then a
>> > fetch all, but I keep on getting a single page fetched. What am I
>> > doing wrong? Thanks again for your help.
>> >
>> >
>> > Renato M.
>> >
>> > 2013/5/12 Tejas Patil <te...@gmail.com>:
>> >> FYI: You can use anyone of these commands to run the regex-urlfilter
>> rules
>> >> against any given url:
>> >>
>> >> bin/nutch plugin urlfilter-regex
>> >> org.apache.nutch.urlfilter.regex.RegexURLFilter
>> >> OR
>> >> bin/nutch org.apache.nutch.net.URLFilterChecker -filterName
>> >> org.apache.nutch.urlfilter.regex.RegexURLFilter
>> >>
>> >> Both of them accept input url one at a time from stdin.
>> >> The later one has a param which can enable you to test a given url
>> against
>> >> several url filters at once. See its usage for more details.
>> >>
>> >>
>> >>
>> >> On Sun, May 12, 2013 at 2:02 AM, Tejas Patil <tejas.patil.cs@gmail.com
>> >wrote:
>> >>
>> >>> If there is no restriction on the number at the end of the url, you
>> might
>> >>> just use this:
>> >>> (note that the rule must be above the one which filters urls with a "?"
>> >>> character)
>> >>>
>> >>> *+http://www.xyz.com/\?page=*
>> >>> *
>> >>> *
>> >>> *# skip URLs containing certain characters as probable queries, etc.*
>> >>> *-[?*!@=]*
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On Sun, May 12, 2013 at 12:40 AM, Renato Marroquín Mogrovejo <
>> >>> renatoj.marroquin@gmail.com> wrote:
>> >>>
>> >>>> Hi all,
>> >>>>
>> >>>> I have been trying to fetch a query similar to:
>> >>>>
>> >>>> http://www.xyz.com/?page=1
>> >>>>
>> >>>> But where the number can vary from 1 to 100. Inside the first page
>> >>>> there are links to the next ones. So I updated the
>> >>>> conf/regex-urlfilter file and added:
>> >>>>
>> >>>> ^[0-9]{1,45}$
>> >>>>
>> >>>> When I do this, the generate job fails saying that it is "Invalid
>> >>>> first character". I have tried generating with topN 5 and depth 5 and
>> >>>> trying to fetch more urls but that does not work.
>> >>>>
>> >>>> Could anyone advise me on how to accomplish this? I am running Nutch
>> 2.x.
>> >>>> Thanks in advance!
>> >>>>
>> >>>>
>> >>>> Renato M.
>> >>>>
>> >>>
>> >>>
>>

Re: Fetching a specific number of urls

Posted by Tejas Patil <te...@gmail.com>.
Hi Renato,

That's weird. I ran a crawl over similar urls with a query at the end
(http://ngs.ics.uci.edu/blog/?p=338) and it worked fine for me with 2.x.
My guess is that something goes wrong during parsing, due to which the
outlinks are not getting into the crawldb.

Start fresh. Clear everything from previous attempts (including the
backend table named by the value of 'storage.schema.webpage').
Run these:
bin/nutch inject <urldir>
bin/nutch generate -topN 50 -numFetchers 1 -noFilter -adddays 0
bin/nutch fetch <batchID> -threads 2
bin/nutch parse <batchID>
bin/nutch updatedb
bin/nutch readdb -dump <output dir>
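
For concreteness, a single cycle with hypothetical placeholder values (urls/, 1368424541-12345, and crawl_dump below are made up for illustration, not values from this thread; the actual batch id should come from the output of the generate step) might look like:

bin/nutch inject urls/
bin/nutch generate -topN 50 -numFetchers 1 -noFilter -adddays 0
bin/nutch fetch 1368424541-12345 -threads 2
bin/nutch parse 1368424541-12345
bin/nutch updatedb
bin/nutch readdb -dump crawl_dump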

The readdb output will show whether the outlinks were extracted correctly.

The commands for checking urlfilter rules accept one input url at a time
from the console (you need to type/paste the url and hit enter).
They show "+" if the url is accepted by the current rules ("-" for
rejection).

Thanks,
Tejas

On Sun, May 12, 2013 at 10:30 AM, Renato Marroquín Mogrovejo <
renatoj.marroquin@gmail.com> wrote:

> And I did try the commands you told me but I am not sure how they
> work. They do wait for an url to be input, but then it prints the url
> with a '+' at the beginning, what does that mean?
>
> http://www.xyz.com/lanchon
> +http://www.xyz.com/lanchon
>
> 2013/5/12 Renato Marroquín Mogrovejo <re...@gmail.com>:
> > Hi Tejas,
> >
> > Thanks for your help. I have tried the expression you suggested, and
> > now my url-filter file is like this:
> > +http://www.xyz.com/\?page=*
> >
> > # skip URLs containing certain characters as probable queries, etc.
> > #-[?*!@=]
> > +.
> >
> > # skip URLs with slash-delimited segment that repeats 3+ times, to break
> loops
> > #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
> > +.
> >
> > # accept anything else
> > +.
> >
> > So after this, I run a generate command -topN 5 -depth 5, and then a
> > fetch all, but I keep on getting a single page fetched. What am I
> > doing wrong? Thanks again for your help.
> >
> >
> > Renato M.
> >
> > 2013/5/12 Tejas Patil <te...@gmail.com>:
> >> FYI: You can use anyone of these commands to run the regex-urlfilter
> rules
> >> against any given url:
> >>
> >> bin/nutch plugin urlfilter-regex
> >> org.apache.nutch.urlfilter.regex.RegexURLFilter
> >> OR
> >> bin/nutch org.apache.nutch.net.URLFilterChecker -filterName
> >> org.apache.nutch.urlfilter.regex.RegexURLFilter
> >>
> >> Both of them accept input url one at a time from stdin.
> >> The later one has a param which can enable you to test a given url
> against
> >> several url filters at once. See its usage for more details.
> >>
> >>
> >>
> >> On Sun, May 12, 2013 at 2:02 AM, Tejas Patil <tejas.patil.cs@gmail.com
> >wrote:
> >>
> >>> If there is no restriction on the number at the end of the url, you
> might
> >>> just use this:
> >>> (note that the rule must be above the one which filters urls with a "?"
> >>> character)
> >>>
> >>> *+http://www.xyz.com/\?page=*
> >>> *
> >>> *
> >>> *# skip URLs containing certain characters as probable queries, etc.*
> >>> *-[?*!@=]*
> >>>
> >>>
> >>>
> >>>
> >>> On Sun, May 12, 2013 at 12:40 AM, Renato Marroquín Mogrovejo <
> >>> renatoj.marroquin@gmail.com> wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> I have been trying to fetch a query similar to:
> >>>>
> >>>> http://www.xyz.com/?page=1
> >>>>
> >>>> But where the number can vary from 1 to 100. Inside the first page
> >>>> there are links to the next ones. So I updated the
> >>>> conf/regex-urlfilter file and added:
> >>>>
> >>>> ^[0-9]{1,45}$
> >>>>
> >>>> When I do this, the generate job fails saying that it is "Invalid
> >>>> first character". I have tried generating with topN 5 and depth 5 and
> >>>> trying to fetch more urls but that does not work.
> >>>>
> >>>> Could anyone advise me on how to accomplish this? I am running Nutch
> 2.x.
> >>>> Thanks in advance!
> >>>>
> >>>>
> >>>> Renato M.
> >>>>
> >>>
> >>>
>

Re: Fetching a specific number of urls

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.
And I did try the commands you told me about, but I am not sure how they
work. They do wait for a url to be input, but then they print the url
with a '+' at the beginning. What does that mean?

http://www.xyz.com/lanchon
+http://www.xyz.com/lanchon

2013/5/12 Renato Marroquín Mogrovejo <re...@gmail.com>:
> Hi Tejas,
>
> Thanks for your help. I have tried the expression you suggested, and
> now my url-filter file is like this:
> +http://www.xyz.com/\?page=*
>
> # skip URLs containing certain characters as probable queries, etc.
> #-[?*!@=]
> +.
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
> +.
>
> # accept anything else
> +.
>
> So after this, I run a generate command -topN 5 -depth 5, and then a
> fetch all, but I keep on getting a single page fetched. What am I
> doing wrong? Thanks again for your help.
>
>
> Renato M.
>
> 2013/5/12 Tejas Patil <te...@gmail.com>:
>> FYI: You can use anyone of these commands to run the regex-urlfilter rules
>> against any given url:
>>
>> bin/nutch plugin urlfilter-regex
>> org.apache.nutch.urlfilter.regex.RegexURLFilter
>> OR
>> bin/nutch org.apache.nutch.net.URLFilterChecker -filterName
>> org.apache.nutch.urlfilter.regex.RegexURLFilter
>>
>> Both of them accept input url one at a time from stdin.
>> The later one has a param which can enable you to test a given url against
>> several url filters at once. See its usage for more details.
>>
>>
>>
>> On Sun, May 12, 2013 at 2:02 AM, Tejas Patil <te...@gmail.com>wrote:
>>
>>> If there is no restriction on the number at the end of the url, you might
>>> just use this:
>>> (note that the rule must be above the one which filters urls with a "?"
>>> character)
>>>
>>> *+http://www.xyz.com/\?page=*
>>> *
>>> *
>>> *# skip URLs containing certain characters as probable queries, etc.*
>>> *-[?*!@=]*
>>>
>>>
>>>
>>>
>>> On Sun, May 12, 2013 at 12:40 AM, Renato Marroquín Mogrovejo <
>>> renatoj.marroquin@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I have been trying to fetch a query similar to:
>>>>
>>>> http://www.xyz.com/?page=1
>>>>
>>>> But where the number can vary from 1 to 100. Inside the first page
>>>> there are links to the next ones. So I updated the
>>>> conf/regex-urlfilter file and added:
>>>>
>>>> ^[0-9]{1,45}$
>>>>
>>>> When I do this, the generate job fails saying that it is "Invalid
>>>> first character". I have tried generating with topN 5 and depth 5 and
>>>> trying to fetch more urls but that does not work.
>>>>
>>>> Could anyone advise me on how to accomplish this? I am running Nutch 2.x.
>>>> Thanks in advance!
>>>>
>>>>
>>>> Renato M.
>>>>
>>>
>>>

Re: Fetching a specific number of urls

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.
Hi Tejas,

Thanks for your help. I have tried the expression you suggested, and
now my url-filter file looks like this:
+http://www.xyz.com/\?page=*

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]
+.

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/
+.

# accept anything else
+.

After this, I run a generate command with -topN 5 -depth 5 and then a
fetch all, but I keep getting a single page fetched. What am I
doing wrong? Thanks again for your help.


Renato M.

2013/5/12 Tejas Patil <te...@gmail.com>:
> FYI: You can use anyone of these commands to run the regex-urlfilter rules
> against any given url:
>
> bin/nutch plugin urlfilter-regex
> org.apache.nutch.urlfilter.regex.RegexURLFilter
> OR
> bin/nutch org.apache.nutch.net.URLFilterChecker -filterName
> org.apache.nutch.urlfilter.regex.RegexURLFilter
>
> Both of them accept input url one at a time from stdin.
> The later one has a param which can enable you to test a given url against
> several url filters at once. See its usage for more details.
>
>
>
> On Sun, May 12, 2013 at 2:02 AM, Tejas Patil <te...@gmail.com>wrote:
>
>> If there is no restriction on the number at the end of the url, you might
>> just use this:
>> (note that the rule must be above the one which filters urls with a "?"
>> character)
>>
>> *+http://www.xyz.com/\?page=*
>> *
>> *
>> *# skip URLs containing certain characters as probable queries, etc.*
>> *-[?*!@=]*
>>
>>
>>
>>
>> On Sun, May 12, 2013 at 12:40 AM, Renato Marroquín Mogrovejo <
>> renatoj.marroquin@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I have been trying to fetch a query similar to:
>>>
>>> http://www.xyz.com/?page=1
>>>
>>> But where the number can vary from 1 to 100. Inside the first page
>>> there are links to the next ones. So I updated the
>>> conf/regex-urlfilter file and added:
>>>
>>> ^[0-9]{1,45}$
>>>
>>> When I do this, the generate job fails saying that it is "Invalid
>>> first character". I have tried generating with topN 5 and depth 5 and
>>> trying to fetch more urls but that does not work.
>>>
>>> Could anyone advise me on how to accomplish this? I am running Nutch 2.x.
>>> Thanks in advance!
>>>
>>>
>>> Renato M.
>>>
>>
>>

Re: Fetching a specific number of urls

Posted by Tejas Patil <te...@gmail.com>.
FYI: You can use either one of these commands to run the regex-urlfilter rules
against any given url:

bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter
OR
bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter

Both of them accept input urls one at a time from stdin.
The latter one has a param which enables you to test a given url against
several url filters at once. See its usage for more details.
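
A non-interactive sketch of the first command, assuming it is run from the Nutch runtime directory (the url is just the example from this thread); a leading "+" in the output means the url is accepted by the current rules, a "-" means it is rejected:

echo "http://www.xyz.com/?page=1" | bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter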



On Sun, May 12, 2013 at 2:02 AM, Tejas Patil <te...@gmail.com>wrote:

> If there is no restriction on the number at the end of the url, you might
> just use this:
> (note that the rule must be above the one which filters urls with a "?"
> character)
>
> *+http://www.xyz.com/\?page=*
> *
> *
> *# skip URLs containing certain characters as probable queries, etc.*
> *-[?*!@=]*
>
>
>
>
> On Sun, May 12, 2013 at 12:40 AM, Renato Marroquín Mogrovejo <
> renatoj.marroquin@gmail.com> wrote:
>
>> Hi all,
>>
>> I have been trying to fetch a query similar to:
>>
>> http://www.xyz.com/?page=1
>>
>> But where the number can vary from 1 to 100. Inside the first page
>> there are links to the next ones. So I updated the
>> conf/regex-urlfilter file and added:
>>
>> ^[0-9]{1,45}$
>>
>> When I do this, the generate job fails saying that it is "Invalid
>> first character". I have tried generating with topN 5 and depth 5 and
>> trying to fetch more urls but that does not work.
>>
>> Could anyone advise me on how to accomplish this? I am running Nutch 2.x.
>> Thanks in advance!
>>
>>
>> Renato M.
>>
>
>

Re: Fetching a specific number of urls

Posted by Tejas Patil <te...@gmail.com>.
If there is no restriction on the number at the end of the url, you might
just use this:
(note that the rule must be above the one which filters urls with a "?"
character)

+http://www.xyz.com/\?page=*

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
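
A minimal sketch of how the whole conf/regex-urlfilter.txt could then be ordered if only these pages should be crawled (this just combines rules quoted elsewhere in this thread; it is an illustration, not a tested config):

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# accept only the paginated listing urls (placed before any reject rules)
+http://www.xyz.com/\?page=*

# reject everything else
-.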




On Sun, May 12, 2013 at 12:40 AM, Renato Marroquín Mogrovejo <
renatoj.marroquin@gmail.com> wrote:

> Hi all,
>
> I have been trying to fetch a query similar to:
>
> http://www.xyz.com/?page=1
>
> But where the number can vary from 1 to 100. Inside the first page
> there are links to the next ones. So I updated the
> conf/regex-urlfilter file and added:
>
> ^[0-9]{1,45}$
>
> When I do this, the generate job fails saying that it is "Invalid
> first character". I have tried generating with topN 5 and depth 5 and
> trying to fetch more urls but that does not work.
>
> Could anyone advise me on how to accomplish this? I am running Nutch 2.x.
> Thanks in advance!
>
>
> Renato M.
>