Posted to user@nutch.apache.org by Nima Falaki <nf...@popsugar.com> on 2014/05/31 02:16:01 UTC

Problem with crawling macys robots.txt

Hello Everyone:

Just a quick question about an issue I discovered while trying to crawl the
macys.com robots.txt. I am using Nutch 1.8 and have tried both crawler-commons 0.3
and crawler-commons 0.4. This is the robots.txt file from macys.com:

User-agent: *
Crawl-delay: 120
Disallow: /compare
Disallow: /registry/wedding/compare
Disallow: /catalog/product/zoom.jsp
Disallow: /search
Disallow: /shop/search
Disallow: /shop/registry/wedding/search
Disallow: *natuzzi*
noindex: *natuzzi*
Disallow: *Natuzzi*
noindex: *Natuzzi*
Disallow:  /bag/add*


When I run this robots.txt through the RobotRulesParser with this URL
(http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=)
I get the following warnings:

2014-05-30 17:02:20,570 WARN  robots.SimpleRobotRulesParser
(SimpleRobotRulesParser.java:reportWarning(456)) - 	Unknown line in
robots.txt file (size 672): noindex: *natuzzi*

2014-05-30 17:02:20,571 WARN  robots.SimpleRobotRulesParser
(SimpleRobotRulesParser.java:reportWarning(456)) - 	Unknown line in
robots.txt file (size 672): noindex: *Natuzzi*

2014-05-30 17:02:20,574 WARN  robots.SimpleRobotRulesParser
(SimpleRobotRulesParser.java:reportWarning(456)) - 	Unknown line in
robots.txt file (size 672): noindex: *natuzzi*

2014-05-30 17:02:20,574 WARN  robots.SimpleRobotRulesParser
(SimpleRobotRulesParser.java:reportWarning(456)) - 	Unknown line in
robots.txt file (size 672): noindex: *Natuzzi*

Is there anything I can do to solve this problem? Is this a problem
with Nutch, or does macys.com just have a really bad robots.txt file?




 <http://www.popsugar.com>
Nima Falaki
Software Engineer
nfalaki@popsugar.com

Re: Problem with crawling macys robots.txt

Posted by Julien Nioche <li...@gmail.com>.
That's why we have fetcher.max.crawl.delay: if a site sets a ridiculously large
Crawl-delay, at least you won't be slowed down too much. See
https://github.com/apache/nutch/blob/trunk/conf/nutch-default.xml#L693



On 4 June 2014 05:10, S.L <si...@gmail.com> wrote:

> Out of curiosity , what if one needs to set the rules of politeness that
> are more realistic , i.e if I want to set the crawl-delay to be a certain
> max value regardless of what a particular site has , which java class
> should I be looking to change , assuming that this cannot be achieved using
> the config parameters. Thanks.



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: Problem with crawling macys robots.txt

Posted by "S.L" <si...@gmail.com>.
Out of curiosity, what if one needs to set rules of politeness that are more
realistic, i.e. if I want to cap the crawl delay at a certain maximum value
regardless of what a particular site has specified? Which Java class should I
be looking to change, assuming that this cannot be achieved using the config
parameters? Thanks.


On Tue, Jun 3, 2014 at 5:52 PM, Sebastian Nagel <wa...@googlemail.com>
wrote:

> > though , I wonder if anyone uses Nutch in production and how they
> overcome
> > this limitation being imposed by sites like macys.com where they have a
> > Crawl-Delay specified?
>
> If you follow the rules of politeness, there will be no way to overcome the
> crawl-delay from robots.txt: crawling will be horribly slow. So slow that
> completeness and freshness seem unreachable targets. But maybe that's
> exactly the intention of the site owner.

Re: Problem with crawling macys robots.txt

Posted by Sebastian Nagel <wa...@googlemail.com>.
> though , I wonder if anyone uses Nutch in production and how they overcome
> this limitation being imposed by sites like macys.com where they have a
> Crawl-Delay specified?

If you follow the rules of politeness, there will be no way to overcome the
crawl-delay from robots.txt: crawling will be horribly slow. So slow that
completeness and freshness seem unreachable targets. But maybe that's
exactly the intention of the site owner.



Re: Problem with crawling macys robots.txt

Posted by "S.L" <si...@gmail.com>.
That's a good piece of info, Nima. It means you won't be able to crawl more
than 720 pages in 24 hours (86,400 seconds / 120 seconds per fetch = 720 fetches),
which sounds like a pretty serious limitation. I wonder if anyone uses Nutch in
production and how they overcome this limitation imposed by sites like macys.com
that specify a Crawl-Delay.




On Tue, Jun 3, 2014 at 3:24 AM, Nima Falaki <nf...@popsugar.com> wrote:

> Nevermind, I figured it out, I adjusted my fetcher.max.crawl.delay
> accordingly and it solved the issue. Macys.com has a crawl-delay of 120,
> nutch by default has a crawl delay of 30, so I had to change that and it
> worked. You guys must either make the crawl delay to -1 (something I dont
> recommend, but I did for example purposes), or to over 120 (for macys.com)
> in order to crawl macys.com

Re: Problem with crawling macys robots.txt

Posted by Nima Falaki <nf...@popsugar.com>.
Never mind, I figured it out: I adjusted my fetcher.max.crawl.delay
accordingly and it solved the issue. Macys.com has a Crawl-Delay of 120 seconds,
while Nutch by default caps the accepted crawl delay at 30 seconds, so I had to
change that and it worked. You must either set fetcher.max.crawl.delay to -1
(something I don't recommend, but I did it for example purposes) or to over 120
(for macys.com) in order to crawl macys.com:

<property>
 <name>fetcher.max.crawl.delay</name>
 <value>-1</value>
 <description>
 If the Crawl-Delay in robots.txt is set to greater than this value (in
 seconds) then the fetcher will skip this page, generating an error report.
 If set to -1 the fetcher will never skip such pages and will wait the
 amount of time retrieved from robots.txt Crawl-Delay, however long that
 might be.
 </description>
</property>
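
For context on why the seed URL was marked db_gone with _pst_=robots_denied(18)
before the change: the fetcher compares the Crawl-Delay parsed from robots.txt
against fetcher.max.crawl.delay and skips the page when the site asks for more
than you are willing to wait. A simplified sketch of that check (not the literal
Nutch source; the class and method below are made up for illustration, and I am
assuming crawler-commons reports the delay in milliseconds, which is what Nutch
compares against):

import crawlercommons.robots.BaseRobotRules;
import org.apache.hadoop.conf.Configuration;

public class CrawlDelayCheck {
  /** True if the page should be skipped and reported as denied by robots. */
  public static boolean exceedsMaxCrawlDelay(Configuration conf, BaseRobotRules rules) {
    // fetcher.max.crawl.delay is configured in seconds; -1 disables the check
    long maxCrawlDelay = conf.getInt("fetcher.max.crawl.delay", 30) * 1000L;
    // Crawl-Delay from robots.txt, in milliseconds (120000 for macys.com)
    long robotsDelay = rules.getCrawlDelay();
    return maxCrawlDelay >= 0 && robotsDelay > maxCrawlDelay;
  }
}

With the macys.com value of 120 seconds and the default maximum of 30 seconds
this returns true, so the URL is never fetched; raising the maximum above 120
(or setting it to -1) makes the check pass.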




-- 



Nima Falaki
Software Engineer
nfalaki@popsugar.com

Re: Problem with crawling macys robots.txt

Posted by Nima Falaki <nf...@popsugar.com>.
Hi Sebastian:

One thing I noticed is that when I tested the robots.txt with
RobotRulesParser, which is in org.apache.nutch.protocol, against the
following URL:
http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=

It gave me this message

2014-06-02 18:27:16,949 WARN  robots.SimpleRobotRulesParser (
SimpleRobotRulesParser.java:reportWarning(452)) - Problem processing
robots.txt for
/Users/nfalaki/shopstyle/apache-nutch-1.8/runtime/local/robots4.txt

2014-06-02 18:27:16,952 WARN  robots.SimpleRobotRulesParser (
SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in
robots.txt file (size 672): noindex: *natuzzi*

2014-06-02 18:27:16,952 WARN  robots.SimpleRobotRulesParser (
SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in
robots.txt file (size 672): noindex: *Natuzzi*

2014-06-02 18:27:16,954 WARN  robots.SimpleRobotRulesParser (
SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in
robots.txt file (size 672): noindex: *natuzzi*

2014-06-02 18:27:16,955 WARN  robots.SimpleRobotRulesParser (
SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in
robots.txt file (size 672): noindex: *Natuzzi*

allowed:
http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=


This is in direct contrast to what happened when I ran the crawl script with
http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=
as my seed URL.

I got this in my crawlDB

http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=
      Version: 7
Status: 3 (db_gone)
Fetch time: Thu Jul 17 18:05:47 PDT 2014
Modified time: Wed Dec 31 16:00:00 PST 1969
Retries since fetch: 0
Retry interval: 3888000 seconds (45 days)
Score: 1.0
Signature: null
Metadata:
        _pst_=robots_denied(18), lastModified=0


Is this a bug in crawler-commons 0.3, where testing the macys robots.txt file
with RobotRulesParser allows the URL, but running the same macys URL as a seed
in the crawl script denies it?


-- 


 <http://www.popsugar.com>
Nima Falaki
Software Engineer
nfalaki@popsugar.com

Re: Problem with crawling macys robots.txt

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Luke, hi Nima,

>     The Robot Exclusion Standard does not mention anything about the "*" character in
> the Disallow: statement.
Indeed the RFC draft [1] does not. However, since Google [2] does support them, wildcard
patterns are frequently used in robots.txt. With crawler-commons 0.4 [3] these rules are
also followed by Nutch (to be included in versions 1.9 and 2.3, respectively).
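
For example, with crawler-commons 0.4 a quick check like the one below (the
class name, agent string and sample URLs are made up for illustration) should
show the *natuzzi* wildcard rule being applied:

import java.nio.charset.StandardCharsets;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class WildcardRuleDemo {
  public static void main(String[] args) {
    // Trimmed-down version of the macys rules, just to show the wildcard handling
    String robotsTxt = "User-agent: *\n"
        + "Disallow: *natuzzi*\n"
        + "Disallow: /search\n";

    BaseRobotRules rules = new SimpleRobotRulesParser().parseContent(
        "http://www.macys.com/robots.txt",
        robotsTxt.getBytes(StandardCharsets.UTF_8),
        "text/plain",
        "anybot"); // agent name is arbitrary here, the block is "User-agent: *"

    // Expected with 0.4: false for the natuzzi URL, true for the other one
    System.out.println(rules.isAllowed("http://www.macys.com/shop/natuzzi-sofa"));
    System.out.println(rules.isAllowed("http://www.macys.com/shop/some-product"));
  }
}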

But the error message is about the noindex lines:
 noindex: *natuzzi*
These lines are redundant (and also invalid, I suppose):
if a page/URL is disallowed, it's not fetched at all,
and will hardly slip into the index.
I think you can ignore the warning.

> One might also question the crawl-delay setting of 120 seconds, but that's another issue...
Yeah, it will take a very long time to crawl the site.
With Nutch the property "fetcher.max.crawl.delay" needs to be adjusted:

<property>
 <name>fetcher.max.crawl.delay</name>
 <value>30</value>
 <description>
 If the Crawl-Delay in robots.txt is set to greater than this value (in
 seconds) then the fetcher will skip this page, generating an error report.
 If set to -1 the fetcher will never skip such pages and will wait the
 amount of time retrieved from robots.txt Crawl-Delay, however long that
 might be.
 </description>
</property>
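
In code terms the check described above boils down to something like this
(a rough sketch against the crawler-commons API, not the actual Nutch
fetcher code; the millisecond conversion is my assumption):

import crawlercommons.robots.BaseRobotRules;

public class CrawlDelayCheck {

  // Sketch of the decision described by the property text above: skip the
  // page if robots.txt asks for a longer delay than we are willing to accept.
  //   rules             = parsed robots.txt rules (crawler-commons)
  //   maxCrawlDelaySecs = value of fetcher.max.crawl.delay, -1 = never skip
  static boolean shouldSkip(BaseRobotRules rules, int maxCrawlDelaySecs) {
    if (maxCrawlDelaySecs < 0) {
      return false; // -1: always wait, however long the delay is
    }
    long robotsDelayMs = rules.getCrawlDelay(); // milliseconds, as far as I can tell
    return robotsDelayMs > maxCrawlDelaySecs * 1000L;
  }
}

With macys.com asking for 120 seconds and the default of 30, such a check
skips the page, which I suspect is what shows up as robots_denied in the
crawldb.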

Cheers,
Sebastian

[1] http://www.robotstxt.org/norobots-rfc.txt
[2] https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
[3] http://crawler-commons.googlecode.com/svn/tags/crawler-commons-0.4/CHANGES.txt

On 05/31/2014 04:27 PM, Luke Mawbey wrote:
> From wikipedia:
>     The Robot Exclusion Standard does not mention anything about the "*" character in
> the Disallow: statement. Some crawlers like Googlebot recognize strings containing "*", while MSNbot
> and Teoma interpret it in different ways.
> 
> So the 'problem' is with Macy's. Really, there is no problem for you: presumably that line is just
> ignored in the robots.txt.
> 
> One might also question the crawl-delay setting of 120 seconds, but that's another issue...
> 
> 
> 
> On 31/05/2014 12:16 AM, Nima Falaki wrote:
>> Hello Everyone:
>>
>> Just have a question about an issue I discovered while trying to crawl the
>> macys robots.txt, I am using nutch 1.8 and used crawler-commons 0.3 and
>> crawler-commons 0.4. This is the robots.txt file from macys
>>
>> User-agent: *
>> Crawl-delay: 120
>> Disallow: /compare
>> Disallow: /registry/wedding/compare
>> Disallow: /catalog/product/zoom.jsp
>> Disallow: /search
>> Disallow: /shop/search
>> Disallow: /shop/registry/wedding/search
>> Disallow: *natuzzi*
>> noindex: *natuzzi*
>> Disallow: *Natuzzi*
>> noindex: *Natuzzi*
>> Disallow:  /bag/add*
>>
>>
>> When I run this robots.txt through the RobotsRulesParser with this url
>> (http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=)
>>
>> I get the following exceptions
>>
>> 2014-05-30 17:02:20,570 WARN  robots.SimpleRobotRulesParser
>> (SimpleRobotRulesParser.java:reportWarning(456)) -     Unknown line in
>> robots.txt file (size 672): noindex: *natuzzi*
>>
>> 2014-05-30 17:02:20,571 WARN  robots.SimpleRobotRulesParser
>> (SimpleRobotRulesParser.java:reportWarning(456)) -     Unknown line in
>> robots.txt file (size 672): noindex: *Natuzzi*
>>
>> 2014-05-30 17:02:20,574 WARN  robots.SimpleRobotRulesParser
>> (SimpleRobotRulesParser.java:reportWarning(456)) -     Unknown line in
>> robots.txt file (size 672): noindex: *natuzzi*
>>
>> 2014-05-30 17:02:20,574 WARN  robots.SimpleRobotRulesParser
>> (SimpleRobotRulesParser.java:reportWarning(456)) -     Unknown line in
>> robots.txt file (size 672): noindex: *Natuzzi*
>>
>> Is there anything I can do to solve this problem? Is this a problem
>> with nutch or does macys.com have a really bad robots.txt file?
>>
>>
>>
>>
>>   <http://www.popsugar.com>
>> Nima Falaki
>> Software Engineer
>> nfalaki@popsugar.com
>>
> 
> 


Re: Problem with crawling macys robots.txt

Posted by Luke Mawbey <ju...@lbm.net.au>.
From wikipedia:
     The Robot Exclusion Standard does not mention anything about the
"*" character in the Disallow: statement. Some crawlers like Googlebot
recognize strings containing "*", while MSNbot and Teoma interpret it in
different ways.

So the 'problem' is with Macy's. Really, there is no problem for you: 
presumably that line is just ignored in the robots.txt.

One might also question the crawl-delay setting of 120 seconds, but
that's another issue...



On 31/05/2014 12:16 AM, Nima Falaki wrote:
> Hello Everyone:
>
> Just have a question about an issue I discovered while trying to crawl the
> macys robots.txt, I am using nutch 1.8 and used crawler-commons 0.3 and
> crawler-commons 0.4. This is the robots.txt file from macys
>
> User-agent: *
> Crawl-delay: 120
> Disallow: /compare
> Disallow: /registry/wedding/compare
> Disallow: /catalog/product/zoom.jsp
> Disallow: /search
> Disallow: /shop/search
> Disallow: /shop/registry/wedding/search
> Disallow: *natuzzi*
> noindex: *natuzzi*
> Disallow: *Natuzzi*
> noindex: *Natuzzi*
> Disallow:  /bag/add*
>
>
> When I run this robots.txt through the RobotsRulesParser with this url
> (http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=)
> I get the following exceptions
>
> 2014-05-30 17:02:20,570 WARN  robots.SimpleRobotRulesParser
> (SimpleRobotRulesParser.java:reportWarning(456)) - 	Unknown line in
> robots.txt file (size 672): noindex: *natuzzi*
>
> 2014-05-30 17:02:20,571 WARN  robots.SimpleRobotRulesParser
> (SimpleRobotRulesParser.java:reportWarning(456)) - 	Unknown line in
> robots.txt file (size 672): noindex: *Natuzzi*
>
> 2014-05-30 17:02:20,574 WARN  robots.SimpleRobotRulesParser
> (SimpleRobotRulesParser.java:reportWarning(456)) - 	Unknown line in
> robots.txt file (size 672): noindex: *natuzzi*
>
> 2014-05-30 17:02:20,574 WARN  robots.SimpleRobotRulesParser
> (SimpleRobotRulesParser.java:reportWarning(456)) - 	Unknown line in
> robots.txt file (size 672): noindex: *Natuzzi*
>
> Is there anything I can do to solve this problem? Is this a problem
> with nutch or does macys.com have a really bad robots.txt file?
>
>
>
>
>   <http://www.popsugar.com>
> Nima Falaki
> Software Engineer
> nfalaki@popsugar.com
>