Posted to user@nutch.apache.org by Jorge Luis Betancourt González <jl...@uci.cu> on 2015/03/02 04:05:06 UTC

Re: [MASSMAIL]Re: [MASSMAIL]How to make Nutch 1.7 request mimic a browser?

The general answer is: it depends. Usually it is "polite" to identify your robot to the website so the webmaster knows what is accessing the site; this is why Google and a lot of other search engines (big and small) use a distinctive name for their crawlers/bots. That being said, the first site that you mention works fine in a quick parsechecker run that I've executed:

➜  local  bin/nutch parsechecker http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod
fetching: http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod
parsing: http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod
contentType: text/html
signature: 8e90c6d581f27c36828d433f746e4d7a
---------
Url
---------------

http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod
---------
ParseData
---------

Version: 5
Status: success(1,0)
Title: "Dressing for the Dark"
Outlinks: 151
  outlink: toUrl: http://www.neimanmarcus.com/cssbundle/1468949595/bundles/product_rwd.css anchor:
  outlink: toUrl: http://www.neimanmarcus.com/category/templates/css/r_rBrand.css anchor:
  outlink: toUrl: http://www.neimanmarcus.com/category/templates/css/r_rProduct.css anchor:
  outlink: toUrl: http://www.neimanmarcus.com/jsbundle/2144966094/bundles/general_rwd.js anchor:
...

(trimmed due to length)

As for the second one I wasn't able to run a test; the provided URL blocks access from my IP/country:

This request is blocked by the SonicWALL Gateway Geo IP Service.
Country Name:Cuba. 

Reading your experience with this website, it looks like an error in the website's programming. Basically, I'm assuming their logic says: if your User-Agent is not X, Y, or Z, then serve the mobile version. This could be worth reporting.

Trying to fool the website into thinking your bot is a regular user by tweaking the user agent could work for now, but it could draw the webmaster's attention and become a reason for blocking your access; this depends a lot on the webmaster :). But in your particular case it could be your only solution, if the webmaster doesn't have a problem with the increase in traffic.
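
To make the "polite" option concrete, this is roughly what it looks like in nutch-site.xml. The property names are the standard Nutch agent settings; the values below are only placeholders I'm making up for illustration:

    <!-- Placeholder values: identify your own crawler here -->
    <property>
        <name>http.agent.name</name>
        <value>MyCompanyCrawler</value>
        <description>A distinctive bot name rather than a browser string.</description>
    </property>
    <property>
        <name>http.agent.description</name>
        <value>Product crawler based on Apache Nutch 1.7</value>
    </property>
    <property>
        <name>http.agent.url</name>
        <value>http://example.com/crawler-info</value>
    </property>
    <property>
        <name>http.agent.email</name>
        <value>crawler-admin@example.com</value>
    </property>

This way the webmaster can see who is crawling and can contact you before deciding to block you. If you do go the impersonation route instead, http.agent.name is still the property to change, but check the actual User-Agent header that gets sent, since as far as I know Nutch assembles it from the http.agent.* values.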

Regards,

----- Original Message -----
From: "Meraj A. Khan" <me...@gmail.com>
To: user@nutch.apache.org
Sent: Saturday, February 28, 2015 12:09:47 AM
Subject: [MASSMAIL]Re: [MASSMAIL]How to make Nutch 1.7 request mimic a browser?

Hi Jorge,

Yes, I was exploring changing the http.agent.name property value in
cases where the sites either serve the mobile version or outright deny
the request if no agent is specified.

For example, the following URL will give a Request Rejected response if
the User-Agent is not specified.

http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod

And the following URL will serve a mobile version.

http://www.techforless.com/cgi-bin/tech4less/60PN5000.

So is it a good practice to set http.agent.name to something
like the following, to mimic a Chrome browser?

Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko)
Chrome/41.0.2228.0 Safari/537.36

On Fri, Feb 27, 2015 at 3:21 PM, Jorge Luis Betancourt González
<jl...@uci.cu> wrote:
> Hi Meraj,
>
> Can you provide an example URL and explain exactly what you're after? If the page you're trying to fetch has a lot of javascript/ajax, keep in mind that browsers do a lot of work with the downloaded page. When you visit a page, the HTML is downloaded, the referenced CSS files are also fetched and applied to the HTML (along with inline styles, etc.), and any referenced javascript is downloaded and executed on top of the loaded DOM (including inline script tags). The same applies to fonts, etc. The browser "knows" how to deal with all these resources, and the CSS is applied depending on which browser you're using. The Nutch crawler only knows about the downloaded HTML (similar to what you see when you view the source code of a webpage); it doesn't know what a CSS style is. Basically the crawler is only interested in the links and the textual/binary content of the webpage, so when a page is fetched by Nutch, the HTML is downloaded but the other resources (fonts, styles, javascript) are not applied to the fetched page.
>
> Tweaking the http.agent.name property in nutch-site.xml will only help with those sites that change their response based on the user agent (one version for mobile and a different one for desktop browsers). This approach is being replaced by responsive design, meaning that the user agent no longer determines how the page is rendered.
>
> In the current trunk of the upcoming 1.10 version a plugin has been merged that could address this: basically this plugin uses Selenium to render the page and then feeds Nutch the resulting HTML, meaning that ajax/javascript interactions will be present in the content that Nutch parses in the next stage.
>
> Also we need more information about your use case or what you're trying to accomplish.
>
> Hope it helps,
>
> Regards,
>
> ----- Original Message -----
> From: "Meraj A. Khan" <me...@gmail.com>
> To: user@nutch.apache.org
> Sent: Friday, February 27, 2015 12:47:06 AM
> Subject: [MASSMAIL]How to make Nutch 1.7 request mimic a browser?
>
> In some instances the content that is downloaded in the Fetch phase from
> an HTTP URL is not what you would get if you were to access the same URL
> from a well-known browser like Google Chrome, for example; that is
> because the server is expecting a user agent value that represents a
> browser.
>
> There is an http.agent.name property in nutch-site.xml; is it the
> property that should be used to set the user agent so that the server
> responds to a Nutch GET request the same way as it would to a request
> from a browser? Or is there another configurable property?
>
> For example, the user agent value for a Chrome browser is shown below.
>
> Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko)
> Chrome/41.0.2228.0 Safari/537.36
>
>
> Thanks.

Re: [MASSMAIL]Re: [MASSMAIL]How to make Nutch 1.7 request mimic a browser?

Posted by "Meraj A. Khan" <me...@gmail.com>.
Jorge,

I think I spoke too soon. If I use the protocol-httpclient plugin, I
am unable to fetch any page using the parsechecker.

I get a "[Fatal Error] :1:1: Content is not allowed in prolog." error.

Are there any known issues with using protocol-httpclient? I am using
Nutch 1.7 and I have the following settings in my nutch-site.xml:

    <!-- Added based on the suggestion from nutch mailing list -->
    <property>
        <name>plugin.includes</name>
        <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>


    <property>
        <name>http.useHttp11</name>
        <value>true</value>
        <description>NOTE: at the moment this works only for
            protocol-httpclient.
            If true, use HTTP 1.1, if false use HTTP 1.0 .
        </description>
    </property>
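
As I understand it, protocol-httpclient (like the default protocol-http) also refuses to fetch anything when http.agent.name is empty, so I am assuming a block along these lines is needed as well; the value here is just a placeholder:

    <property>
        <name>http.agent.name</name>
        <!-- Placeholder value; any non-empty, distinctive name should do. -->
        <value>MyTestCrawler</value>
    </property>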


Thanks.


Re: [MASSMAIL]Re: [MASSMAIL]How to make Nutch 1.7 request mimic a browser?

Posted by "Meraj A. Khan" <me...@gmail.com>.
Thanks Jorge, I appreciate your help.
