Posted to dev@nutch.apache.org by "Lewis John McGibbney (JIRA)" <ji...@apache.org> on 2012/05/09 16:03:49 UTC

[jira] [Created] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

Lewis John McGibbney created NUTCH-1360:
-------------------------------------------

             Summary: Suport the storing of IP address connected to when web crawling
                 Key: NUTCH-1360
                 URL: https://issues.apache.org/jira/browse/NUTCH-1360
             Project: Nutch
          Issue Type: New Feature
          Components: protocol
    Affects Versions: nutchgora, 1.5
            Reporter: Lewis John McGibbney
            Priority: Minor
             Fix For: nutchgora, 1.6


Simple issue enabling us to capture the specific IP address of the host we connect to when fetching a page.
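
As a rough illustration of the idea (this is not the code from any of the attached patches), the address actually connected to can be read back from the socket once the connection is made. The class and method names below are hypothetical; only `Socket.getInetAddress()` and `InetAddress.getHostAddress()` are standard JDK calls. The demo connects to a throwaway listener on 127.0.0.1 so that it runs without network access.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical helper, for illustration only; not part of Nutch.
final class ConnectedIp {

  // Returns the textual IP address of the remote end of a connected socket.
  static String of(Socket socket) {
    InetAddress remote = socket.getInetAddress();
    if (remote == null) {
      throw new IllegalStateException("socket is not connected");
    }
    return remote.getHostAddress();
  }

  // Offline demonstration: connect to a temporary listener on 127.0.0.1
  // and report the address that was actually connected to.
  static String demo() {
    try {
      InetAddress loopback = InetAddress.getByAddress(new byte[] {127, 0, 0, 1});
      try (ServerSocket server = new ServerSocket(0, 1, loopback);
           Socket client = new Socket(loopback, server.getLocalPort())) {
        return of(client);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

In a real fetcher the socket would be the one opened to the web server, and the returned string is what would be stored alongside the fetched page.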

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

Posted by "Lewis John McGibbney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13272234#comment-13272234 ] 

Lewis John McGibbney commented on NUTCH-1360:
---------------------------------------------

Since all protocol plugins try to imitate an HTTP scenario, I propose that the nutch-default.xml property live under the HTTP properties and be named http.store.ip.address.
                
> Suport the storing of IP address connected to when web crawling
> ---------------------------------------------------------------
>
>                 Key: NUTCH-1360
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1360
>             Project: Nutch
>          Issue Type: New Feature
>          Components: protocol
>    Affects Versions: nutchgora, 1.5
>            Reporter: Lewis John McGibbney
>            Priority: Minor
>             Fix For: nutchgora, 1.6
>
>         Attachments: NUTCH-1360-nutchgora.patch
>
>
> Simple issue enabling us to capture the specific IP address of the host which we connect to to fetch a page.


        

[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13411224#comment-13411224 ] 

Hudson commented on NUTCH-1360:
-------------------------------

Integrated in Nutch-trunk #1893 (See [https://builds.apache.org/job/Nutch-trunk/1893/])
    revert NUTCH-1360 (Revision 1359760)

     Result = SUCCESS
lewismc : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1359760
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/java/org/apache/nutch/metadata/HttpHeaders.java
* /nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
* /nutch/trunk/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java

                
> Suport the storing of IP address connected to when web crawling
> ---------------------------------------------------------------
>
>                 Key: NUTCH-1360
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1360
>             Project: Nutch
>          Issue Type: New Feature
>          Components: protocol
>    Affects Versions: nutchgora, 1.5
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Minor
>             Fix For: 1.6, 2.1
>
>         Attachments: NUTCH-1360-nutchgora-v2.patch, NUTCH-1360-nutchgora.patch, NUTCH-1360-trunk.patch
>
>
> Simple issue enabling us to capture the specific IP address of the host which we connect to to fetch a page.


        

[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

Posted by "Lewis John McGibbney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13280269#comment-13280269 ] 

Lewis John McGibbney commented on NUTCH-1360:
---------------------------------------------

Committed @revision 1341100 in nutchgora branch
                
> Suport the storing of IP address connected to when web crawling
> ---------------------------------------------------------------
>
>                 Key: NUTCH-1360
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1360
>             Project: Nutch
>          Issue Type: New Feature
>          Components: protocol
>    Affects Versions: nutchgora, 1.5
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Minor
>             Fix For: nutchgora, 1.6
>
>         Attachments: NUTCH-1360-nutchgora-v2.patch, NUTCH-1360-nutchgora.patch, NUTCH-1360-trunk.patch
>
>
> Simple issue enabling us to capture the specific IP address of the host which we connect to to fetch a page.


        

[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

Posted by "Ferdy Galema (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406495#comment-13406495 ] 

Ferdy Galema commented on NUTCH-1360:
-------------------------------------

Sorry for the late response, but this issue is not properly implemented (for both branch and trunk).

- IP is always stored instead of depending on property: headers.set("_ip",... should be done only if http.getIP_Header() is true.

- http.store.ip.address appends the _ip:<true or false> property to the request string? What is the purpose of that? If not intentional, we should simply revert this. On top of that, it uses the property with a default of "true", but it should be "false" if adding it to the request string is intentional.
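
A minimal sketch of the behaviour asked for in the first point, assuming the flag is read with a default of "false" and the IP is written only when the flag is on. The class, the plain `Map` used in place of the Nutch configuration, and the method names are all hypothetical stand-ins; only the property name and the `_ip` key come from this issue.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration; these are not the Nutch classes.
final class IpStoreSketch {

  // Property name proposed earlier in this issue.
  static final String STORE_IP_KEY = "http.store.ip.address";

  // Reads the flag with a default of false, so the IP is stored only on request.
  static boolean storeIpEnabled(Map<String, String> conf) {
    return Boolean.parseBoolean(conf.getOrDefault(STORE_IP_KEY, "false"));
  }

  // Copies the response metadata and adds "_ip" only when the flag is enabled.
  static Map<String, String> withOptionalIp(Map<String, String> conf,
                                            Map<String, String> metadata,
                                            String ip) {
    Map<String, String> result = new HashMap<>(metadata);
    if (storeIpEnabled(conf)) {
      result.put("_ip", ip);
    }
    return result;
  }
}
```

With an empty configuration the returned map has no `_ip` entry; setting `http.store.ip.address` to `true` adds it. Nothing is appended to the request itself.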

Thanks.

                
> Suport the storing of IP address connected to when web crawling
> ---------------------------------------------------------------
>
>                 Key: NUTCH-1360
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1360
>             Project: Nutch
>          Issue Type: New Feature
>          Components: protocol
>    Affects Versions: nutchgora, 1.5
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Minor
>             Fix For: nutchgora, 1.6
>
>         Attachments: NUTCH-1360-nutchgora-v2.patch, NUTCH-1360-nutchgora.patch, NUTCH-1360-trunk.patch
>
>
> Simple issue enabling us to capture the specific IP address of the host which we connect to to fetch a page.


        

[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

Posted by "Lewis John McGibbney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13410499#comment-13410499 ] 

Lewis John McGibbney commented on NUTCH-1360:
---------------------------------------------

reverted @revision 1359760 in trunk
reverted @revision 1359762 in 2.X branch

OK, so I understand exactly where I've gone wrong here and will work to improve this patch when I have more time. I now acknowledge that it shouldn't have made its way into the codebase(s) and can only thank you guys for pointing this out.

{bq} Guys, unless a change is trivial please do not commit it yourself unless it has been reviewed by a fellow committer {bq}

Point absolutely taken, Julien, thanks. I'll get around to this in the near future and attach a patch when it's ready. Also, thanks for the suggestions, Ferdy & Julien.


                
> Suport the storing of IP address connected to when web crawling
> ---------------------------------------------------------------
>
>                 Key: NUTCH-1360
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1360
>             Project: Nutch
>          Issue Type: New Feature
>          Components: protocol
>    Affects Versions: nutchgora, 1.5
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Minor
>             Fix For: 1.6, 2.1
>
>         Attachments: NUTCH-1360-nutchgora-v2.patch, NUTCH-1360-nutchgora.patch, NUTCH-1360-trunk.patch
>
>
> Simple issue enabling us to capture the specific IP address of the host which we connect to to fetch a page.


        

[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293333#comment-13293333 ] 

Hudson commented on NUTCH-1360:
-------------------------------

Integrated in Nutch-trunk #1868 (See [https://builds.apache.org/job/Nutch-trunk/1868/])
    commit to address NUTCH-1360 and update to CHANGES.txt (Revision 1348993)

     Result = SUCCESS
lewismc : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1348993
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/java/org/apache/nutch/metadata/HttpHeaders.java
* /nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
* /nutch/trunk/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java

                
> Suport the storing of IP address connected to when web crawling
> ---------------------------------------------------------------
>
>                 Key: NUTCH-1360
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1360
>             Project: Nutch
>          Issue Type: New Feature
>          Components: protocol
>    Affects Versions: nutchgora, 1.5
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Minor
>             Fix For: nutchgora, 1.6
>
>         Attachments: NUTCH-1360-nutchgora-v2.patch, NUTCH-1360-nutchgora.patch, NUTCH-1360-trunk.patch
>
>
> Simple issue enabling us to capture the specific IP address of the host which we connect to to fetch a page.


        

[jira] [Reopened] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

Posted by "Lewis John McGibbney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney reopened NUTCH-1360:
-----------------------------------------


Reopening to address Ferdy's concerns. I can't work on this tonight but will try my best to take a look at it ASAP. Thanks Ferdy
                
> Suport the storing of IP address connected to when web crawling
> ---------------------------------------------------------------
>
>                 Key: NUTCH-1360
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1360
>             Project: Nutch
>          Issue Type: New Feature
>          Components: protocol
>    Affects Versions: nutchgora, 1.5
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Minor
>             Fix For: nutchgora, 1.6
>
>         Attachments: NUTCH-1360-nutchgora-v2.patch, NUTCH-1360-nutchgora.patch, NUTCH-1360-trunk.patch
>
>
> Simple issue enabling us to capture the specific IP address of the host which we connect to to fetch a page.


        

[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293085#comment-13293085 ] 

Hudson commented on NUTCH-1360:
-------------------------------

Integrated in nutch-trunk-maven #307 (See [https://builds.apache.org/job/nutch-trunk-maven/307/])
    commit to address NUTCH-1360 and update to CHANGES.txt (Revision 1348993)

     Result = SUCCESS
lewismc : 
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/java/org/apache/nutch/metadata/HttpHeaders.java
* /nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
* /nutch/trunk/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java

                
> Suport the storing of IP address connected to when web crawling
> ---------------------------------------------------------------
>
>                 Key: NUTCH-1360
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1360
>             Project: Nutch
>          Issue Type: New Feature
>          Components: protocol
>    Affects Versions: nutchgora, 1.5
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Minor
>             Fix For: nutchgora, 1.6
>
>         Attachments: NUTCH-1360-nutchgora-v2.patch, NUTCH-1360-nutchgora.patch, NUTCH-1360-trunk.patch
>
>
> Simple issue enabling us to capture the specific IP address of the host which we connect to to fetch a page.


        

[jira] [Updated] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

Posted by "Lewis John McGibbney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1360:
----------------------------------------

    Attachment: NUTCH-1360-nutchgora.patch

This is very much a WIP for nutchgora. It works as expected right now; however, I think it would be nice to extend this to all protocol plugins and to make it configurable via nutch-site.xml, as we might not always wish to log the IP addresses of the hosts we connect to...
Does anyone have a preference on this one? It might not be as important for protocol-file, but for protocol-ftp and sftp it may well be required.
                
> Suport the storing of IP address connected to when web crawling
> ---------------------------------------------------------------
>
>                 Key: NUTCH-1360
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1360
>             Project: Nutch
>          Issue Type: New Feature
>          Components: protocol
>    Affects Versions: nutchgora, 1.5
>            Reporter: Lewis John McGibbney
>            Priority: Minor
>             Fix For: nutchgora, 1.6
>
>         Attachments: NUTCH-1360-nutchgora.patch
>
>
> Simple issue enabling us to capture the specific IP address of the host which we connect to to fetch a page.


        

[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409578#comment-13409578 ] 

Julien Nioche commented on NUTCH-1360:
--------------------------------------

Guys, unless a change is trivial please do not commit it yourself unless it has been reviewed by a fellow committer. We can always revert to previous versions but it is easier and faster to make sure that things work as they should in the first place.

Regarding where to store the value, I would put it in the content metadata.
                
> Suport the storing of IP address connected to when web crawling
> ---------------------------------------------------------------
>
>                 Key: NUTCH-1360
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1360
>             Project: Nutch
>          Issue Type: New Feature
>          Components: protocol
>    Affects Versions: nutchgora, 1.5
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Minor
>             Fix For: 1.6, 2.1
>
>         Attachments: NUTCH-1360-nutchgora-v2.patch, NUTCH-1360-nutchgora.patch, NUTCH-1360-trunk.patch
>
>
> Simple issue enabling us to capture the specific IP address of the host which we connect to to fetch a page.


        

[jira] [Updated] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-1360:
---------------------------------

    Fix Version/s:     (was: nutchgora)
                   2.1
    
> Suport the storing of IP address connected to when web crawling
> ---------------------------------------------------------------
>
>                 Key: NUTCH-1360
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1360
>             Project: Nutch
>          Issue Type: New Feature
>          Components: protocol
>    Affects Versions: nutchgora, 1.5
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Minor
>             Fix For: 1.6, 2.1
>
>         Attachments: NUTCH-1360-nutchgora-v2.patch, NUTCH-1360-nutchgora.patch, NUTCH-1360-trunk.patch
>
>
> Simple issue enabling us to capture the specific IP address of the host which we connect to to fetch a page.


        

[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

Posted by "Ferdy Galema (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13410884#comment-13410884 ] 

Ferdy Galema commented on NUTCH-1360:
-------------------------------------

Thanks! Keep up the good work!
                
> Suport the storing of IP address connected to when web crawling
> ---------------------------------------------------------------
>
>                 Key: NUTCH-1360
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1360
>             Project: Nutch
>          Issue Type: New Feature
>          Components: protocol
>    Affects Versions: nutchgora, 1.5
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Minor
>             Fix For: 1.6, 2.1
>
>         Attachments: NUTCH-1360-nutchgora-v2.patch, NUTCH-1360-nutchgora.patch, NUTCH-1360-trunk.patch
>
>
> Simple issue enabling us to capture the specific IP address of the host which we connect to to fetch a page.


        

[jira] [Updated] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

Posted by "Lewis John McGibbney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1360:
----------------------------------------

    Attachment: NUTCH-1360-trunk.patch

Patch for trunk.
The patch lets us obtain the IP address of the host as additional HTTP header information. For example:

{code}
Content Metadata: ETag="2b51-4bf03b1ee610f-gzip" Vary=Accept-Encoding Date=Thu, 17 May 2012 10:42:56 GMT Content-Length=3490 Content-Encoding=gzip Last-Modified=Wed, 02 May 2012 01:34:57 GMT Content-Type=text/html; charset=utf-8 Connection=close Accept-Ranges=bytes Server=Apache/2.4.1 (Unix) OpenSSL/1.0.0g _ip=192.87.106.229
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8 
{code}

This is simply some output from the ParserChecker
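
For anyone wanting to pull the value back out of a line like the one above, here is a small sketch. It assumes only the `_ip=<address>` form shown in the sample output; the class and method names are hypothetical, and the pattern is a loose match on IPv4/IPv6 characters rather than a validator.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Nutch or the ParserChecker.
final class IpFromMetadataLine {

  // Matches "_ip=" followed by characters that can appear in an IPv4 or IPv6 literal.
  private static final Pattern IP_FIELD = Pattern.compile("_ip=([0-9A-Fa-f:.]+)");

  // Returns the address found after "_ip=", or empty if the field is absent.
  static Optional<String> extract(String metadataLine) {
    Matcher m = IP_FIELD.matcher(metadataLine);
    return m.find() ? Optional.of(m.group(1)) : Optional.empty();
  }
}
```

Calling `extract` on the Content Metadata line gives the stored address; a line without an `_ip` field gives an empty `Optional`.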
                
> Suport the storing of IP address connected to when web crawling
> ---------------------------------------------------------------
>
>                 Key: NUTCH-1360
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1360
>             Project: Nutch
>          Issue Type: New Feature
>          Components: protocol
>    Affects Versions: nutchgora, 1.5
>            Reporter: Lewis John McGibbney
>            Priority: Minor
>             Fix For: nutchgora, 1.6
>
>         Attachments: NUTCH-1360-nutchgora.patch, NUTCH-1360-trunk.patch
>
>
> Simple issue enabling us to capture the specific IP address of the host which we connect to to fetch a page.


        

[jira] [Updated] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

Posted by "Lewis John McGibbney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1360:
----------------------------------------

    Attachment: NUTCH-1360-nutchgora-v2.patch

patch for Nutchgora branch.
                
> Suport the storing of IP address connected to when web crawling
> ---------------------------------------------------------------
>
>                 Key: NUTCH-1360
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1360
>             Project: Nutch
>          Issue Type: New Feature
>          Components: protocol
>    Affects Versions: nutchgora, 1.5
>            Reporter: Lewis John McGibbney
>            Priority: Minor
>             Fix For: nutchgora, 1.6
>
>         Attachments: NUTCH-1360-nutchgora-v2.patch, NUTCH-1360-nutchgora.patch, NUTCH-1360-trunk.patch
>
>
> Simple issue enabling us to capture the specific IP address of the host which we connect to to fetch a page.


        

[jira] [Updated] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

Posted by "Lewis John McGibbney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1360:
----------------------------------------

    Fix Version/s:     (was: 2.1)
                   2.2
    
> Suport the storing of IP address connected to when web crawling
> ---------------------------------------------------------------
>
>                 Key: NUTCH-1360
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1360
>             Project: Nutch
>          Issue Type: New Feature
>          Components: protocol
>    Affects Versions: nutchgora, 1.5
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Minor
>             Fix For: 1.6, 2.2
>
>         Attachments: NUTCH-1360-nutchgora.patch, NUTCH-1360-nutchgora-v2.patch, NUTCH-1360-trunk.patch
>
>
> Simple issue enabling us to capture the specific IP address of the host which we connect to to fetch a page.


[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

Posted by "Ferdy Galema (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406561#comment-13406561 ] 

Ferdy Galema commented on NUTCH-1360:
-------------------------------------

Just one more thing:
Should the IP not be stored in the metadata instead of the headers field? It is technically not a response header. As far as I know currently the headers container is only used for the headers returned by the http server. (But correct me if I'm wrong)
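
To make the distinction concrete, a sketch that keeps the two containers apart: one holds only what the HTTP server returned, the other holds crawler-side facts such as the connected IP. The class is a hypothetical stand-in, not the Nutch HttpResponse or Metadata classes; the `_ip` key is the one used in the patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical response holder for illustration; not a Nutch class.
final class FetchResultSketch {

  // Headers exactly as returned by the HTTP server.
  final Map<String, String> headers = new LinkedHashMap<>();

  // Crawler-side facts about the fetch that the server never sent.
  final Map<String, String> metadata = new LinkedHashMap<>();

  // Records a header received from the server.
  void addServerHeader(String name, String value) {
    headers.put(name, value);
  }

  // Records the address we connected to, outside the headers container.
  void recordConnectedIp(String ip) {
    metadata.put("_ip", ip);
  }
}
```

Anything reading `headers` then sees only genuine response headers, while the IP stays available under `metadata`.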
                
> Suport the storing of IP address connected to when web crawling
> ---------------------------------------------------------------
>
>                 Key: NUTCH-1360
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1360
>             Project: Nutch
>          Issue Type: New Feature
>          Components: protocol
>    Affects Versions: nutchgora, 1.5
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Minor
>             Fix For: nutchgora, 1.6
>
>         Attachments: NUTCH-1360-nutchgora-v2.patch, NUTCH-1360-nutchgora.patch, NUTCH-1360-trunk.patch
>
>
> Simple issue enabling us to capture the specific IP address of the host which we connect to to fetch a page.


        

[jira] [Assigned] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

Posted by "Lewis John McGibbney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney reassigned NUTCH-1360:
-------------------------------------------

    Assignee: Lewis John McGibbney
    
> Suport the storing of IP address connected to when web crawling
> ---------------------------------------------------------------
>
>                 Key: NUTCH-1360
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1360
>             Project: Nutch
>          Issue Type: New Feature
>          Components: protocol
>    Affects Versions: nutchgora, 1.5
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Minor
>             Fix For: nutchgora, 1.6
>
>         Attachments: NUTCH-1360-nutchgora-v2.patch, NUTCH-1360-nutchgora.patch, NUTCH-1360-trunk.patch
>
>
> Simple issue enabling us to capture the specific IP address of the host which we connect to to fetch a page.


        

[jira] [Comment Edited] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

Posted by "Lewis John McGibbney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277718#comment-13277718 ] 

Lewis John McGibbney edited comment on NUTCH-1360 at 5/17/12 10:51 AM:
-----------------------------------------------------------------------

Patch for trunk.
The patch lets us obtain the IP address of the host as additional HTTP header information. For example:

{code}
Content Metadata: ETag="2b51-4bf03b1ee610f-gzip" Vary=Accept-Encoding Date=Thu, 17 May 2012 10:42:56 GMT Content-Length=3490 Content-Encoding=gzip Last-Modified=Wed, 02 May 2012 01:34:57 GMT Content-Type=text/html; charset=utf-8 Connection=close Accept-Ranges=bytes Server=Apache/2.4.1 (Unix) OpenSSL/1.0.0g _ip=192.87.106.229
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8 
{code}

This is simply some output from the ParserChecker
                
                  
> Suport the storing of IP address connected to when web crawling
> ---------------------------------------------------------------
>
>                 Key: NUTCH-1360
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1360
>             Project: Nutch
>          Issue Type: New Feature
>          Components: protocol
>    Affects Versions: nutchgora, 1.5
>            Reporter: Lewis John McGibbney
>            Priority: Minor
>             Fix For: nutchgora, 1.6
>
>         Attachments: NUTCH-1360-nutchgora.patch, NUTCH-1360-trunk.patch
>
>
> Simple issue enabling us to capture the specific IP address of the host which we connect to to fetch a page.


[jira] [Comment Edited] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

Posted by "Ferdy Galema (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406495#comment-13406495 ] 

Ferdy Galema edited comment on NUTCH-1360 at 7/4/12 1:41 PM:
-------------------------------------------------------------

Sorry for the late response, but this issue is not properly implemented (for both branch and trunk).

- The IP is always stored, regardless of the property: headers.set("_ip",... should be called only if http.getIP_Header() returns true.

- http.store.ip.address results in an _ip:<true or false> entry being appended to the request string. What is the purpose of that? If it is not intentional, we should simply revert it. If it is intentional, the code still rereads the property with a default of "true", whereas the default should be "false" (or the code should just use http.getIP_Header()).
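
The behaviour asked for in the first point can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual Nutch code: java.util.Properties stands in for the Hadoop Configuration, a plain Map stands in for the response headers, and the class and method names are hypothetical. Only the property name http.store.ip.address and the _ip key come from this issue.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

/** Hypothetical illustration of the review comment; not the actual Nutch code. */
final class StoreIpFlagSketch {

  static final String STORE_IP_PROPERTY = "http.store.ip.address"; // from the issue
  static final String IP_KEY = "_ip";                              // from the issue

  /** Reads the flag, defaulting to false so that storing the IP is opt-in. */
  static boolean storeIpEnabled(Properties conf) {
    return Boolean.parseBoolean(conf.getProperty(STORE_IP_PROPERTY, "false"));
  }

  /** Adds the IP to the response headers only when the flag is enabled. */
  static Map<String, String> responseHeaders(Properties conf, String connectedIp) {
    Map<String, String> headers = new LinkedHashMap<>();
    headers.put("Connection", "close");
    if (storeIpEnabled(conf)) {
      headers.put(IP_KEY, connectedIp);
    }
    return headers;
  }

  public static void main(String[] args) {
    Properties off = new Properties(); // property unset: default applies
    Properties on = new Properties();
    on.setProperty(STORE_IP_PROPERTY, "true");
    System.out.println(responseHeaders(off, "192.87.106.229"));
    System.out.println(responseHeaders(on, "192.87.106.229"));
  }
}
```

Reading the flag in one place with a single default avoids the situation described in the second point, where a second read with a different default disagrees with the first.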

Thanks.
                
      was (Author: ferdy.g):
    Sorry for the late response, but this issue is not properly implemented (for both branch and trunk).

- The IP is always stored, regardless of the property: headers.set("_ip",... should be called only if http.getIP_Header() returns true.

- http.store.ip.address results in an _ip:<true or false> entry being appended to the request string. What is the purpose of that? If it is not intentional, we should simply revert it. If it is intentional, the code still uses the property with a default of "true", whereas the default should be "false".

Thanks.

                  
> Suport the storing of IP address connected to when web crawling
> ---------------------------------------------------------------
>
>                 Key: NUTCH-1360
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1360
>             Project: Nutch
>          Issue Type: New Feature
>          Components: protocol
>    Affects Versions: nutchgora, 1.5
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Minor
>             Fix For: nutchgora, 1.6
>
>         Attachments: NUTCH-1360-nutchgora-v2.patch, NUTCH-1360-nutchgora.patch, NUTCH-1360-trunk.patch
>
>
> Simple issue enabling us to capture the specific IP address of the host which we connect to to fetch a page.


[jira] [Resolved] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

Posted by "Lewis John McGibbney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney resolved NUTCH-1360.
-----------------------------------------

    Resolution: Fixed

Committed @revision 1348993 in trunk as well.
                
> Suport the storing of IP address connected to when web crawling
> ---------------------------------------------------------------
>
>                 Key: NUTCH-1360
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1360
>             Project: Nutch
>          Issue Type: New Feature
>          Components: protocol
>    Affects Versions: nutchgora, 1.5
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Minor
>             Fix For: nutchgora, 1.6
>
>         Attachments: NUTCH-1360-nutchgora-v2.patch, NUTCH-1360-nutchgora.patch, NUTCH-1360-trunk.patch
>
>
> Simple issue enabling us to capture the specific IP address of the host which we connect to to fetch a page.


[jira] [Updated] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

Posted by "Lewis John McGibbney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1360:
----------------------------------------

    Patch Info: Patch Available
    
> Suport the storing of IP address connected to when web crawling
> ---------------------------------------------------------------
>
>                 Key: NUTCH-1360
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1360
>             Project: Nutch
>          Issue Type: New Feature
>          Components: protocol
>    Affects Versions: nutchgora, 1.5
>            Reporter: Lewis John McGibbney
>            Priority: Minor
>             Fix For: nutchgora, 1.6
>
>         Attachments: NUTCH-1360-nutchgora-v2.patch, NUTCH-1360-nutchgora.patch, NUTCH-1360-trunk.patch
>
>
> Simple issue enabling us to capture the specific IP address of the host which we connect to to fetch a page.
