Posted to user@nutch.apache.org by jyoti aditya <jy...@gmail.com> on 2016/11/29 04:07:16 UTC

Impolite crawling using NUTCH

Hi team,

Can we use NUTCH to do impolite crawling?
Or is there any way by which we can disobey robots.txt?


With Regards
Jyoti Aditya

Re: Impolite crawling using NUTCH

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Jyoti,

in this case, the answer is simple: the robots.txt whitelisting
was never ported from 1.x to 2.x ;(

Best,
Sebastian
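For reference, the Nutch 1.x whitelist that Sebastian mentions is a nutch-site.xml property along these lines (property name as described on the WhiteListRobots wiki page; treat this as a sketch and verify against the nutch-default.xml of your release):

```xml
<!-- conf/nutch-site.xml — Nutch 1.x only; per the note above, never ported to 2.x -->
<property>
  <name>http.robot.rules.whitelist</name>
  <!-- Comma-separated hostnames/IPs for which robots.txt rules are ignored.
       Only whitelist servers you control or have explicit permission to crawl. -->
  <value>example.com,127.0.0.1</value>
</property>
```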


On 12/07/2016 12:44 PM, jyoti aditya wrote:
> Hi Chris/Team,
> 
> I am using Nutch 2.3.1 with MongoDB configured.
> Site - flipart.com
> 
> Even though I have added the whitelist property in my nutch-site.xml,
> I am not able to crawl.
> 
> Please find attached log.
> Please help me to fix this issue.
> 
> With Regards,
> Jyoti Aditya
> 
> -- 
> With Regards
> Jyoti Aditya


Re: Impolite crawling using NUTCH

Posted by jyoti aditya <jy...@gmail.com>.
Hi Chris/Team,

I am using Nutch 2.3.1 with MongoDB configured.
Site - flipart.com

Even though I have added the whitelist property in my nutch-site.xml,
I am not able to crawl.

Please find attached log.
Please help me to fix this issue.

With Regards,
Jyoti Aditya




-- 
With Regards
Jyoti Aditya
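Whether a fetch is being refused because of robots rules can be checked outside Nutch. A minimal sketch using Python's standard urllib.robotparser (sample robots.txt rules are inlined here as an assumption, not the real site's file; Nutch itself uses the crawler-commons parser, but the allow/disallow semantics are the same):

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content standing in for the real site's file.
rules = """User-agent: *
Disallow: /search
Allow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A polite crawler (which Nutch is by default) refuses disallowed paths.
print(rp.can_fetch("MyCrawler", "http://example.com/search?q=x"))  # False
print(rp.can_fetch("MyCrawler", "http://example.com/products"))    # True
```

If a URL the crawl needs is disallowed here, the whitelist (or the site owner's cooperation) is the only polite way around it.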

Re: Impolite crawling using NUTCH

Posted by "Mattmann, Chris A (3010)" <ch...@jpl.nasa.gov>.
Fixing dev@nutch list address

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Principal Data Scientist, Engineering Administrative Office (3010)
Manager, Open Source Projects Formulation and Development Office (8212)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 180-503E, Mailstop: 180-503
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 

On 12/5/16, 9:32 PM, "Mattmann, Chris A (3010)" <ch...@jpl.nasa.gov> wrote:

    Hi Jyoti,
    
    I need a lot more detail than “it didn’t work”. What didn’t work about it? Do you have a log
    file? What site were you trying to crawl? What command did you use? Where is your nutch
    config? Were you running in distributed or local mode?
    
    Onto Selenium – have you tried it, or are you simply reading the docs and think they're old? What have
    you done? What have you tried?
    
    I need a LOT more detail before I (and I’m guessing anyone else on these lists) can help.
    
    Cheers,
    Chris
    
    
    ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    Chris Mattmann, Ph.D.
    Principal Data Scientist, Engineering Administrative Office (3010)
    Manager, Open Source Projects Formulation and Development Office (8212)
    NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
    Office: 180-503E, Mailstop: 180-503
    Email: chris.a.mattmann@nasa.gov
    WWW:  http://sunset.usc.edu/~mattmann/
    ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    Director, Information Retrieval and Data Science Group (IRDS)
    Adjunct Associate Professor, Computer Science Department
    University of Southern California, Los Angeles, CA 90089 USA
    WWW: http://irds.usc.edu/
    ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    
    
    From: jyoti aditya <jy...@gmail.com>
    Date: Monday, December 5, 2016 at 9:29 PM
    To: "Mattmann, Chris A (3010)" <ch...@jpl.nasa.gov>
    Cc: "user@nutch.apache.org" <us...@nutch.apache.org>, "dev@nutch.apatche.org" <de...@nutch.apatche.org>
    Subject: Re: Impolite crawling using NUTCH
    
    Hi Chris/Team,
    
    Whitelisting the domain name didn't work.
    And when I was trying to configure Selenium, it needed a headless browser to be integrated with.
    The documentation for the protocol-selenium plugin looks old; Firefox 11 is no longer supported as a headless browser with Selenium.
    So please help me with the Selenium plugin configuration.
    
    I am not yet sure what result the above configuration will give me.
    
    With Regards,
    Jyoti Aditya
    
    On Tue, Dec 6, 2016 at 12:00 AM, Mattmann, Chris A (3010) <ch...@jpl.nasa.gov> wrote:
    Hi Jyoti,
    
    Again, please keep dev@nutch.a.o CC’ed, and also you may consider looking at this page:
    
    https://wiki.apache.org/nutch/AdvancedAjaxInteraction
    
    Cheers,
    Chris
    
    
    ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    Chris Mattmann, Ph.D.
    Principal Data Scientist, Engineering Administrative Office (3010)
    Manager, Open Source Projects Formulation and Development Office (8212)
    NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
    Office: 180-503E, Mailstop: 180-503
    Email: chris.a.mattmann@nasa.gov
    WWW:  http://sunset.usc.edu/~mattmann/
    ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    Director, Information Retrieval and Data Science Group (IRDS)
    Adjunct Associate Professor, Computer Science Department
    University of Southern California, Los Angeles, CA 90089 USA
    WWW: http://irds.usc.edu/
    ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    
    
    From: jyoti aditya <jy...@gmail.com>
    Date: Monday, December 5, 2016 at 1:42 AM
    To: Chris Mattmann <ma...@apache.org>
    
    Subject: Re: Impolite crawling using NUTCH
    
    Hi Chris,
    
    The whitelist didn't work.
    And I was trying to configure Selenium with Nutch.
    But I am not sure what result doing so will give.
    And also, it looks very clumsy to configure Selenium with Firefox.
    
    Regards,
    Jyoti Aditya
    
    On Fri, Dec 2, 2016 at 8:43 PM, Chris Mattmann <ma...@apache.org> wrote:
    Hmm, I’m a little confused here. You were first trying to use white list robots.txt, and now
    you are talking about Selenium.
    
    
    1. Did the whitelist work?
    
    2. Are you now asking how to use Nutch and Selenium?
    
    Cheers,
    Chris
    
    
    
    From: jyoti aditya <jy...@gmail.com>>
    Date: Thursday, December 1, 2016 at 10:26 PM
    To: "Mattmann, Chris A (3010)" <ch...@jpl.nasa.gov>
    Subject: Re: Impolite crawling using NUTCH
    
    Hi Chris,
    
    Thanks for the response.
    I added the changes as you mentioned above.
    
    But I am still not able to get all the content from a webpage.
    Can you please tell me whether I need to add a Selenium plugin to crawl
    dynamic content available on a web page?
    
    I am concerned that these kinds of wiki pages are not directly accessible.
    There is no way we can reach these kinds of useful pages.
    So please do the needful regarding this.
    
    
    With Regards,
    Jyoti Aditya
    
    On Tue, Nov 29, 2016 at 7:29 PM, Mattmann, Chris A (3010) <ch...@jpl.nasa.gov> wrote:
    There is a robots.txt whitelist. You can find documentation here:
    
    https://wiki.apache.org/nutch/WhiteListRobots
    
    ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    Chris Mattmann, Ph.D.
    Principal Data Scientist, Engineering Administrative Office (3010)
    Manager, Open Source Projects Formulation and Development Office (8212)
    NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
    Office: 180-503E, Mailstop: 180-503
    Email: chris.a.mattmann@nasa.gov
    WWW:  http://sunset.usc.edu/~mattmann/
    ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    Director, Information Retrieval and Data Science Group (IRDS)
    Adjunct Associate Professor, Computer Science Department
    University of Southern California, Los Angeles, CA 90089 USA
    WWW: http://irds.usc.edu/
    ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    
    
    On 11/29/16, 8:57 AM, "Tom Chiverton" <tc...@extravision.com> wrote:
    
        Sure, you can remove the check from the code and recompile.
    
        Under what circumstances would you need to ignore robots.txt ? Would
        something like allowing access by particular IP or user agents be an
        alternative ?
    
        Tom
    
    
        On 29/11/16 04:07, jyoti aditya wrote:
        > Hi team,
        >
        > Can we use NUTCH to do impolite crawling?
        > Or is there any way by which we can disobey robots.txt?
        >
        >
        > With Regards
        > Jyoti Aditya
        >
        >
        > ______________________________________________________________________
        > This email has been scanned by the Symantec Email Security.cloud service.
        > For more information please visit http://www.symanteccloud.com
        > ______________________________________________________________________
    
    
    
    --
    With Regards
    Jyoti Aditya
    
    
    
    --
    With Regards
    Jyoti Aditya
    
    
    
    --
    With Regards
    Jyoti Aditya
    



Re: Impolite crawling using NUTCH

Posted by "Mattmann, Chris A (3010)" <ch...@jpl.nasa.gov>.
Hi Jyoti,

I need a lot more detail than “it didn’t work”. What didn’t work about it? Do you have a log
file? What site were you trying to crawl? What command did you use? Where is your nutch
config? Were you running in distributed or local mode?

As for Selenium: have you actually tried it, or have you simply read the docs and concluded it's old? What have
you done? What have you tried?

I need a LOT more detail before I (and I’m guessing anyone else on these lists) can help.

Cheers,
Chris


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Principal Data Scientist, Engineering Administrative Office (3010)
Manager, Open Source Projects Formulation and Development Office (8212)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 180-503E, Mailstop: 180-503
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++



Re: Impolite crawling using NUTCH

Posted by jyoti aditya <jy...@gmail.com>.
Hi Chris/Team,

Whitelisting the domain name didn't work.
When I tried to configure Selenium, it needed a headless browser to
integrate with. The documentation for the protocol-selenium plugin looks
outdated: Firefox 11 is no longer supported as a headless browser with
Selenium. So please help me with the Selenium plugin configuration.

I am also still not sure what result it will fetch me after configuring the
above.

With Regards,
Jyoti Aditya
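For what it's worth, enabling the plugin usually comes down to adding it to plugin.includes and picking a driver in conf/nutch-site.xml. The sketch below is an assumption-laden illustration only: the property names and values follow the Nutch wiki for protocol-selenium (a 1.x plugin), so verify them against your Nutch version before relying on it.

```xml
<!-- conf/nutch-site.xml (sketch, unverified): swap protocol-http
     for protocol-selenium so pages are fetched through a browser. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-selenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
  <!-- Which WebDriver to use; "firefox" assumes a local Firefox
       and a matching driver binary are installed. -->
  <name>selenium.driver</name>
  <value>firefox</value>
</property>
```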




-- 
With Regards
Jyoti Aditya

Re: Impolite crawling using NUTCH

Posted by "Mattmann, Chris A (3010)" <ch...@jpl.nasa.gov>.
Hi Jyoti,

Again, please keep dev@nutch.a.o CC’ed, and also you may consider looking at this page:

https://wiki.apache.org/nutch/AdvancedAjaxInteraction

Cheers,
Chris


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Principal Data Scientist, Engineering Administrative Office (3010)
Manager, Open Source Projects Formulation and Development Office (8212)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 180-503E, Mailstop: 180-503
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


From: jyoti aditya <jy...@gmail.com>
Date: Monday, December 5, 2016 at 1:42 AM
To: Chris Mattmann <ma...@apache.org>
Subject: Re: Impolite crawling using NUTCH

Hi Chris,

The whitelist didn't work.
I was trying to configure Selenium with Nutch,
but I am not sure what result that will produce.
Also, configuring Selenium with Firefox looks very clumsy.

Regards,
Jyoti Aditya


Re: Impolite crawling using NUTCH

Posted by Chris Mattmann <ma...@apache.org>.
Hmm, I’m a little confused here. You were first trying to use the robots.txt
whitelist, and now you are talking about Selenium.

1. Did the whitelist work?

2. Are you now asking how to use Nutch and Selenium?

Cheers,

Chris

From: jyoti aditya <jy...@gmail.com>
Date: Thursday, December 1, 2016 at 10:26 PM
To: "Mattmann, Chris A (3010)" <ch...@jpl.nasa.gov>
Subject: Re: Impolite crawling using NUTCH

 

Hi Chris, 

 

Thanks for the response.

I added the changes as you mentioned above.

 

But I am still not able to get all the content from the webpage.
Can you please tell me whether I need to add a Selenium plugin to crawl
dynamic content on the page?

My concern is that these kinds of wiki pages are not directly accessible;
there is no way to reach such useful pages. Please advise.

 

 

With Regards,

Jyoti Aditya

 



Re: Impolite crawling using NUTCH

Posted by "Mattmann, Chris A (3010)" <ch...@jpl.nasa.gov>.
There is a robots.txt whitelist. You can find documentation here:

https://wiki.apache.org/nutch/WhiteListRobots 
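For reference, the whitelist is enabled through a property in conf/nutch-site.xml. A minimal sketch, with the property name taken from the wiki page above and placeholder hosts; verify against your Nutch version:

```xml
<!-- conf/nutch-site.xml: exempt specific hosts from robots.txt rules.
     Use with care, and only on sites you own or have permission to crawl. -->
<property>
  <name>http.robot.rules.whitelist</name>
  <!-- Comma-separated list of hostnames or IP addresses (placeholders) -->
  <value>example.com,127.0.0.1</value>
</property>
```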

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Principal Data Scientist, Engineering Administrative Office (3010)
Manager, Open Source Projects Formulation and Development Office (8212)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 180-503E, Mailstop: 180-503
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 


Re: Impolite crawling using NUTCH

Posted by Tom Chiverton <tc...@extravision.com>.
Sure, you can remove the check from the code and recompile.

Under what circumstances would you need to ignore robots.txt? Would 
something like allowing access by particular IPs or user agents be an 
alternative?

Tom
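To make the check concrete: the gist of what a polite fetcher does, and what a whitelist bypasses, can be sketched with Python's stdlib robotparser. This is an illustration only, not Nutch's actual code path (Nutch does this in Java inside its protocol plugins); the hostnames, rules, and agent name are made up.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# A made-up robots.txt that forbids /private/ for all agents.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Hosts exempted from robots rules -- the "whitelist" idea (hypothetical).
WHITELIST = {"trusted.example.com"}

def allowed(url, agent="MyCrawler"):
    """Return True if the crawler may fetch url."""
    if urlparse(url).hostname in WHITELIST:
        return True  # whitelisted host: skip the robots.txt check
    return rp.can_fetch(agent, url)

print(allowed("http://example.com/private/a.html"))          # False: disallowed
print(allowed("http://example.com/public/a.html"))           # True
print(allowed("http://trusted.example.com/private/a.html"))  # True: whitelisted
```

Removing the check entirely, as asked above, amounts to making allowed() return True unconditionally, which is exactly why it is usually the wrong tool compared with a narrow whitelist.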


On 29/11/16 04:07, jyoti aditya wrote:
> Hi team,
>
> Can we use NUTCH to do impolite crawling?
> Or is there any way by which we can disobey robots.txt?
>
>
> With Regards
> Jyoti Aditya
>
>
> ______________________________________________________________________
> This email has been scanned by the Symantec Email Security.cloud service.
> For more information please visit http://www.symanteccloud.com
> ______________________________________________________________________