Posted to dev@nutch.apache.org by Shuo Li <sl...@usc.edu> on 2015/02/13 19:12:21 UTC

Vagrant Crushed When using Nutch-Selenium

Hey guys,

I'm trying to use Nutch-Selenium to crawl nutch.apache.org. However, my
Vagrant VM seems to have crashed after a few minutes. I forced it to shut
down, and it turns out it had crawled only 59 pages. My Nutch version is
1.10 and my OS is Ubuntu Trusty, 14.04.

Is there anything I can provide to you guys? Has anybody else had the
same issue? Or is 59 pages the complete crawl?

Any suggestions would be appreciated.

Regards,
Shuo Li

Re: Vagrant Crushed When using Nutch-Selenium

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Going to implement more configuration in the plugin, but
based on the student emails I think your advice helped :)

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++








Re: Vagrant Crushed When using Nutch-Selenium

Posted by Mo Omer <be...@gmail.com>.
No problem! How'd it work out?

Mo

This message was drafted on a tiny touch screen; please forgive brevity & tpyos

> On Feb 22, 2015, at 6:19 PM, "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov> wrote:
> 
> Thanks Mo, great advice.
> 
> -----Original Message-----
> From: Jiaxin Ye <ji...@usc.edu>
> Reply-To: "dev@nutch.apache.org" <de...@nutch.apache.org>
> Date: Tuesday, February 17, 2015 at 2:49 PM
> To: Mohammed Omer <be...@gmail.com>
> Cc: "dev@nutch.apache.org" <de...@nutch.apache.org>
> Subject: Re: Vagrant Crushed When using Nutch-Selenium
> 
>> 
>> 
>> 
>> Thank you so much!! I am going to try it out tonight.
>> 
>> On Tuesday, February 17, 2015, Mohammed Omer <be...@gmail.com>
>> wrote:
>> 
>> Jiaxin, 
>> 
>> 
>> Each page takes about 3 seconds to crawl due to this piece of code - we
>> allow selenium 3 seconds to grab the page [0]. Due to what I was
>> crawling, I didn't want to wait for a specific element/class/id to show
>> up. However, you can change it up if you want.
>> Selenium documentation [1] has more info on Ex/Implicit waiting.
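To make the fixed-wait versus explicit-wait distinction above concrete, here is a minimal sketch in plain Java. It is not the plugin's code and does not use Selenium: ExplicitWaitSketch and waitUntil are hypothetical names, and the "rendered" condition is only a stand-in for "the element I care about is present". In real Selenium code, WebDriverWait with an expected condition plays this role.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

public class ExplicitWaitSketch {

    /**
     * Polls the condition until it holds or the timeout elapses.
     * Returns true if the condition was met, false on timeout or interruption.
     */
    public static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
        long startNs = System.nanoTime();
        do {
            if (condition.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        } while (System.nanoTime() - startNs < timeout.toNanos());
        // One last check once the timeout has elapsed.
        return condition.getAsBoolean();
    }

    public static void main(String[] args) {
        final long startNs = System.nanoTime();
        // Hypothetical condition: becomes true about 200 ms after start.
        BooleanSupplier rendered =
                () -> System.nanoTime() - startNs > Duration.ofMillis(200).toNanos();
        boolean ok = waitUntil(rendered, Duration.ofSeconds(3), Duration.ofMillis(50));
        System.out.println("condition met: " + ok);
    }
}
```

With a fixed wait every page costs the full 3 seconds; with a condition-based wait the 3 seconds become an upper bound, and pages that render quickly return sooner.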
>> 
>> 
>> Again, it's not the most efficient way to crawl; but, if you need JS to
>> render, it's a backwards way that ensures it happens. Selenium Grid has
>> the benefit of being able to handle more throughput, but at the end of
>> the day we're waiting for a browser to
>> go out and fetch the url.
>> 
>> 
>> I've suggested that most items be configurable when merged into trunk
>> [2], but I'll make a specific call-out to the wait time.
>> 
>> 
>> Due to the way Selenium standalone works, it's wayyyyyy less efficient
>> than a 'Grid' set-up (hub + nodes) [3], which is why I switched to that
>> set-up. 
>> 
>> 
>> Wish I could help out more, but 30 threads might be too much. 5 threads,
>> at a total fetch/parse time of 4 seconds per url, would still
>> theoretically churn out > 100k urls per day. There are multiple tweaks
>> that could be made to optimize for your system,
>> I'd start with reducing thread count, as you might be saturating your
>> system [4].
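The throughput figure above follows from simple arithmetic: each thread finishes one URL every 4 seconds, so 5 threads complete 5 * 86,400 / 4 = 108,000 URLs per day. A small sketch of that estimate, assuming the threads run fully in parallel with no politeness delays, retries, or failures (CrawlThroughput and urlsPerDay are hypothetical names, not part of Nutch):

```java
public class CrawlThroughput {

    private static final long SECONDS_PER_DAY = 24L * 60 * 60;

    /**
     * Theoretical upper bound on URLs fetched per day, assuming every
     * thread is always busy and each URL takes secondsPerUrl in total.
     */
    public static long urlsPerDay(int threads, double secondsPerUrl) {
        if (threads <= 0 || secondsPerUrl <= 0) {
            throw new IllegalArgumentException("threads and secondsPerUrl must be positive");
        }
        return (long) Math.floor(threads * SECONDS_PER_DAY / secondsPerUrl);
    }

    public static void main(String[] args) {
        // 5 threads at 4 seconds per URL, as in the message above.
        System.out.println(urlsPerDay(5, 4.0)); // prints 108000
    }
}
```

This is only an upper bound; a machine that is saturated by too many browser instances will fall well short of it, which is the reason for starting with a lower thread count.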
>> 
>> 
>> Sorry I can't be of more help!
>> 
>> 
>> Thank you,
>> 
>> 
>> Mo
>> 
>> 
>> [0]: https://github.com/momer/nutch-selenium/blob/master/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java#L48-L49
>> [1]: http://docs.seleniumhq.org/docs/04_webdriver_advanced.jsp
>> [2]: https://issues.apache.org/jira/browse/NUTCH-1933
>> [3]: https://code.google.com/p/selenium/wiki/Grid2
>> [4]: http://stackoverflow.com/a/4895271
>> 
>> 
>> On Mon, Feb 16, 2015 at 2:13 AM, Jiaxin Ye <jiaxinye@usc.edu> wrote:
>> 
>> I am using fetcher.threads.per.queue = 30 by the way.
>> 
>> On Mon, Feb 16, 2015 at 12:08 AM, Jiaxin Ye <jiaxinye@usc.edu> wrote:
>> 
>> Hi Mo,
>> 
>> 
>> I have a question about the selenium plugin on Mac. I think I set it up
>> successfully, but I am wondering about the performance.
>> I am using a Mac with an Intel Core i5 processor and 8GB of RAM, and I
>> found that each URL fetched takes about 1 second to open and close
>> the Firefox window. Is that a normal speed, or is something wrong? And is
>> it possible to install the selenium grid plugin on Mac? I will cry if you
>> ask me to change machines now......
>> 
>> 
>> Best,
>> Jiaxin
>> 
>> 
>> On Fri, Feb 13, 2015 at 2:09 PM, Mohammed Omer
>> <beancinematics@gmail.com> wrote:
>> 
>> No worries man, glad everything works! Glad, since I was having hostname
>> issues with nutch/hbase just now as I quickly tried to get it
>> working/fixed for ya, ha.
>> 
>> Mo
>> 
>> 
>> On Fri, Feb 13, 2015 at 2:57 PM, Shuo Li <sli491@usc.edu> wrote:
>> 
>> Hey guys,
>> 
>> 
>> After changing my RAM to 2GB, everything works fine. My bad. Thanks for
>> your help.
>> 
>> 
>> Regards,
>> Shuo Li
>> 
>> 
>> On Fri, Feb 13, 2015 at 11:34 AM, Mattmann, Chris A (3980)
>> <chris.a.mattmann@jpl.nasa.gov> wrote:
>> 
>> Thank you Mo. I sincerely appreciate your guidance and contribution.
>> 
>> I will work to get your nutch selenium grid plugin contributed
>> to work with Nutch 1.x.
>> 
>> Cheers,
>> Chris
>> 
>> 
>> -----Original Message-----
>> From: Mo Omer <beancinematics@gmail.com>
>> Date: Friday, February 13, 2015 at 11:10 AM
>> To: Chris Mattmann <Chris.A.Mattmann@jpl.nasa.gov>
>> Cc: "dev@nutch.apache.org" <dev@nutch.apache.org>
>> Subject: Re: Vagrant Crushed When using Nutch-Selenium
>> 
>>> Hey all,
>>> 
>>> When I had run nutch-selenium, it was in a config such that zombies were
>>> created from closing Firefox windows and they couldn't be reaped (again,
>>> due to the docker configuration I had).
>>> 
>>> In a normal setup, it should not be an issue - if you're running 20
>>> threads in nutch that's potentially 20 open FF windows which isn't good
>>> for 512mb.
>>> 
>>> Selenium grid is much more efficient, in that browsers are opened, but
>>> tabs are used to fetch sites - and only those are closed.
>>> 
>>> Additionally, ensure you're using Nutch 2.2.1.
>>> 
>>> Feel free to fork, patch, tinker, and PR as needed.
>>> 
>>> Chris, if you want to be added to contribs on the GitHub project, that's
>>> cool with me! Wish I could dedicate more time to this, but I don't
>>> foresee using Nutch again in the near future, and am now working on
>>> projects that require lots of reading and possibly patches to Caffe and
>>> opencl r-CNN projects.
>>> 
>>> Tl;dr:
>>> - no, this shouldn't be typical unless you're creating zombies like crazy
>>> and they're not being reaped (too many open file descriptors), running
>>> out of memory, or similar resource constraint.
>>> - selenium grid is TONs more efficient, but a bit more difficult to set
>>> up. I used it to crawl 100ks of sites.
>>> - unfortunately I can't commit more time to this, but if I can assist in
>>> any admin way, let me know.
>>> 
>>> Thank you,
>>> 
>>> Mo
>>> 
>>> This message was drafted on a tiny touch screen; please forgive brevity &
>>> tpyos
>>> 
>>>> On Feb 13, 2015, at 12:41 PM, "Mattmann, Chris A (3980)"
>>>> <chris.a.mattmann@jpl.nasa.gov> wrote:
>>>> 
>>>> Oh yes, please up your memory to at least 2 GB.
>>>> 
>>>> -----Original Message-----
>>>> From: Shuo Li <sli491@usc.edu>
>>>> Reply-To: "dev@nutch.apache.org" <dev@nutch.apache.org>
>>>> Date: Friday, February 13, 2015 at 10:38 AM
>>>> To: "dev@nutch.apache.org" <dev@nutch.apache.org>
>>>> Cc: Mo Omer <beancinematics@gmail.com>
>>>> Subject: Re: Vagrant Crushed When using Nutch-Selenium
>>>> 
>>>>> Hey Mo and Prof Mattmann,
>>>>> 
>>>>> 
>>>>> I will try to crawl the 3 websites in the homework tonight (NASA AMD,
>>>>> NSF ACADIS and NSIDC Arctic Data Explorer). I will let you know what's
>>>>> going on.
>>>>> 
>>>>> 
>>>>> Is memory an issue? My vagrant only has 512MB of memory.
>>>>> 
>>>>> 
>>>>> Regards,
>>>>> Shuo Li
>>>>> 
>>>>> 
>>>>> On Fri, Feb 13, 2015 at 10:25 AM, Mattmann, Chris A (3980)
>>>>> <chris.a.mattmann@jpl.nasa.gov> wrote:
>>>>> 
>>>>> Hi Shuo,
>>>>> 
>>>>> Thanks for your email. I wonder if using selenium grid would
>>>>> help?
>>>>> 
>>>>> Please see this plugin:
>>>>> https://github.com/momer/nutch-selenium-grid-plugin
>>>>> 
>>>>> 
>>>>> I'm CC'ing Mo, the author of the plugin, to see if he experienced
>>>>> this while running the original selenium plugin. Mo, did using
>>>>> selenium grid help with the issue that Shuo is experiencing below?
>>>>> 
>>>>> Mo: are you cool with porting the grid plugin, or with Lewis or
>>>>> me doing it, to trunk (with full credit to you, of course)?
>>>>> 
>>>>> Cheers,
>>>>> Chris
>>>>> 
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Shuo Li <sli491@usc.edu>
>>>>> Reply-To: "dev@nutch.apache.org" <dev@nutch.apache.org>
>>>>> Date: Friday, February 13, 2015 at 10:12 AM
>>>>> To: "dev@nutch.apache.org" <dev@nutch.apache.org>
>>>>> Subject: Vagrant Crushed When using Nutch-Selenium
>>>>> 
>>>>>> Hey guys,
>>>>>> 
>>>>>> 
>>>>>> I'm trying to use Nutch-Selenium to crawl nutch.apache.org. However,
>>>>>> my Vagrant VM seems to have crashed after a few minutes. I forced it
>>>>>> to shut down, and it turns out it had crawled only 59 pages. My Nutch
>>>>>> version is 1.10 and my OS is Ubuntu Trusty, 14.04.
>>>>>> 
>>>>>> 
>>>>>> Is there anything I can provide to you guys? Has anybody else had the
>>>>>> same issue? Or is 59 pages the complete crawl?
>>>>>> 
>>>>>> 
>>>>>> Any suggestions would be appreciated.
>>>>>> 
>>>>>> 
>>>>>> Regards,
>>>>>> Shuo Li
> 

Re: Vagrant Crushed When using Nutch-Selenium

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Thanks Mo, great advice.

-----Original Message-----
From: Jiaxin Ye <ji...@usc.edu>
Reply-To: "dev@nutch.apache.org" <de...@nutch.apache.org>
Date: Tuesday, February 17, 2015 at 2:49 PM
To: Mohammed Omer <be...@gmail.com>
Cc: "dev@nutch.apache.org" <de...@nutch.apache.org>
Subject: Re: Vagrant Crushed When using Nutch-Selenium

>
>
>
>Thank you so much!! I am going to try it out tonight.
>
>On Tuesday, February 17, 2015, Mohammed Omer <be...@gmail.com>
>wrote:
>
>Jiaxin, 
>
>
>Each page takes about 3 seconds to crawl due to this piece of code - we
>allow selenium 3 seconds to grab the page [0]. Due to what I was
>crawling, I didn't want to wait for a specific element/class/id to show
>up. However, you can change it up if you want.
> Selenium documentation [1] has more info on Ex/Implicit waiting.
>
>
>Again, it's not the most efficient way to crawl; but, if you need JS to
>render, it's a backwards way that ensures it happens. Selenium Grid has
>the benefit of being able to handle more throughput, but at the end of
>the day we're waiting for a browser to
> go out and fetch the url.
>
>
>I've suggested that most items be configurable when merged into trunk
>[2], but I'll make a specific call-out to the wait time.
>
>
>Due to the way Selenium standalone works, it's wayyyyyy less efficient
>than a 'Grid' set-up (hub + nodes) [3], which is why I switched to that
>set-up. 
>
>
>Wish I could help out more, but 30 threads might be too much. 5 threads,
>at a total fetch/parse time of 4 seconds per url, would still
>theoretically churn out > 100k urls per day. There are multiple tweaks
>that could be made to optimize for your system,
> I'd start with reducing thread count, as you might be saturating your
>system [4].
>
>
>Sorry I can't be of more help!
>
>
>Thank you,
>
>
>Mo
>
>
>[0]: https://github.com/momer/nutch-selenium/blob/master/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java#L48-L49
>[1]: http://docs.seleniumhq.org/docs/04_webdriver_advanced.jsp
>[2]: https://issues.apache.org/jira/browse/NUTCH-1933
>[3]: https://code.google.com/p/selenium/wiki/Grid2
>[4]: http://stackoverflow.com/a/4895271
>
>
>On Mon, Feb 16, 2015 at 2:13 AM, Jiaxin Ye <jiaxinye@usc.edu> wrote:
>
>I am using fetcher.threads.per.queue = 30 by the way.
>
>On Mon, Feb 16, 2015 at 12:08 AM, Jiaxin Ye <jiaxinye@usc.edu> wrote:
>
>Hi Mo,
>
>
>I have a question about the selenium plugin on Mac. I think I set it up
>successfully, but I am wondering about the performance.
>I am using a Mac with an Intel Core i5 processor and 8GB of RAM, and I
>found that each URL fetched takes about 1 second to open and close
>the Firefox window. Is that a normal speed, or is something wrong? And is
>it possible to install the selenium grid plugin on Mac? I will cry if you
>ask me to change machines now......
>
>
>Best,
>Jiaxin
>
>
>On Fri, Feb 13, 2015 at 2:09 PM, Mohammed Omer
><beancinematics@gmail.com> wrote:
>
>No worries man, glad everything works! Glad, since I was having hostname
>issues with nutch/hbase just now as I quickly tried to get it
>working/fixed for ya, ha.
>
>Mo
>
>
>On Fri, Feb 13, 2015 at 2:57 PM, Shuo Li <sli491@usc.edu> wrote:
>
>Hey guys,
>
>
>After change my RAM to 2GB, everything works fine. My bad. Thanks for
>your help.
>
>
>Regards,
>Shuo Li
>
>
>On Fri, Feb 13, 2015 at 11:34 AM, Mattmann, Chris A (3980)
><chris.a.mattmann@jpl.nasa.gov> wrote:
>
>Thank you Mo. I sincerely appreciate your guidance and contribution.
>
>I will work to get your nutch selenium grid plugin contributed
>to work with Nutch 1.x.
>
>Cheers,
>Chris
>
>
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Chris Mattmann, Ph.D.
>Chief Architect
>Instrument Software and Science Data Systems Section (398)
>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>Office: 168-519, Mailstop: 168-527
>Email: chris.a.mattmann@nasa.gov
>WWW:  http://sunset.usc.edu/~mattmann/
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Adjunct Associate Professor, Computer Science Department
>University of Southern California, Los Angeles, CA 90089 USA
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>
>
>
>-----Original Message-----
>From: Mo Omer <beancinematics@gmail.com>
>Date: Friday, February 13, 2015 at 11:10 AM
>To: Chris Mattmann <Chris.A.Mattmann@jpl.nasa.gov>
>Cc: "dev@nutch.apache.org" <dev@nutch.apache.org>
>Subject: Re: Vagrant Crushed When using Nutch-Selenium
>
>>Hey all,
>>
>>When I had run nutch-selenium, it was in a config such that zombies were
>>created from closing Firefox windows and they couldn't be reaped (again,
>>due to the docker configuration I had).
>>
>>In a normal setup, it should not be an issue - if you're running 20
>>threads in nutch that's potentially 20 open FF windows which isn't good
>>for 512mb.
>>
>>Selenium grid is much more efficient, in that browsers are opened, but
>>tabs are used to fetch sites - and only those are closed.
>>
>>Additionally, ensure you're using Nutch 2.2.1.
>>
>>Feel free to fork patch and tinker and PR as needed.
>>
>>Chris, if you want to be added to contribs on the GitHub project, that's
>>cool with me! Wish I could dedicate more time to this, but I don't
>>foresee using Nutch again in the near future, and am now working on
>>projects that require lots of reading and possibly patches to Caffe and
>>opencl r-CNN projects.
>>
>>Tl;dr:
>>- no, this shouldn't be typical unless you're creating zombies like crazy
>>and they're not being reaped (too many open file descriptors), running
>>out of memory, or similar resource constraint.
>>- selenium grid is TONs more efficient, but a bit more difficult to set
>>up. I used it to crawl 100ks of sites.
>>- unfortunately I can't commit more time to this, but if I can assist in
>>any admin way, let me know.
>>
>>Thank you,
>>
>>Mo
>>
>>This message was drafted on a tiny touch screen; please forgive brevity &
>>tpyos
>>
>>> On Feb 13, 2015, at 12:41 PM, "Mattmann, Chris A (3980)"
>>><chris.a.mattmann@jpl.nasa.gov> wrote:
>>>
>>> Oh yes, please up your memory to like at least 2Gb..
>>>
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Chris Mattmann, Ph.D.
>>> Chief Architect
>>> Instrument Software and Science Data Systems Section (398)
>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>> Office: 168-519, Mailstop: 168-527
>>> Email: chris.a.mattmann@nasa.gov
>>> WWW:  http://sunset.usc.edu/~mattmann/
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Adjunct Associate Professor, Computer Science Department
>>> University of Southern California, Los Angeles, CA 90089 USA
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>
>>>
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Shuo Li <sli491@usc.edu>
>>> Reply-To: "dev@nutch.apache.org" <dev@nutch.apache.org>
>>> Date: Friday, February 13, 2015 at 10:38 AM
>>> To: "dev@nutch.apache.org" <dev@nutch.apache.org>
>>> Cc: Mo Omer <beancinematics@gmail.com>
>>> Subject: Re: Vagrant Crushed When using Nutch-Selenium
>>>
>>>> Hey Mo and Prof Mattmann,
>>>>
>>>>
>>>> I will try to crawl the 3 websites in the homework tonight (NASA AMD,
>>>>NSF
>>>> ACADIS and NSIDC Arctic Data Explorer). I will let you know what's
>>>>going
>>>> on.
>>>>
>>>>
>>>> Is memory an issue? My vagrant only has 512MB of memory.
>>>>
>>>>
>>>> Regards,
>>>> Shuo Li
>>>>
>>>>
>>>> On Fri, Feb 13, 2015 at 10:25 AM, Mattmann, Chris A (3980)
>>>> <chris.a.mattmann@jpl.nasa.gov> wrote:
>>>>
>>>> Hi Shuo,
>>>>
>>>> Thanks for your email. I wonder if using selenium grid would
>>>> help?
>>>>
>>>> Please see this plugin:
>>>>
>>>> https://github.com/momer/nutch-selenium-grid-plugin
>>>>
>>>>
>>>> I’m CC’ing Mo the author of the plugin to see if he experienced
>>>> this while running the original selenium plugin - Mo did using
>>>> selenium grid help the issue that Shuo is experiencing below?
>>>>
>>>> Mo: are you cool with porting the grid plugin, or if Lewis or
>>>> I do it to trunk (with full credit to you of course?)
>>>>
>>>> Cheers,
>>>> Chris
>>>>
>>>>
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> Chris Mattmann, Ph.D.
>>>> Chief Architect
>>>> Instrument Software and Science Data Systems Section (398)
>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>> Office: 168-519, Mailstop: 168-527
>>>> Email: chris.a.mattmann@nasa.gov
>>>> WWW:  http://sunset.usc.edu/~mattmann/
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> Adjunct Associate Professor, Computer Science Department
>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Shuo Li <sli491@usc.edu>
>>>> Reply-To: "dev@nutch.apache.org" <dev@nutch.apache.org>
>>>> Date: Friday, February 13, 2015 at 10:12 AM
>>>> To: "dev@nutch.apache.org" <dev@nutch.apache.org>
>>>> Subject: Vagrant Crushed When using Nutch-Selenium
>>>>
>>>>> Hey guys,
>>>>>
>>>>>
>>>>> I'm trying to use Nutch-Selenium to crawl nutch.apache.org.
>>>>> However, my vagrant seems crushed after a few minutes. I forced it
>>>>> to shut down and it turns out it only crawled 59 websites. My nutch
>>>>> version is 1.10 and my OS is Ubuntu Trusty, 14.04.
>>>>>
>>>>>
>>>>> Is there anything I can provide to you guys? Or is there anybody have
>>>>>the
>>>>> same issue? Or 59 websites is the complete crawling?
>>>>>
>>>>>
>>>>> Any suggestion would be appreciated.
>>>>>
>>>>>
>>>>> Regards,
>>>>> Shuo Li
>>>


Re: Vagrant Crushed When using Nutch-Selenium

Posted by Jiaxin Ye <ji...@usc.edu>.
Thank you so much!! I am going to try it out tonight.

On Tuesday, February 17, 2015, Mohammed Omer <be...@gmail.com>
wrote:

> Jiaxin,
>
> Each page takes about 3 seconds to crawl because of this piece of code: we
> allow Selenium 3 seconds to grab the page [0]. Given what I was crawling,
> I didn't want to wait for a specific element/class/id to show up. However,
> you can change that if you want. The Selenium documentation [1] has more
> info on explicit and implicit waits.
>
> Again, it's not the most efficient way to crawl; but, if you need JS to
> render, it's a backwards way that ensures it happens. Selenium Grid has the
> benefit of being able to handle more throughput, but at the end of the day
> we're waiting for a browser to go out and fetch the url.
>
> I've suggested that most items be configurable when merged into trunk [2],
> but I'll make a specific call-out to the wait time.
>
> Due to the way Selenium standalone works, it's wayyyyyy less efficient
> than a 'Grid' set-up (hub + nodes) [3], which is why I switched to that
> set-up.
>
> Wish I could help out more, but 30 threads might be too much. 5 threads,
> at a total fetch/parse time of 4 seconds per URL, would still theoretically
> churn out > 100k URLs per day. There are multiple tweaks that could be made
> to optimize for your system; I'd start with reducing thread count, as you
> might be saturating your system [4].
>
> Sorry I can't be of more help!
>
> Thank you,
>
> Mo
>
> [0]:
> https://github.com/momer/nutch-selenium/blob/master/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java#L48-L49
> [1]: http://docs.seleniumhq.org/docs/04_webdriver_advanced.jsp
> [2]: https://issues.apache.org/jira/browse/NUTCH-1933
> [3]: https://code.google.com/p/selenium/wiki/Grid2
> [4]: http://stackoverflow.com/a/4895271
>
> On Mon, Feb 16, 2015 at 2:13 AM, Jiaxin Ye <jiaxinye@usc.edu> wrote:
>
>> I am using fetcher.threads.per.queue = 30 by the way.
>>
>> On Mon, Feb 16, 2015 at 12:08 AM, Jiaxin Ye <jiaxinye@usc.edu> wrote:
>>
>>> Hi Mo,
>>>
>>> I have a problem about the selenium plugin on mac. I think I
>>> successfully set it up on mac but I have a question about the performance.
>>> I am using a Mac with Intel Core i5 processor and 8GB ram, but I found
>>> that each url fetched takes about 1 seconds to open and close
>>> the firefox window. Is it a normal speed? or anything is wrong? And is
>>> it possible to install selenium grid plugin on Mac? I will cry if you
>>> ask me to change machine now......
>>>
>>> Best,
>>> Jiaxin
>>>
>>> On Fri, Feb 13, 2015 at 2:09 PM, Mohammed Omer <beancinematics@gmail.com>
>>> wrote:
>>>
>>>> No worries man, glad everything works! Glad, since I was having
>>>> hostname issues with nutch/hbase just now as I quickly tried to get it
>>>> working/fixed for ya, ha.
>>>>
>>>> Mo
>>>>
>>>> On Fri, Feb 13, 2015 at 2:57 PM, Shuo Li <sli491@usc.edu> wrote:
>>>>
>>>>> Hey guys,
>>>>>
>>>>> After change my RAM to 2GB, everything works fine. My bad. Thanks for
>>>>> your help.
>>>>>
>>>>> Regards,
>>>>> Shuo Li
>>>>>
>>>>> On Fri, Feb 13, 2015 at 11:34 AM, Mattmann, Chris A (3980)
>>>>> <chris.a.mattmann@jpl.nasa.gov> wrote:
>>>>>
>>>>>> Thank you Mo. I sincerely appreciate your guidance and contribution.
>>>>>>
>>>>>> I will work to get your nutch selenium grid plugin contributed
>>>>>> to work with Nutch 1.x.
>>>>>>
>>>>>> Cheers,
>>>>>> Chris
>>>>>>
>>>>>>
>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>> Chris Mattmann, Ph.D.
>>>>>> Chief Architect
>>>>>> Instrument Software and Science Data Systems Section (398)
>>>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>>>> Office: 168-519, Mailstop: 168-527
>>>>>> Email: chris.a.mattmann@nasa.gov
>>>>>> WWW:  http://sunset.usc.edu/~mattmann/
>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>> Adjunct Associate Professor, Computer Science Department
>>>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Mo Omer <beancinematics@gmail.com>
>>>>>> Date: Friday, February 13, 2015 at 11:10 AM
>>>>>> To: Chris Mattmann <Chris.A.Mattmann@jpl.nasa.gov>
>>>>>> Cc: "dev@nutch.apache.org" <dev@nutch.apache.org>
>>>>>> Subject: Re: Vagrant Crushed When using Nutch-Selenium
>>>>>>
>>>>>> >Hey all,
>>>>>> >
>>>>>> >When I had run nutch-selenium, it was in a config such that zombies
>>>>>> were
>>>>>> >created from closing Firefox windows and they couldn't be reaped
>>>>>> (again,
>>>>>> >due to the docker configuration I had).
>>>>>> >
>>>>>> >In a normal setup, it should not be an issue - if you're running 20
>>>>>> >threads in nutch that's potentially 20 open FF windows which isn't
>>>>>> good
>>>>>> >for 512mb.
>>>>>> >
>>>>>> >Selenium grid is much more efficient, in that browsers are opened,
>>>>>> but
>>>>>> >tabs are used to fetch sites - and only those are closed.
>>>>>> >
>>>>>> >Additionally, ensure you're using Nutch 2.2.1.
>>>>>> >
>>>>>> >Feel free to fork patch and tinker and PR as needed.
>>>>>> >
>>>>>> >Chris, if you want to be added to contribs on the GitHub project,
>>>>>> that's
>>>>>> >cool with me! Wish I could dedicate more time to this, but I don't
>>>>>> >foresee using Nutch again in the near future, and am now working on
>>>>>> >projects that require lots of reading and possibly patches to Caffe
>>>>>> and
>>>>>> >opencl r-CNN projects.
>>>>>> >
>>>>>> >Tl;dr:
>>>>>> >- no, this shouldn't be typical unless you're creating zombies like
>>>>>> crazy
>>>>>> >and they're not being reaped (too many open file descriptors),
>>>>>> running
>>>>>> >out of memory, or similar resource constraint.
>>>>>> >- selenium grid is TONs more efficient, but a bit more difficult to
>>>>>> set
>>>>>> >up. I used it to crawl 100ks of sites.
>>>>>> >- unfortunately I can't commit more time to this, but if I can
>>>>>> assist in
>>>>>> >any admin way, let me know.
>>>>>> >
>>>>>> >Thank you,
>>>>>> >
>>>>>> >Mo
>>>>>> >
>>>>>> >This message was drafted on a tiny touch screen; please forgive
>>>>>> brevity &
>>>>>> >tpyos
>>>>>> >
>>>>>> >> On Feb 13, 2015, at 12:41 PM, "Mattmann, Chris A (3980)"
>>>>>> >><chris.a.mattmann@jpl.nasa.gov> wrote:
>>>>>> >>
>>>>>> >> Oh yes, please up your memory to like at least 2Gb..
>>>>>> >>
>>>>>> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>> >> Chris Mattmann, Ph.D.
>>>>>> >> Chief Architect
>>>>>> >> Instrument Software and Science Data Systems Section (398)
>>>>>> >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>>>> >> Office: 168-519, Mailstop: 168-527
>>>>>> >> Email: chris.a.mattmann@nasa.gov
>>>>>> >> WWW:  http://sunset.usc.edu/~mattmann/
>>>>>> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>> >> Adjunct Associate Professor, Computer Science Department
>>>>>> >> University of Southern California, Los Angeles, CA 90089 USA
>>>>>> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >> -----Original Message-----
>>>>>> >> From: Shuo Li <sli491@usc.edu>
>>>>>> >> Reply-To: "dev@nutch.apache.org" <dev@nutch.apache.org>
>>>>>> >> Date: Friday, February 13, 2015 at 10:38 AM
>>>>>> >> To: "dev@nutch.apache.org" <dev@nutch.apache.org>
>>>>>> >> Cc: Mo Omer <beancinematics@gmail.com>
>>>>>> >> Subject: Re: Vagrant Crushed When using Nutch-Selenium
>>>>>> >>
>>>>>> >>> Hey Mo and Prof Mattmann,
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> I will try to crawl the 3 websites in the homework tonight (NASA
>>>>>> AMD,
>>>>>> >>>NSF
>>>>>> >>> ACADIS and NSIDC Arctic Data Explorer). I will let you know what's
>>>>>> >>>going
>>>>>> >>> on.
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> Is memory an issue? My vagrant only has 512MB of memory.
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> Regards,
>>>>>> >>> Shuo Li
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> On Fri, Feb 13, 2015 at 10:25 AM, Mattmann, Chris A (3980)
>>>>>> >>> <chris.a.mattmann@jpl.nasa.gov> wrote:
>>>>>> >>>
>>>>>> >>> Hi Shuo,
>>>>>> >>>
>>>>>> >>> Thanks for your email. I wonder if using selenium grid would
>>>>>> >>> help?
>>>>>> >>>
>>>>>> >>> Please see this plugin:
>>>>>> >>>
>>>>>> >>> https://github.com/momer/nutch-selenium-grid-plugin
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> I’m CC’ing Mo the author of the plugin to see if he experienced
>>>>>> >>> this while running the original selenium plugin - Mo did using
>>>>>> >>> selenium grid help the issue that Shuo is experiencing below?
>>>>>> >>>
>>>>>> >>> Mo: are you cool with porting the grid plugin, or if Lewis or
>>>>>> >>> I do it to trunk (with full credit to you of course?)
>>>>>> >>>
>>>>>> >>> Cheers,
>>>>>> >>> Chris
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>> >>> Chris Mattmann, Ph.D.
>>>>>> >>> Chief Architect
>>>>>> >>> Instrument Software and Science Data Systems Section (398)
>>>>>> >>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>>>> >>> Office: 168-519, Mailstop: 168-527
>>>>>> >>> Email: chris.a.mattmann@nasa.gov
>>>>>> >>> WWW:  http://sunset.usc.edu/~mattmann/
>>>>>> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>> >>> Adjunct Associate Professor, Computer Science Department
>>>>>> >>> University of Southern California, Los Angeles, CA 90089 USA
>>>>>> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> -----Original Message-----
>>>>>> >>> From: Shuo Li <sli491@usc.edu>
>>>>>> >>> Reply-To: "dev@nutch.apache.org" <dev@nutch.apache.org>
>>>>>> >>> Date: Friday, February 13, 2015 at 10:12 AM
>>>>>> >>> To: "dev@nutch.apache.org" <dev@nutch.apache.org>
>>>>>> >>> Subject: Vagrant Crushed When using Nutch-Selenium
>>>>>> >>>
>>>>>> >>>> Hey guys,
>>>>>> >>>>
>>>>>> >>>>
>>>>>> >>>> I'm trying to use Nutch-Selenium to crawl nutch.apache.org.
>>>>>> >>>> However, my vagrant seems crushed after a few minutes. I forced
>>>>>> >>>> it to shut down and it turns out it only crawled 59 websites. My
>>>>>> >>>> nutch version is 1.10 and my OS is Ubuntu Trusty, 14.04.
>>>>>> >>>>
>>>>>> >>>>
>>>>>> >>>> Is there anything I can provide to you guys? Or is there anybody
>>>>>> have
>>>>>> >>>>the
>>>>>> >>>> same issue? Or 59 websites is the complete crawling?
>>>>>> >>>>
>>>>>> >>>>
>>>>>> >>>> Any suggestion would be appreciated.
>>>>>> >>>>
>>>>>> >>>>
>>>>>> >>>> Regards,
>>>>>> >>>> Shuo Li
>>>>>> >>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
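
Mo's note above about the fixed three-second wait, as opposed to waiting for a specific element, can be illustrated without Selenium. The sketch below is plain Python, not the plugin's Java code and not the Selenium API: `wait_until`, its parameters and the stand-in "page is ready" condition are all hypothetical names chosen for illustration. It polls a condition and returns as soon as it holds, rather than always sleeping for the full timeout.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float = 3.0,
               poll_interval: float = 0.1) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds pass.

    Returns True if the condition held before the deadline, else False.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


if __name__ == "__main__":
    # Hypothetical stand-in for "the element has rendered": true after 0.3 s.
    ready_at = time.monotonic() + 0.3
    start = time.monotonic()
    found = wait_until(lambda: time.monotonic() >= ready_at, timeout=3.0)
    print(found, f"waited {time.monotonic() - start:.1f}s, not a fixed 3s")
```

A fixed wait is simpler and needs no knowledge of the page, which is the trade-off Mo describes; a polling wait can finish earlier but only if there is a reliable condition to check.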

Re: Vagrant Crushed When using Nutch-Selenium

Posted by Jiaxin Ye <ji...@usc.edu>.
I am using fetcher.threads.per.queue = 30 by the way.
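
For a rough sense of what a thread count buys, here is a back-of-the-envelope sketch in Python of the throughput estimate Mo gives elsewhere in this thread (5 threads at about 4 seconds per URL). The function name `urls_per_day` is hypothetical, and the model assumes every thread stays busy and every URL takes the same time, which a real crawl will not do.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def urls_per_day(threads: int, seconds_per_url: float) -> float:
    """Idealized upper bound: every thread fetches back-to-back, no stalls."""
    if threads < 1 or seconds_per_url <= 0:
        raise ValueError("threads must be >= 1 and seconds_per_url must be > 0")
    return threads * SECONDS_PER_DAY / seconds_per_url


if __name__ == "__main__":
    # Figures from the thread: 5 threads, ~4 s total fetch/parse time per URL.
    print(urls_per_day(5, 4))   # 108000.0
    # Same idealized model at 30 threads; it ignores memory and CPU limits.
    print(urls_per_day(30, 4))  # 648000.0
```

The model says nothing about memory or CPU. As noted in the thread, with the standalone plugin each thread can mean another open Firefox window, so a higher thread count only helps while the machine can keep that many browsers alive.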

On Mon, Feb 16, 2015 at 12:08 AM, Jiaxin Ye <ji...@usc.edu> wrote:

> Hi Mo,
>
> I have a problem about the selenium plugin on mac. I think I successfully
> set it up on mac but I have a question about the performance.
> I am using a Mac with Intel Core i5 processor and 8GB ram, but I found
> that each url fetched takes about 1 seconds to open and close
> the firefox window. Is it a normal speed? or anything is wrong? And is it
> possible to install selenium grid plugin on Mac? I will cry if you
> ask me to change machine now......
>
> Best,
> Jiaxin
>
> On Fri, Feb 13, 2015 at 2:09 PM, Mohammed Omer <be...@gmail.com>
> wrote:
>
>> No worries man, glad everything works! Glad, since I was having hostname
>> issues with nutch/hbase just now as I quickly tried to get it working/fixed
>> for ya, ha.
>>
>> Mo
>>
>> On Fri, Feb 13, 2015 at 2:57 PM, Shuo Li <sl...@usc.edu> wrote:
>>
>>> Hey guys,
>>>
>>> After change my RAM to 2GB, everything works fine. My bad. Thanks for
>>> your help.
>>>
>>> Regards,
>>> Shuo Li
>>>
>>> On Fri, Feb 13, 2015 at 11:34 AM, Mattmann, Chris A (3980) <
>>> chris.a.mattmann@jpl.nasa.gov> wrote:
>>>
>>>> Thank you Mo. I sincerely appreciate your guidance and contribution.
>>>>
>>>> I will work to get your nutch selenium grid plugin contributed
>>>> to work with Nutch 1.x.
>>>>
>>>> Cheers,
>>>> Chris
>>>>
>>>>
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> Chris Mattmann, Ph.D.
>>>> Chief Architect
>>>> Instrument Software and Science Data Systems Section (398)
>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>> Office: 168-519, Mailstop: 168-527
>>>> Email: chris.a.mattmann@nasa.gov
>>>> WWW:  http://sunset.usc.edu/~mattmann/
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> Adjunct Associate Professor, Computer Science Department
>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Mo Omer <be...@gmail.com>
>>>> Date: Friday, February 13, 2015 at 11:10 AM
>>>> To: Chris Mattmann <Ch...@jpl.nasa.gov>
>>>> Cc: "dev@nutch.apache.org" <de...@nutch.apache.org>
>>>> Subject: Re: Vagrant Crushed When using Nutch-Selenium
>>>>
>>>> >Hey all,
>>>> >
>>>> >When I had run nutch-selenium, it was in a config such that zombies
>>>> were
>>>> >created from closing Firefox windows and they couldn't be reaped
>>>> (again,
>>>> >due to the docker configuration I had).
>>>> >
>>>> >In a normal setup, it should not be an issue - if you're running 20
>>>> >threads in nutch that's potentially 20 open FF windows which isn't good
>>>> >for 512mb.
>>>> >
>>>> >Selenium grid is much more efficient, in that browsers are opened, but
>>>> >tabs are used to fetch sites - and only those are closed.
>>>> >
>>>> >Additionally, ensure you're using Nutch 2.2.1.
>>>> >
>>>> >Feel free to fork patch and tinker and PR as needed.
>>>> >
>>>> >Chris, if you want to be added to contribs on the GitHub project,
>>>> that's
>>>> >cool with me! Wish I could dedicate more time to this, but I don't
>>>> >foresee using Nutch again in the near future, and am now working on
>>>> >projects that require lots of reading and possibly patches to Caffe and
>>>> >opencl r-CNN projects.
>>>> >
>>>> >Tl;dr:
>>>> >- no, this shouldn't be typical unless you're creating zombies like
>>>> crazy
>>>> >and they're not being reaped (too many open file descriptors), running
>>>> >out of memory, or similar resource constraint.
>>>> >- selenium grid is TONs more efficient, but a bit more difficult to set
>>>> >up. I used it to crawl 100ks of sites.
>>>> >- unfortunately I can't commit more time to this, but if I can assist
>>>> in
>>>> >any admin way, let me know.
>>>> >
>>>> >Thank you,
>>>> >
>>>> >Mo
>>>> >
>>>> >This message was drafted on a tiny touch screen; please forgive
>>>> brevity &
>>>> >tpyos
>>>> >
>>>> >> On Feb 13, 2015, at 12:41 PM, "Mattmann, Chris A (3980)"
>>>> >><ch...@jpl.nasa.gov> wrote:
>>>> >>
>>>> >> Oh yes, please up your memory to like at least 2Gb..
>>>> >>
>>>> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> >> Chris Mattmann, Ph.D.
>>>> >> Chief Architect
>>>> >> Instrument Software and Science Data Systems Section (398)
>>>> >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>> >> Office: 168-519, Mailstop: 168-527
>>>> >> Email: chris.a.mattmann@nasa.gov
>>>> >> WWW:  http://sunset.usc.edu/~mattmann/
>>>> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> >> Adjunct Associate Professor, Computer Science Department
>>>> >> University of Southern California, Los Angeles, CA 90089 USA
>>>> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >> -----Original Message-----
>>>> >> From: Shuo Li <sl...@usc.edu>
>>>> >> Reply-To: "dev@nutch.apache.org" <de...@nutch.apache.org>
>>>> >> Date: Friday, February 13, 2015 at 10:38 AM
>>>> >> To: "dev@nutch.apache.org" <de...@nutch.apache.org>
>>>> >> Cc: Mo Omer <be...@gmail.com>
>>>> >> Subject: Re: Vagrant Crushed When using Nutch-Selenium
>>>> >>
>>>> >>> Hey Mo and Prof Mattmann,
>>>> >>>
>>>> >>>
>>>> >>> I will try to crawl the 3 websites in the homework tonight (NASA
>>>> AMD,
>>>> >>>NSF
>>>> >>> ACADIS and NSIDC Arctic Data Explorer). I will let you know what's
>>>> >>>going
>>>> >>> on.
>>>> >>>
>>>> >>>
>>>> >>> Is memory an issue? My vagrant only has 512MB of memory.
>>>> >>>
>>>> >>>
>>>> >>> Regards,
>>>> >>> Shuo Li
>>>> >>>
>>>> >>>
>>>> >>> On Fri, Feb 13, 2015 at 10:25 AM, Mattmann, Chris A (3980)
>>>> >>> <ch...@jpl.nasa.gov> wrote:
>>>> >>>
>>>> >>> Hi Shuo,
>>>> >>>
>>>> >>> Thanks for your email. I wonder if using selenium grid would
>>>> >>> help?
>>>> >>>
>>>> >>> Please see this plugin:
>>>> >>>
>>>> >>> https://github.com/momer/nutch-selenium-grid-plugin
>>>> >>>
>>>> >>>
>>>> >>> I’m CC’ing Mo the author of the plugin to see if he experienced
>>>> >>> this while running the original selenium plugin - Mo did using
>>>> >>> selenium grid help the issue that Shuo is experiencing below?
>>>> >>>
>>>> >>> Mo: are you cool with porting the grid plugin, or if Lewis or
>>>> >>> I do it to trunk (with full credit to you of course?)
>>>> >>>
>>>> >>> Cheers,
>>>> >>> Chris
>>>> >>>
>>>> >>>
>>>> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> >>> Chris Mattmann, Ph.D.
>>>> >>> Chief Architect
>>>> >>> Instrument Software and Science Data Systems Section (398)
>>>> >>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>> >>> Office: 168-519, Mailstop: 168-527
>>>> >>> Email: chris.a.mattmann@nasa.gov
>>>> >>> WWW:  http://sunset.usc.edu/~mattmann/
>>>> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> >>> Adjunct Associate Professor, Computer Science Department
>>>> >>> University of Southern California, Los Angeles, CA 90089 USA
>>>> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> -----Original Message-----
>>>> >>> From: Shuo Li <sl...@usc.edu>
>>>> >>> Reply-To: "dev@nutch.apache.org" <de...@nutch.apache.org>
>>>> >>> Date: Friday, February 13, 2015 at 10:12 AM
>>>> >>> To: "dev@nutch.apache.org" <de...@nutch.apache.org>
>>>> >>> Subject: Vagrant Crushed When using Nutch-Selenium
>>>> >>>
>>>> >>>> Hey guys,
>>>> >>>>
>>>> >>>>
>>>> >>>> I'm trying to use Nutch-Selenium to crawl nutch.apache.org.
>>>> >>>> However, my vagrant seems crushed after a few minutes. I forced it
>>>> >>>> to shut down and it turns out it only crawled 59 websites. My nutch
>>>> >>>> version is 1.10 and my OS is Ubuntu Trusty, 14.04.
>>>> >>>>
>>>> >>>>
>>>> >>>> Is there anything I can provide to you guys? Or is there anybody
>>>> have
>>>> >>>>the
>>>> >>>> same issue? Or 59 websites is the complete crawling?
>>>> >>>>
>>>> >>>>
>>>> >>>> Any suggestion would be appreciated.
>>>> >>>>
>>>> >>>>
>>>> >>>> Regards,
>>>> >>>> Shuo Li
>>>> >>
>>>>
>>>>
>>>
>>
>

Re: Vagrant Crushed When using Nutch-Selenium

Posted by Jiaxin Ye <ji...@usc.edu>.
Hi Mo,

I have a question about the Selenium plugin on Mac. I think I set it up
successfully, but I'm unsure about the performance.
I am using a Mac with an Intel Core i5 processor and 8GB RAM, but I found
that each URL fetched takes about 1 second to open and close
the Firefox window. Is that a normal speed, or is something wrong? And is it
possible to install the Selenium Grid plugin on Mac? I will cry if you
ask me to change machines now......

Best,
Jiaxin

On Fri, Feb 13, 2015 at 2:09 PM, Mohammed Omer <be...@gmail.com>
wrote:

> No worries man, glad everything works! Glad, since I was having hostname
> issues with nutch/hbase just now as I quickly tried to get it working/fixed
> for ya, ha.
>
> Mo
>
> On Fri, Feb 13, 2015 at 2:57 PM, Shuo Li <sl...@usc.edu> wrote:
>
>> Hey guys,
>>
>> After change my RAM to 2GB, everything works fine. My bad. Thanks for
>> your help.
>>
>> Regards,
>> Shuo Li
>>
>> On Fri, Feb 13, 2015 at 11:34 AM, Mattmann, Chris A (3980) <
>> chris.a.mattmann@jpl.nasa.gov> wrote:
>>
>>> Thank you Mo. I sincerely appreciate your guidance and contribution.
>>>
>>> I will work to get your nutch selenium grid plugin contributed
>>> to work with Nutch 1.x.
>>>
>>> Cheers,
>>> Chris
>>>
>>>
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Chris Mattmann, Ph.D.
>>> Chief Architect
>>> Instrument Software and Science Data Systems Section (398)
>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>> Office: 168-519, Mailstop: 168-527
>>> Email: chris.a.mattmann@nasa.gov
>>> WWW:  http://sunset.usc.edu/~mattmann/
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Adjunct Associate Professor, Computer Science Department
>>> University of Southern California, Los Angeles, CA 90089 USA
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>
>>>
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Mo Omer <be...@gmail.com>
>>> Date: Friday, February 13, 2015 at 11:10 AM
>>> To: Chris Mattmann <Ch...@jpl.nasa.gov>
>>> Cc: "dev@nutch.apache.org" <de...@nutch.apache.org>
>>> Subject: Re: Vagrant Crushed When using Nutch-Selenium
>>>
>>> >Hey all,
>>> >
>>> >When I had run nutch-selenium, it was in a config such that zombies were
>>> >created from closing Firefox windows and they couldn't be reaped (again,
>>> >due to the docker configuration I had).
>>> >
>>> >In a normal setup, it should not be an issue - if you're running 20
>>> >threads in nutch that's potentially 20 open FF windows which isn't good
>>> >for 512mb.
>>> >
>>> >Selenium grid is much more efficient, in that browsers are opened, but
>>> >tabs are used to fetch sites - and only those are closed.
>>> >
>>> >Additionally, ensure you're using Nutch 2.2.1.
>>> >
>>> >Feel free to fork patch and tinker and PR as needed.
>>> >
>>> >Chris, if you want to be added to contribs on the GitHub project, that's
>>> >cool with me! Wish I could dedicate more time to this, but I don't
>>> >foresee using Nutch again in the near future, and am now working on
>>> >projects that require lots of reading and possibly patches to Caffe and
>>> >opencl r-CNN projects.
>>> >
>>> >Tl;dr:
>>> >- no, this shouldn't be typical unless you're creating zombies like crazy
>>> >and they're not being reaped (too many open file descriptors), running
>>> >out of memory, or similar resource constraint.
>>> >- selenium grid is TONs more efficient, but a bit more difficult to set
>>> >up. I used it to crawl 100ks of sites.
>>> >- unfortunately I can't commit more time to this, but if I can assist in
>>> >any admin way, let me know.
>>> >
>>> >Thank you,
>>> >
>>> >Mo
>>> >
>>> >This message was drafted on a tiny touch screen; please forgive brevity &
>>> >tpyos
>>> >
>>> >> On Feb 13, 2015, at 12:41 PM, "Mattmann, Chris A (3980)"
>>> >><ch...@jpl.nasa.gov> wrote:
>>> >>
>>> >> Oh yes, please up your memory to like at least 2Gb..
>>> >>
>>> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> >> Chris Mattmann, Ph.D.
>>> >> Chief Architect
>>> >> Instrument Software and Science Data Systems Section (398)
>>> >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>> >> Office: 168-519, Mailstop: 168-527
>>> >> Email: chris.a.mattmann@nasa.gov
>>> >> WWW:  http://sunset.usc.edu/~mattmann/
>>> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> >> Adjunct Associate Professor, Computer Science Department
>>> >> University of Southern California, Los Angeles, CA 90089 USA
>>> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> -----Original Message-----
>>> >> From: Shuo Li <sl...@usc.edu>
>>> >> Reply-To: "dev@nutch.apache.org" <de...@nutch.apache.org>
>>> >> Date: Friday, February 13, 2015 at 10:38 AM
>>> >> To: "dev@nutch.apache.org" <de...@nutch.apache.org>
>>> >> Cc: Mo Omer <be...@gmail.com>
>>> >> Subject: Re: Vagrant Crushed When using Nutch-Selenium
>>> >>
>>> >>> Hey Mo and Prof Mattmann,
>>> >>>
>>> >>>
>>> >>> I will try to crawl the 3 websites in the homework tonight (NASA AMD,
>>> >>>NSF
>>> >>> ACADIS and NSIDC Arctic Data Explorer). I will let you know what's
>>> >>>going
>>> >>> on.
>>> >>>
>>> >>>
>>> >>> Is memory an issue? My vagrant only has 512MB of memory.
>>> >>>
>>> >>>
>>> >>> Regards,
>>> >>> Shuo Li
>>> >>>
>>> >>>
>>> >>> On Fri, Feb 13, 2015 at 10:25 AM, Mattmann, Chris A (3980)
>>> >>> <ch...@jpl.nasa.gov> wrote:
>>> >>>
>>> >>> Hi Shuo,
>>> >>>
>>> >>> Thanks for your email. I wonder if using selenium grid would
>>> >>> help?
>>> >>>
>>> >>> Please see this plugin:
>>> >>>
>>> >>> https://github.com/momer/nutch-selenium-grid-plugin
>>> >>>
>>> >>>
>>> >>> I’m CC’ing Mo, the author of the plugin, to see if he experienced
>>> >>> this while running the original selenium plugin - Mo, did using
>>> >>> selenium grid help with the issue that Shuo is experiencing below?
>>> >>>
>>> >>> Mo: are you cool with porting the grid plugin, or if Lewis or
>>> >>> I do it, to trunk (with full credit to you, of course)?
>>> >>>
>>> >>> Cheers,
>>> >>> Chris
>>> >>>
>>> >>>
>>> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> >>> Chris Mattmann, Ph.D.
>>> >>> Chief Architect
>>> >>> Instrument Software and Science Data Systems Section (398)
>>> >>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>> >>> Office: 168-519, Mailstop: 168-527
>>> >>> Email: chris.a.mattmann@nasa.gov
>>> >>> WWW:  http://sunset.usc.edu/~mattmann/
>>> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> >>> Adjunct Associate Professor, Computer Science Department
>>> >>> University of Southern California, Los Angeles, CA 90089 USA
>>> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>> -----Original Message-----
>>> >>> From: Shuo Li <sl...@usc.edu>
>>> >>> Reply-To: "dev@nutch.apache.org" <de...@nutch.apache.org>
>>> >>> Date: Friday, February 13, 2015 at 10:12 AM
>>> >>> To: "dev@nutch.apache.org" <de...@nutch.apache.org>
>>> >>> Subject: Vagrant Crushed When using Nutch-Selenium
>>> >>>
>>> >>>> Hey guys,
>>> >>>>
>>> >>>>
>>> >>>> I'm trying to use Nutch-Selenium to crawl
>>> >>>> nutch.apache.org <http://nutch.apache.org> <http://nutch.apache.org>.
>>> >>>> However, my vagrant seems
>>> >>>> crushed after a few minutes. I forced it to shut down and it turns out it
>>> >>>> only crawled 59 websites. My nutch version is 1.10 and my OS is Ubuntu
>>> >>>> Trusty, 14.04.
>>> >>>>
>>> >>>>
>>> >>>> Is there anything I can provide to you guys? Or is there anybody have the
>>> >>>> same issue? Or 59 websites is the complete crawling?
>>> >>>>
>>> >>>>
>>> >>>> Any suggestion would be appreciated.
>>> >>>>
>>> >>>>
>>> >>>> Regards,
>>> >>>> Shuo Li
>>> >>
>>>
>>>
>>
>

Re: Vagrant Crushed When using Nutch-Selenium

Posted by Mohammed Omer <be...@gmail.com>.
No worries man, glad everything works! Glad, since I was having hostname
issues with nutch/hbase just now as I quickly tried to get it working/fixed
for ya, ha.

Mo

On Fri, Feb 13, 2015 at 2:57 PM, Shuo Li <sl...@usc.edu> wrote:

> Hey guys,
>
> After change my RAM to 2GB, everything works fine. My bad. Thanks for your
> help.
>
> Regards,
> Shuo Li
>
> On Fri, Feb 13, 2015 at 11:34 AM, Mattmann, Chris A (3980) <
> chris.a.mattmann@jpl.nasa.gov> wrote:
>
>> Thank you Mo. I sincerely appreciate your guidance and contribution.
>>
>> I will work to get your nutch selenium grid plugin contributed
>> to work with Nutch 1.x.
>>
>> Cheers,
>> Chris
>>
>>
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Chief Architect
>> Instrument Software and Science Data Systems Section (398)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 168-519, Mailstop: 168-527
>> Email: chris.a.mattmann@nasa.gov
>> WWW:  http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>
>>
>>
>>
>>
>>
>> -----Original Message-----
>> From: Mo Omer <be...@gmail.com>
>> Date: Friday, February 13, 2015 at 11:10 AM
>> To: Chris Mattmann <Ch...@jpl.nasa.gov>
>> Cc: "dev@nutch.apache.org" <de...@nutch.apache.org>
>> Subject: Re: Vagrant Crushed When using Nutch-Selenium
>>
>> >Hey all,
>> >
>> >When I had run nutch-selenium, it was in a config such that zombies were
>> >created from closing Firefox windows and they couldn't be reaped (again,
>> >due to the docker configuration I had).
>> >
>> >In a normal setup, it should not be an issue - if you're running 20
>> >threads in nutch that's potentially 20 open FF windows which isn't good
>> >for 512mb.
>> >
>> >Selenium grid is much more efficient, in that browsers are opened, but
>> >tabs are used to fetch sites - and only those are closed.
>> >
>> >Additionally, ensure you're using Nutch 2.2.1.
>> >
>> >Feel free to fork patch and tinker and PR as needed.
>> >
>> >Chris, if you want to be added to contribs on the GitHub project, that's
>> >cool with me! Wish I could dedicate more time to this, but I don't
>> >foresee using Nutch again in the near future, and am now working on
>> >projects that require lots of reading and possibly patches to Caffe and
>> >opencl r-CNN projects.
>> >
>> >Tl;dr:
>> >- no, this shouldn't be typical unless you're creating zombies like crazy
>> >and they're not being reaped (too many open file descriptors), running
>> >out of memory, or similar resource constraint.
>> >- selenium grid is TONs more efficient, but a bit more difficult to set
>> >up. I used it to crawl 100ks of sites.
>> >- unfortunately I can't commit more time to this, but if I can assist in
>> >any admin way, let me know.
>> >
>> >Thank you,
>> >
>> >Mo
>> >
>> >This message was drafted on a tiny touch screen; please forgive brevity &
>> >tpyos
>> >
>> >> On Feb 13, 2015, at 12:41 PM, "Mattmann, Chris A (3980)"
>> >><ch...@jpl.nasa.gov> wrote:
>> >>
>> >> Oh yes, please up your memory to like at least 2Gb..
>> >>
>> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> >> Chris Mattmann, Ph.D.
>> >> Chief Architect
>> >> Instrument Software and Science Data Systems Section (398)
>> >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> >> Office: 168-519, Mailstop: 168-527
>> >> Email: chris.a.mattmann@nasa.gov
>> >> WWW:  http://sunset.usc.edu/~mattmann/
>> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> >> Adjunct Associate Professor, Computer Science Department
>> >> University of Southern California, Los Angeles, CA 90089 USA
>> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> -----Original Message-----
>> >> From: Shuo Li <sl...@usc.edu>
>> >> Reply-To: "dev@nutch.apache.org" <de...@nutch.apache.org>
>> >> Date: Friday, February 13, 2015 at 10:38 AM
>> >> To: "dev@nutch.apache.org" <de...@nutch.apache.org>
>> >> Cc: Mo Omer <be...@gmail.com>
>> >> Subject: Re: Vagrant Crushed When using Nutch-Selenium
>> >>
>> >>> Hey Mo and Prof Mattmann,
>> >>>
>> >>>
>> >>> I will try to crawl the 3 websites in the homework tonight (NASA AMD,
>> >>>NSF
>> >>> ACADIS and NSIDC Arctic Data Explorer). I will let you know what's
>> >>>going
>> >>> on.
>> >>>
>> >>>
>> >>> Is memory an issue? My vagrant only has 512MB of memory.
>> >>>
>> >>>
>> >>> Regards,
>> >>> Shuo Li
>> >>>
>> >>>
>> >>> On Fri, Feb 13, 2015 at 10:25 AM, Mattmann, Chris A (3980)
>> >>> <ch...@jpl.nasa.gov> wrote:
>> >>>
>> >>> Hi Shuo,
>> >>>
>> >>> Thanks for your email. I wonder if using selenium grid would
>> >>> help?
>> >>>
>> >>> Please see this plugin:
>> >>>
>> >>> https://github.com/momer/nutch-selenium-grid-plugin
>> >>>
>> >>>
>> >>> I’m CC’ing Mo, the author of the plugin, to see if he experienced
>> >>> this while running the original selenium plugin - Mo, did using
>> >>> selenium grid help with the issue that Shuo is experiencing below?
>> >>>
>> >>> Mo: are you cool with porting the grid plugin, or if Lewis or
>> >>> I do it, to trunk (with full credit to you, of course)?
>> >>>
>> >>> Cheers,
>> >>> Chris
>> >>>
>> >>>
>> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> >>> Chris Mattmann, Ph.D.
>> >>> Chief Architect
>> >>> Instrument Software and Science Data Systems Section (398)
>> >>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> >>> Office: 168-519, Mailstop: 168-527
>> >>> Email: chris.a.mattmann@nasa.gov
>> >>> WWW:  http://sunset.usc.edu/~mattmann/
>> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> >>> Adjunct Associate Professor, Computer Science Department
>> >>> University of Southern California, Los Angeles, CA 90089 USA
>> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> -----Original Message-----
>> >>> From: Shuo Li <sl...@usc.edu>
>> >>> Reply-To: "dev@nutch.apache.org" <de...@nutch.apache.org>
>> >>> Date: Friday, February 13, 2015 at 10:12 AM
>> >>> To: "dev@nutch.apache.org" <de...@nutch.apache.org>
>> >>> Subject: Vagrant Crushed When using Nutch-Selenium
>> >>>
>> >>>> Hey guys,
>> >>>>
>> >>>>
>> >>>> I'm trying to use Nutch-Selenium to crawl
>> >>>> nutch.apache.org <http://nutch.apache.org> <http://nutch.apache.org>.
>> >>>> However, my vagrant seems
>> >>>> crushed after a few minutes. I forced it to shut down and it turns out it
>> >>>> only crawled 59 websites. My nutch version is 1.10 and my OS is Ubuntu
>> >>>> Trusty, 14.04.
>> >>>>
>> >>>>
>> >>>> Is there anything I can provide to you guys? Or is there anybody have the
>> >>>> same issue? Or 59 websites is the complete crawling?
>> >>>>
>> >>>>
>> >>>> Any suggestion would be appreciated.
>> >>>>
>> >>>>
>> >>>> Regards,
>> >>>> Shuo Li
>> >>
>>
>>
>

Re: Vagrant Crushed When using Nutch-Selenium

Posted by Shuo Li <sl...@usc.edu>.
Hey guys,

After changing my RAM to 2GB, everything works fine. My bad. Thanks for your
help.

Regards,
Shuo Li

On Fri, Feb 13, 2015 at 11:34 AM, Mattmann, Chris A (3980) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> Thank you Mo. I sincerely appreciate your guidance and contribution.
>
> I will work to get your nutch selenium grid plugin contributed
> to work with Nutch 1.x.
>
> Cheers,
> Chris
>
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattmann@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>
>
>
> -----Original Message-----
> From: Mo Omer <be...@gmail.com>
> Date: Friday, February 13, 2015 at 11:10 AM
> To: Chris Mattmann <Ch...@jpl.nasa.gov>
> Cc: "dev@nutch.apache.org" <de...@nutch.apache.org>
> Subject: Re: Vagrant Crushed When using Nutch-Selenium
>
> >Hey all,
> >
> >When I had run nutch-selenium, it was in a config such that zombies were
> >created from closing Firefox windows and they couldn't be reaped (again,
> >due to the docker configuration I had).
> >
> >In a normal setup, it should not be an issue - if you're running 20
> >threads in nutch that's potentially 20 open FF windows which isn't good
> >for 512mb.
> >
> >Selenium grid is much more efficient, in that browsers are opened, but
> >tabs are used to fetch sites - and only those are closed.
> >
> >Additionally, ensure you're using Nutch 2.2.1.
> >
> >Feel free to fork patch and tinker and PR as needed.
> >
> >Chris, if you want to be added to contribs on the GitHub project, that's
> >cool with me! Wish I could dedicate more time to this, but I don't
> >foresee using Nutch again in the near future, and am now working on
> >projects that require lots of reading and possibly patches to Caffe and
> >opencl r-CNN projects.
> >
> >Tl;dr:
> >- no, this shouldn't be typical unless you're creating zombies like crazy
> >and they're not being reaped (too many open file descriptors), running
> >out of memory, or similar resource constraint.
> >- selenium grid is TONs more efficient, but a bit more difficult to set
> >up. I used it to crawl 100ks of sites.
> >- unfortunately I can't commit more time to this, but if I can assist in
> >any admin way, let me know.
> >
> >Thank you,
> >
> >Mo
> >
> >This message was drafted on a tiny touch screen; please forgive brevity &
> >tpyos
> >
> >> On Feb 13, 2015, at 12:41 PM, "Mattmann, Chris A (3980)"
> >><ch...@jpl.nasa.gov> wrote:
> >>
> >> Oh yes, please up your memory to like at least 2Gb..
> >>
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> Chris Mattmann, Ph.D.
> >> Chief Architect
> >> Instrument Software and Science Data Systems Section (398)
> >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >> Office: 168-519, Mailstop: 168-527
> >> Email: chris.a.mattmann@nasa.gov
> >> WWW:  http://sunset.usc.edu/~mattmann/
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> Adjunct Associate Professor, Computer Science Department
> >> University of Southern California, Los Angeles, CA 90089 USA
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>
> >>
> >>
> >>
> >>
> >>
> >> -----Original Message-----
> >> From: Shuo Li <sl...@usc.edu>
> >> Reply-To: "dev@nutch.apache.org" <de...@nutch.apache.org>
> >> Date: Friday, February 13, 2015 at 10:38 AM
> >> To: "dev@nutch.apache.org" <de...@nutch.apache.org>
> >> Cc: Mo Omer <be...@gmail.com>
> >> Subject: Re: Vagrant Crushed When using Nutch-Selenium
> >>
> >>> Hey Mo and Prof Mattmann,
> >>>
> >>>
> >>> I will try to crawl the 3 websites in the homework tonight (NASA AMD,
> >>>NSF
> >>> ACADIS and NSIDC Arctic Data Explorer). I will let you know what's
> >>>going
> >>> on.
> >>>
> >>>
> >>> Is memory an issue? My vagrant only has 512MB of memory.
> >>>
> >>>
> >>> Regards,
> >>> Shuo Li
> >>>
> >>>
> >>> On Fri, Feb 13, 2015 at 10:25 AM, Mattmann, Chris A (3980)
> >>> <ch...@jpl.nasa.gov> wrote:
> >>>
> >>> Hi Shuo,
> >>>
> >>> Thanks for your email. I wonder if using selenium grid would
> >>> help?
> >>>
> >>> Please see this plugin:
> >>>
> >>> https://github.com/momer/nutch-selenium-grid-plugin
> >>>
> >>>
> >>> I’m CC’ing Mo, the author of the plugin, to see if he experienced
> >>> this while running the original selenium plugin - Mo, did using
> >>> selenium grid help with the issue that Shuo is experiencing below?
> >>>
> >>> Mo: are you cool with porting the grid plugin, or if Lewis or
> >>> I do it, to trunk (with full credit to you, of course)?
> >>>
> >>> Cheers,
> >>> Chris
> >>>
> >>>
> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>> Chris Mattmann, Ph.D.
> >>> Chief Architect
> >>> Instrument Software and Science Data Systems Section (398)
> >>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >>> Office: 168-519, Mailstop: 168-527
> >>> Email: chris.a.mattmann@nasa.gov
> >>> WWW:  http://sunset.usc.edu/~mattmann/
> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>> Adjunct Associate Professor, Computer Science Department
> >>> University of Southern California, Los Angeles, CA 90089 USA
> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> -----Original Message-----
> >>> From: Shuo Li <sl...@usc.edu>
> >>> Reply-To: "dev@nutch.apache.org" <de...@nutch.apache.org>
> >>> Date: Friday, February 13, 2015 at 10:12 AM
> >>> To: "dev@nutch.apache.org" <de...@nutch.apache.org>
> >>> Subject: Vagrant Crushed When using Nutch-Selenium
> >>>
> >>>> Hey guys,
> >>>>
> >>>>
> >>>> I'm trying to use Nutch-Selenium to crawl
> >>>> nutch.apache.org <http://nutch.apache.org> <http://nutch.apache.org>.
> >>>> However, my vagrant seems
> >>>> crushed after a few minutes. I forced it to shut down and it turns
> >>>>out it
> >>>> only crawled 59 websites. My nutch version is 1.10 and my OS is Ubuntu
> >>>> Trusty, 14.04.
> >>>>
> >>>>
> >>>> Is there anything I can provide to you guys? Or is there anybody have
> >>>>the
> >>>> same issue? Or 59 websites is the complete crawling?
> >>>>
> >>>>
> >>>> Any suggestion would be appreciated.
> >>>>
> >>>>
> >>>> Regards,
> >>>> Shuo Li
> >>
>
>

Re: Vagrant Crushed When using Nutch-Selenium

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Thank you Mo. I sincerely appreciate your guidance and contribution.

I will work to get your nutch selenium grid plugin contributed
and working with Nutch 1.x.

Cheers,
Chris


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: Mo Omer <be...@gmail.com>
Date: Friday, February 13, 2015 at 11:10 AM
To: Chris Mattmann <Ch...@jpl.nasa.gov>
Cc: "dev@nutch.apache.org" <de...@nutch.apache.org>
Subject: Re: Vagrant Crushed When using Nutch-Selenium

>Hey all,
>
>When I had run nutch-selenium, it was in a config such that zombies were
>created from closing Firefox windows and they couldn't be reaped (again,
>due to the docker configuration I had).
>
>In a normal setup, it should not be an issue - if you're running 20
>threads in nutch that's potentially 20 open FF windows which isn't good
>for 512mb.
>
>Selenium grid is much more efficient, in that browsers are opened, but
>tabs are used to fetch sites - and only those are closed.
>
>Additionally, ensure you're using Nutch 2.2.1.
>
>Feel free to fork patch and tinker and PR as needed.
>
>Chris, if you want to be added to contribs on the GitHub project, that's
>cool with me! Wish I could dedicate more time to this, but I don't
>foresee using Nutch again in the near future, and am now working on
>projects that require lots of reading and possibly patches to Caffe and
>opencl r-CNN projects.
>
>Tl;dr: 
>- no, this shouldn't be typical unless you're creating zombies like crazy
>and they're not being reaped (too many open file descriptors), running
>out of memory, or similar resource constraint.
>- selenium grid is TONs more efficient, but a bit more difficult to set
>up. I used it to crawl 100ks of sites.
>- unfortunately I can't commit more time to this, but if I can assist in
>any admin way, let me know.
>
>Thank you,
>
>Mo
>
>This message was drafted on a tiny touch screen; please forgive brevity &
>tpyos
>
>> On Feb 13, 2015, at 12:41 PM, "Mattmann, Chris A (3980)"
>><ch...@jpl.nasa.gov> wrote:
>> 
>> Oh yes, please up your memory to like at least 2Gb..
>> 
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Chief Architect
>> Instrument Software and Science Data Systems Section (398)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 168-519, Mailstop: 168-527
>> Email: chris.a.mattmann@nasa.gov
>> WWW:  http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> 
>> 
>> 
>> 
>> 
>> 
>> -----Original Message-----
>> From: Shuo Li <sl...@usc.edu>
>> Reply-To: "dev@nutch.apache.org" <de...@nutch.apache.org>
>> Date: Friday, February 13, 2015 at 10:38 AM
>> To: "dev@nutch.apache.org" <de...@nutch.apache.org>
>> Cc: Mo Omer <be...@gmail.com>
>> Subject: Re: Vagrant Crushed When using Nutch-Selenium
>> 
>>> Hey Mo and Prof Mattmann,
>>> 
>>> 
>>> I will try to crawl the 3 websites in the homework tonight (NASA AMD,
>>>NSF
>>> ACADIS and NSIDC Arctic Data Explorer). I will let you know what's
>>>going
>>> on. 
>>> 
>>> 
>>> Is memory an issue? My vagrant only has 512MB of memory.
>>> 
>>> 
>>> Regards,
>>> Shuo Li
>>> 
>>> 
>>> On Fri, Feb 13, 2015 at 10:25 AM, Mattmann, Chris A (3980)
>>> <ch...@jpl.nasa.gov> wrote:
>>> 
>>> Hi Shuo,
>>> 
>>> Thanks for your email. I wonder if using selenium grid would
>>> help?
>>> 
>>> Please see this plugin:
>>> 
>>> https://github.com/momer/nutch-selenium-grid-plugin
>>> 
>>> 
>>> I’m CC’ing Mo, the author of the plugin, to see if he experienced
>>> this while running the original selenium plugin - Mo, did using
>>> selenium grid help with the issue that Shuo is experiencing below?
>>> 
>>> Mo: are you cool with porting the grid plugin, or if Lewis or
>>> I do it, to trunk (with full credit to you, of course)?
>>> 
>>> Cheers,
>>> Chris
>>> 
>>> 
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Chris Mattmann, Ph.D.
>>> Chief Architect
>>> Instrument Software and Science Data Systems Section (398)
>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>> Office: 168-519, Mailstop: 168-527
>>> Email: chris.a.mattmann@nasa.gov
>>> WWW:  http://sunset.usc.edu/~mattmann/
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Adjunct Associate Professor, Computer Science Department
>>> University of Southern California, Los Angeles, CA 90089 USA
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> -----Original Message-----
>>> From: Shuo Li <sl...@usc.edu>
>>> Reply-To: "dev@nutch.apache.org" <de...@nutch.apache.org>
>>> Date: Friday, February 13, 2015 at 10:12 AM
>>> To: "dev@nutch.apache.org" <de...@nutch.apache.org>
>>> Subject: Vagrant Crushed When using Nutch-Selenium
>>> 
>>>> Hey guys,
>>>> 
>>>> 
>>>> I'm trying to use Nutch-Selenium to crawl
>>>> nutch.apache.org <http://nutch.apache.org> <http://nutch.apache.org>.
>>>> However, my vagrant seems
>>>> crushed after a few minutes. I forced it to shut down and it turns
>>>>out it
>>>> only crawled 59 websites. My nutch version is 1.10 and my OS is Ubuntu
>>>> Trusty, 14.04.
>>>> 
>>>> 
>>>> Is there anything I can provide to you guys? Or is there anybody have
>>>>the
>>>> same issue? Or 59 websites is the complete crawling?
>>>> 
>>>> 
>>>> Any suggestion would be appreciated.
>>>> 
>>>> 
>>>> Regards,
>>>> Shuo Li
>> 


Re: Vagrant Crushed When using Nutch-Selenium

Posted by Mo Omer <be...@gmail.com>.
Hey all,

When I ran nutch-selenium, it was in a config such that zombies were created from closing Firefox windows and they couldn't be reaped (again, due to the docker configuration I had).

In a normal setup, it should not be an issue. That said, if you're running 20 threads in nutch, that's potentially 20 open FF windows, which isn't good for 512 MB.
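
As a rough back-of-the-envelope sketch of the point above (not a measurement): if each fetcher thread can hold one Firefox window open, the worst-case browser memory is roughly threads times per-window footprint. The 150 MB per window and the 512 MB reserved for the OS and the Nutch JVM are assumed figures for illustration only, and the function names are hypothetical. In Nutch the thread count is typically set through the `fetcher.threads.fetch` property.

```python
def browser_memory_mb(threads: int, mb_per_browser: int) -> int:
    """Worst-case memory if every fetcher thread holds one browser window open."""
    return threads * mb_per_browser


def max_threads(vm_memory_mb: int, mb_per_browser: int, reserved_mb: int) -> int:
    """Largest thread count whose browsers fit after reserving memory for OS and JVM."""
    available = vm_memory_mb - reserved_mb
    if available <= 0:
        return 0
    return available // mb_per_browser


# Assumed figures for illustration only; measure your own setup.
MB_PER_FIREFOX = 150  # hypothetical per-window footprint
RESERVED_MB = 512     # hypothetical headroom for the OS and the Nutch JVM

print(browser_memory_mb(20, MB_PER_FIREFOX))           # 3000
print(max_threads(512, MB_PER_FIREFOX, RESERVED_MB))   # 0
print(max_threads(2048, MB_PER_FIREFOX, RESERVED_MB))  # 10
```

Under these assumptions a 512 MB VM has no room for even one browser, while a 2 GB VM would fit around ten, which is at least consistent with the crawl behaving after the memory bump.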

Selenium grid is much more efficient, in that browsers are opened, but tabs are used to fetch sites - and only those are closed.

Additionally, ensure you're using Nutch 2.2.1.

Feel free to fork, patch, tinker, and PR as needed.

Chris, if you want to be added to contribs on the GitHub project, that's cool with me! Wish I could dedicate more time to this, but I don't foresee using Nutch again in the near future, and am now working on projects that require lots of reading and possibly patches to Caffe and opencl r-CNN projects.

Tl;dr: 
- no, this shouldn't be typical unless you're creating zombies like crazy and they're not being reaped (too many open file descriptors), running out of memory, or hitting a similar resource constraint.
- selenium grid is tons more efficient, but a bit more difficult to set up. I used it to crawl hundreds of thousands of sites.
- unfortunately I can't commit more time to this, but if I can assist in any admin way, let me know.
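
To check whether unreaped zombies are piling up, one option is to look at the process state column of `ps`. The sketch below is only an illustration: it assumes Linux-style output from `ps -eo pid,stat,comm`, where a state starting with `Z` marks a zombie, and `find_zombies` is a hypothetical helper name, not part of Nutch or the plugin.

```python
from typing import List, Tuple


def find_zombies(ps_output: str) -> List[Tuple[int, str]]:
    """Return (pid, command) for every process whose state starts with 'Z'.

    Expects text in the shape printed by `ps -eo pid,stat,comm`:
    a header line followed by one process per line.
    """
    zombies = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)          # pid, stat, rest of line as command
        if len(parts) < 3:
            continue                         # ignore blank or malformed lines
        pid, stat, command = parts
        if stat.startswith("Z"):
            zombies.append((int(pid), command.strip()))
    return zombies


# Made-up sample output, for illustration only.
SAMPLE = """  PID STAT COMMAND
    1 Ss   init
  812 Sl   java
  901 Z    firefox <defunct>
  902 Z    firefox <defunct>
"""

print(find_zombies(SAMPLE))
```

On a live machine the input could come from `subprocess.run(["ps", "-eo", "pid,stat,comm"], capture_output=True, text=True).stdout`. A count that keeps growing while the crawl runs would point at the reaping problem rather than plain memory pressure.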

Thank you,

Mo

This message was drafted on a tiny touch screen; please forgive brevity & tpyos

> On Feb 13, 2015, at 12:41 PM, "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov> wrote:
> 
> Oh yes, please up your memory to like at least 2Gb..
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattmann@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 
> 
> 
> 
> 
> 
> -----Original Message-----
> From: Shuo Li <sl...@usc.edu>
> Reply-To: "dev@nutch.apache.org" <de...@nutch.apache.org>
> Date: Friday, February 13, 2015 at 10:38 AM
> To: "dev@nutch.apache.org" <de...@nutch.apache.org>
> Cc: Mo Omer <be...@gmail.com>
> Subject: Re: Vagrant Crushed When using Nutch-Selenium
> 
>> Hey Mo and Prof Mattmann,
>> 
>> 
>> I will try to crawl the 3 websites in the homework tonight (NASA AMD, NSF
>> ACADIS and NSIDC Arctic Data Explorer). I will let you know what's going
>> on. 
>> 
>> 
>> Is memory an issue? My vagrant only has 512MB of memory.
>> 
>> 
>> Regards,
>> Shuo Li
>> 
>> 
>> On Fri, Feb 13, 2015 at 10:25 AM, Mattmann, Chris A (3980)
>> <ch...@jpl.nasa.gov> wrote:
>> 
>> Hi Shuo,
>> 
>> Thanks for your email. I wonder if using selenium grid would
>> help?
>> 
>> Please see this plugin:
>> 
>> https://github.com/momer/nutch-selenium-grid-plugin
>> 
>> 
>> I’m CC’ing Mo, the author of the plugin, to see if he experienced
>> this while running the original selenium plugin - Mo, did using
>> selenium grid help with the issue that Shuo is experiencing below?
>> 
>> Mo: are you cool with porting the grid plugin, or if Lewis or
>> I do it, to trunk (with full credit to you, of course)?
>> 
>> Cheers,
>> Chris
>> 
>> 
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Chief Architect
>> Instrument Software and Science Data Systems Section (398)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 168-519, Mailstop: 168-527
>> Email: chris.a.mattmann@nasa.gov
>> WWW:  http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> 
>> 
>> 
>> 
>> 
>> 
>> -----Original Message-----
>> From: Shuo Li <sl...@usc.edu>
>> Reply-To: "dev@nutch.apache.org" <de...@nutch.apache.org>
>> Date: Friday, February 13, 2015 at 10:12 AM
>> To: "dev@nutch.apache.org" <de...@nutch.apache.org>
>> Subject: Vagrant Crushed When using Nutch-Selenium
>> 
>>> Hey guys,
>>> 
>>> 
>>> I'm trying to use Nutch-Selenium to crawl
>>> nutch.apache.org <http://nutch.apache.org> <http://nutch.apache.org>.
>>> However, my vagrant seems
>>> crushed after a few minutes. I forced it to shut down and it turns out it
>>> only crawled 59 websites. My nutch version is 1.10 and my OS is Ubuntu
>>> Trusty, 14.04.
>>> 
>>> 
>>> Is there anything I can provide to you guys? Or is there anybody have the
>>> same issue? Or 59 websites is the complete crawling?
>>> 
>>> 
>>> Any suggestion would be appreciated.
>>> 
>>> 
>>> Regards,
>>> Shuo Li
> 

Re: Vagrant Crushed When using Nutch-Selenium

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Oh yes, please up your memory to at least 2 GB.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: Shuo Li <sl...@usc.edu>
Reply-To: "dev@nutch.apache.org" <de...@nutch.apache.org>
Date: Friday, February 13, 2015 at 10:38 AM
To: "dev@nutch.apache.org" <de...@nutch.apache.org>
Cc: Mo Omer <be...@gmail.com>
Subject: Re: Vagrant Crushed When using Nutch-Selenium

>Hey Mo and Prof Mattmann,
>
>
>I will try to crawl the 3 websites in the homework tonight (NASA AMD, NSF
>ACADIS and NSIDC Arctic Data Explorer). I will let you know what's going
>on. 
>
>
>Is memory an issue? My vagrant only has 512MB of memory.
>
>
>Regards,
>Shuo Li
>
>
>On Fri, Feb 13, 2015 at 10:25 AM, Mattmann, Chris A (3980)
><ch...@jpl.nasa.gov> wrote:
>
>Hi Shuo,
>
>Thanks for your email. I wonder if using selenium grid would
>help?
>
>Please see this plugin:
>
>https://github.com/momer/nutch-selenium-grid-plugin
>
>
>I’m CC’ing Mo, the author of the plugin, to see if he experienced
>this while running the original selenium plugin - Mo, did using
>selenium grid help with the issue that Shuo is experiencing below?
>
>Mo: are you cool with porting the grid plugin, or if Lewis or
>I do it, to trunk (with full credit to you, of course)?
>
>Cheers,
>Chris
>
>
>-----Original Message-----
>From: Shuo Li <sl...@usc.edu>
>Reply-To: "dev@nutch.apache.org" <de...@nutch.apache.org>
>Date: Friday, February 13, 2015 at 10:12 AM
>To: "dev@nutch.apache.org" <de...@nutch.apache.org>
>Subject: Vagrant Crushed When using Nutch-Selenium
>
>>Hey guys,
>>
>>
>>I'm trying to use Nutch-Selenium to crawl
>>nutch.apache.org <http://nutch.apache.org>.
>>However, my Vagrant seems to have
>>crashed after a few minutes. I forced it to shut down and it turns out it
>>only crawled 59 websites. My Nutch version is 1.10 and my OS is Ubuntu
>>Trusty, 14.04.
>>
>>
>>Is there anything I can provide to you guys? Or does anybody else have the
>>same issue? Or is 59 websites the complete crawl?
>>
>>
>>Any suggestion would be appreciated.
>>
>>
>>Regards,
>>Shuo Li
>>


Re: Vagrant Crushed When using Nutch-Selenium

Posted by Shuo Li <sl...@usc.edu>.
Hey Mo and Prof Mattmann,

I will try to crawl the 3 websites in the homework tonight (NASA AMD, NSF
ACADIS and NSIDC Arctic Data Explorer). I will let you know what's going
on.

Is memory an issue? My vagrant only has 512MB of memory.

Regards,
Shuo Li

On Fri, Feb 13, 2015 at 10:25 AM, Mattmann, Chris A (3980) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> Hi Shuo,
>
> Thanks for your email. I wonder if using selenium grid would
> help?
>
> Please see this plugin:
>
> https://github.com/momer/nutch-selenium-grid-plugin
>
>
> I’m CC’ing Mo, the author of the plugin, to see if he experienced
> this while running the original selenium plugin. Mo, did using
> selenium grid help with the issue that Shuo is experiencing below?
>
> Mo: are you cool with porting the grid plugin, or with Lewis or
> me porting it to trunk (with full credit to you, of course)?
>
> Cheers,
> Chris
>
>
> -----Original Message-----
> From: Shuo Li <sl...@usc.edu>
> Reply-To: "dev@nutch.apache.org" <de...@nutch.apache.org>
> Date: Friday, February 13, 2015 at 10:12 AM
> To: "dev@nutch.apache.org" <de...@nutch.apache.org>
> Subject: Vagrant Crushed When using Nutch-Selenium
>
> >Hey guys,
> >
> >
> >I'm trying to use Nutch-Selenium to crawl
> >nutch.apache.org <http://nutch.apache.org>. However, my Vagrant seems to
> >have crashed after a few minutes. I forced it to shut down and it turns out
> >it only crawled 59 websites. My Nutch version is 1.10 and my OS is Ubuntu
> >Trusty, 14.04.
> >
> >
> >Is there anything I can provide to you guys? Or does anybody else have the
> >same issue? Or is 59 websites the complete crawl?
> >
> >
> >Any suggestion would be appreciated.
> >
> >
> >Regards,
> >Shuo Li
> >
>
>

Re: Vagrant Crushed When using Nutch-Selenium

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Hi Shuo,

Thanks for your email. I wonder if using selenium grid would
help?

Please see this plugin:

https://github.com/momer/nutch-selenium-grid-plugin


I’m CC’ing Mo, the author of the plugin, to see if he experienced
this while running the original selenium plugin. Mo, did using
selenium grid help with the issue that Shuo is experiencing below?

Mo: are you cool with porting the grid plugin, or with Lewis or
me porting it to trunk (with full credit to you, of course)?

Cheers,
Chris


-----Original Message-----
From: Shuo Li <sl...@usc.edu>
Reply-To: "dev@nutch.apache.org" <de...@nutch.apache.org>
Date: Friday, February 13, 2015 at 10:12 AM
To: "dev@nutch.apache.org" <de...@nutch.apache.org>
Subject: Vagrant Crushed When using Nutch-Selenium

>Hey guys,
>
>
>I'm trying to use Nutch-Selenium to crawl
>nutch.apache.org <http://nutch.apache.org>. However, my Vagrant seems to
>have crashed after a few minutes. I forced it to shut down and it turns out
>it only crawled 59 websites. My Nutch version is 1.10 and my OS is Ubuntu
>Trusty, 14.04.
>
>
>Is there anything I can provide to you guys? Or does anybody else have the
>same issue? Or is 59 websites the complete crawl?
>
>
>Any suggestion would be appreciated.
>
>
>Regards,
>Shuo Li
>