Posted to user@nutch.apache.org by Gajalakshmi G <ga...@tcs.com.INVALID> on 2020/10/07 07:28:31 UTC

Nutch 2.4 with selenium

Hi all,

I am trying to crawl a dynamic webpage using Nutch 2.4 with Selenium 3.6.0 and Firefox version 79. The injector job itself already fails with the error below.

java.lang.Exception: java.lang.NullPointerException
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: java.lang.NullPointerException
    at java.io.Reader.<init>(Reader.java:78)
    at java.io.BufferedReader.<init>(BufferedReader.java:101)
    at java.io.BufferedReader.<init>(BufferedReader.java:116)
    at org.apache.nutch.urlfilter.api.RegexURLFilterBase.readRules(RegexURLFilterBase.java:199)
    at org.apache.nutch.urlfilter.api.RegexURLFilterBase.setConf(RegexURLFilterBase.java:171)
    at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:163)
    at org.apache.nutch.net.URLFilters.<init>(URLFilters.java:62)
    at org.apache.nutch.crawl.InjectorJob$UrlMapper.setup(InjectorJob.java:113)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Please guide me on resolving this issue.



Thanks & Regards,
Gajalakshmi.G
Assistant Consultant
Tata Consultancy Services
Mailto: gajalakshmi.g@tcs.com
=====-----=====-----=====
Notice: The information contained in this e-mail
message and/or attachments to it may contain 
confidential or privileged information. If you are 
not the intended recipient, any dissemination, use, 
review, distribution, printing or copying of the 
information contained in this e-mail message 
and/or attachments to it are strictly prohibited. If 
you have received this communication in error, 
please notify us by reply e-mail or telephone and 
immediately and permanently delete the message 
and any attachments. Thank you



Re: Nutch 2.4 with selenium

Posted by Sebastian Nagel <wa...@googlemail.com.INVALID>.
Hi,

> Nutch 2.4 with selenium

Nutch 2.4 does not include any plugin to use Selenium. In addition, 2.4 is for now the last release
on the 2.x branch, which is no longer maintained. You should use 1.x (1.17 is the most recent release).

> standalone nutch crawling with selenium.

For 1.x there's a good README on how to set up protocol-selenium:
  https://github.com/apache/nutch/blob/master/src/plugin/protocol-selenium/README.md

In general, the tutorial is the recommended way to start:
  https://cwiki.apache.org/confluence/display/NUTCH/NutchTutorial
Please try to get it running without Selenium first. It is important to understand how Nutch
works before you start with the clearly more complex Selenium-based crawling.

Best,
Sebastian



Re: Nutch 2.4 with selenium

Posted by Gajalakshmi G <ga...@tcs.com.INVALID>.
Hi,

Thanks for the response. The 'conf/regex-urlfilter.txt' file is present inside the current working directory.

Please guide me, or share useful links, on standalone Nutch crawling with Selenium.



Thanks & Regards,
Gajalakshmi.G
Assistant Consultant
Tata Consultancy Services
Mailto: gajalakshmi.g@tcs.com




Re: Nutch 2.4 with selenium

Posted by Shashanka Balakuntala <sh...@gmail.com>.
Hi Gajalakshmi,

The NPE can be thrown when the rules file is not found on disk. Check whether
the file conf/regex-urlfilter.txt exists in your current working directory.
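
A minimal sketch of why a missing rules file would surface as an NPE inside the BufferedReader constructor, as in the trace above. It assumes (this is not confirmed in the thread) that the reader handed to readRules is null when the file cannot be resolved, and that the file is looked up on the classpath rather than relative to the working directory. The class RulesReaderCheck and its helper methods are hypothetical names used only for illustration.

```java
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class RulesReaderCheck {

    // Hypothetical helper: resolve a resource by name on the classpath.
    // Returns null when the resource is missing, instead of throwing.
    static Reader openFromClasspath(String name) {
        InputStream in = RulesReaderCheck.class.getClassLoader().getResourceAsStream(name);
        return in == null ? null : new InputStreamReader(in, StandardCharsets.UTF_8);
    }

    // Wrapping a null Reader fails in java.io.Reader.<init>, which matches
    // the innermost frames of the "Caused by" section of the stack trace.
    static boolean failsWithNpe(Reader reader) {
        try {
            new BufferedReader(reader);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // File name taken from the thread; override it with the first argument.
        String name = args.length > 0 ? args[0] : "regex-urlfilter.txt";
        Reader reader = openFromClasspath(name);
        if (reader == null) {
            System.out.println(name + " is not on the classpath; wrapping it would throw an NPE: "
                    + failsWithNpe(null));
        } else {
            System.out.println(name + " was found on the classpath");
        }
    }
}
```

Run it with the same classpath the crawl uses. If it reports the file as missing even though conf/regex-urlfilter.txt exists in the working directory, the conf directory is probably not on that classpath.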


*Regards*
  Shashanka Balakuntala Srinivasa


