Posted to user@nutch.apache.org by Venkata MR <Ve...@hcl.com> on 2018/12/07 07:03:30 UTC

Apache Nutch 2.3.1 not able to fetch content rendered by ajax

Hi,

I was trying to fetch content rendered by an Ajax call using Apache Nutch 2.3.1.
It seems that it cannot get the actual rendered content; it only gets the view-source page (as part of the protocol-js plugin).
Has anyone been able to fetch rendered content from an Ajax call using Nutch 2.3.1, or does anyone have suggestions?

Thanks & Regards
Venkata MR
+91 98455 77125

::DISCLAIMER::
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
The contents of this e-mail and any attachment(s) are confidential and intended for the named recipient(s) only. E-mail transmission is not guaranteed to be secure or error-free as information could be intercepted, corrupted, lost, destroyed, arrive late or incomplete, or may contain viruses in transmission. The e mail and its contents (with or without referred errors) shall therefore not attach any liability on the originator or HCL or its affiliates. Views or opinions, if any, presented in this email are solely those of the author and may not necessarily reflect the views or opinions of HCL or its affiliates. Any form of reproduction, dissemination, copying, disclosure, modification, distribution and / or publication of this message without the prior written consent of authorized representative of HCL is strictly prohibited. If you have received this email in error please delete it and notify the sender immediately. Before opening any email and/or attachments, please check them for viruses and other defects.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

RE: Apache Nutch 2.3.1 not able to fetch content rendered by ajax

Posted by Venkata MR <Ve...@hcl.com>.
Hi Sebastian,

Pls find the link for issue: https://issues.apache.org/jira/browse/NUTCH-2681

Thanks & Regards
Venkata MR
+91 98455 77125




Re: Apache Nutch 2.3.1 not able to fetch content rendered by ajax

Posted by Sebastian Nagel <wa...@googlemail.com.INVALID>.
Hi,

sorry for the late reply. This looks like one of those really nasty dependency conflicts between
incompatible class implementations (or versions), which only show up at runtime.
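
A ClassCastException of the form "X cannot be cast to Y", where X normally extends Y, typically means that Y was loaded more than once, by different class loaders. The following is only a minimal sketch for checking this, not part of Nutch: the class name WhereLoaded and the helper describe() are made up for illustration. It prints which class loader and which jar a class was loaded from. To be meaningful it would have to run with the same classpath and class loader as the failing plugin code.

```java
import java.security.CodeSource;

public class WhereLoaded {

    /** Describes which class loader and which jar (if any) a class came from. */
    static String describe(Class<?> c) {
        // A null loader means the class was loaded by the JVM bootstrap loader.
        ClassLoader loader = c.getClassLoader();
        CodeSource src = c.getProtectionDomain().getCodeSource();
        String from = (src == null || src.getLocation() == null)
                ? "JDK runtime (no code source)"
                : src.getLocation().toString();
        return c.getName()
                + "\n  loader: " + (loader == null ? "bootstrap" : loader.toString())
                + "\n  from:   " + from;
    }

    public static void main(String[] args) {
        // The abstract API class named in the ClassCastException.
        System.out.println(describe(javax.xml.parsers.DocumentBuilderFactory.class));
        // The concrete factory that is actually picked on this classpath.
        System.out.println(describe(
                javax.xml.parsers.DocumentBuilderFactory.newInstance().getClass()));
    }
}
```

If the two "from" lines point to jars (or loaders) that each ship their own copy of javax.xml.parsers.DocumentBuilderFactory, that would be consistent with the conflict suspected here.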

These are the potentially conflicting candidates (from current master):

runtime/local/plugins/lib-selenium/xml-apis-1.4.01.jar
     3505  2009-12-09 13:02   javax/xml/parsers/DocumentBuilderFactory.class

runtime/local/plugins/lib-selenium/xercesImpl-2.11.0.jar
       51  2010-11-26 15:37   META-INF/services/javax.xml.parsers.DocumentBuilderFactory
     4546  2010-11-26 15:40   org/apache/xerces/jaxp/DocumentBuilderFactoryImpl.class

runtime/local/lib/xml-apis-1.4.01.jar
     3505  2009-12-09 13:02   javax/xml/parsers/DocumentBuilderFactory.class

runtime/local/lib/xercesImpl-2.11.0.jar
       51  2010-11-26 15:37   META-INF/services/javax.xml.parsers.DocumentBuilderFactory
     4546  2010-11-26 15:40   org/apache/xerces/jaxp/DocumentBuilderFactoryImpl.class

runtime/local/lib/xmlParserAPIs-2.6.2.jar
     2067  2003-11-18 15:19   javax/xml/parsers/DocumentBuilderFactory.class

I have no better idea than trial and error, e.g., remove
   .../lib/xmlParserAPIs-2.6.2.jar
or delete the following line in ivy/ivy.xml:
   <dependency org="xerces" name="xmlParserAPIs" rev="2.6.2" />
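
A listing like the one above can be reproduced after each trial step with a small helper. This is a rough sketch using only the Java standard library; the class name FindClassInJars and the default root "runtime/local" are assumptions for illustration. It walks a directory tree and prints every jar that contains a given entry.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarFile;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FindClassInJars {

    /** Returns every jar below root that contains the given entry, sorted by path. */
    static List<Path> jarsContaining(Path root, String entry) throws IOException {
        List<Path> jars;
        try (Stream<Path> walk = Files.walk(root)) {
            jars = walk.filter(Files::isRegularFile)
                       .filter(p -> p.toString().endsWith(".jar"))
                       .sorted()
                       .collect(Collectors.toList());
        }
        List<Path> hits = new ArrayList<>();
        for (Path jar : jars) {
            try (JarFile jf = new JarFile(jar.toFile())) {
                // getEntry returns null if the jar does not contain the entry.
                if (jf.getEntry(entry) != null) {
                    hits.add(jar);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Assumed defaults: the Nutch runtime directory and the class from the stack trace.
        Path root = Paths.get(args.length > 0 ? args[0] : "runtime/local");
        String entry = args.length > 1
                ? args[1]
                : "javax/xml/parsers/DocumentBuilderFactory.class";
        for (Path jar : jarsContaining(root, entry)) {
            System.out.println(jar);
        }
    }
}
```

Running it before and after removing a jar shows which copies of the class remain on disk. It does not show which copy the JVM actually loads.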

Sorry, but I have no time now or during the next two weeks to try to find a fix or work-around.

Please also open an issue to fix this on
    https://issues.apache.org/jira/projects/NUTCH

Thanks,
Sebastian

On 12/19/18 5:50 AM, Venkata MR wrote:
> Hi All,
> 
> Any input here would be really appreciated. Thanks again.
> 
> Thanks & Regards
> Venkata MR
> +91 98455 77125
> 
> -----Original Message-----
> From: Venkata MR 
> Sent: 18 December 2018 16:40
> To: 'Sebastian Nagel' <wa...@googlemail.com>
> Cc: user@nutch.apache.org
> Subject: RE: Apache Nutch 2.3.1 not able to fetch content rendered by ajax
> 
> +user@nutch.apache.org
> 
> Thanks & Regards
> Venkata MR
> +91 98455 77125
> 
> 
> -----Original Message-----
> From: Venkata MR
> Sent: 18 December 2018 16:05
> To: 'Sebastian Nagel' <wa...@googlemail.com>
> Subject: RE: Apache Nutch 2.3.1 not able to fetch content rendered by ajax
> 
> Hi Sebastian,
> 
> I went with Selenium v2.48.2 and Firefox 31.4.0 as specified. The same casting exception occurs.
> Please find the log details below.
> 
> Caused by: org.openqa.selenium.WebDriverException: java.lang.ClassCastException: org.apache.xerces.jaxp.DocumentBuilderFactoryImpl cannot be cast to javax.xml.parsers.DocumentBuilderFactory
> Build info: version: '2.48.2', revision: '41bccdd10cf2c0560f637404c2d96164b67d9d67', time: '2015-10-09 13:08:06'
> System info: host: '24labs', ip: '10.0.10.24', os.name: 'Linux', os.arch: 'amd64', os.version: '3.10.0-327.13.1.el7.x86_64', java.version: '1.8.0_191'
> Driver info: driver.version: FirefoxDriver
> 	at org.openqa.selenium.firefox.internal.FileExtension.readIdFromInstallRdf(FileExtension.java:142)
> 	at org.openqa.selenium.firefox.internal.FileExtension.writeTo(FileExtension.java:61)
> 	at org.openqa.selenium.firefox.internal.ClasspathExtension.writeTo(ClasspathExtension.java:64)
> 	at org.openqa.selenium.firefox.FirefoxProfile.installExtensions(FirefoxProfile.java:443)
> 	at org.openqa.selenium.firefox.FirefoxProfile.layoutOnDisk(FirefoxProfile.java:421)
> 	at org.openqa.selenium.firefox.internal.NewProfileExtensionConnection.start(NewProfileExtensionConnection.java:95)
> 	... 12 more
> Caused by: java.lang.ClassCastException: org.apache.xerces.jaxp.DocumentBuilderFactoryImpl cannot be cast to javax.xml.parsers.DocumentBuilderFactory
> 	at javax.xml.parsers.DocumentBuilderFactory.newInstance(Unknown Source)
> 	at org.openqa.selenium.firefox.internal.FileExtension.readIdFromInstallRdf(FileExtension.java:95)
> 
> Thanks & Regards
> Venkata MR
> +91 98455 77125
> 
> -----Original Message-----
> From: Sebastian Nagel <wa...@googlemail.com>
> Sent: 17 December 2018 19:53
> To: Venkata MR <Ve...@hcl.com>
> Subject: Re: Apache Nutch 2.3.1 not able to fetch content rendered by ajax
> 
> Hi,
> 
> what happens if you use the same version of Selenium as Nutch 1.15 does - 2.48.2?
> Or at least a "close" version?
> 
> Alternatively, you can try to upgrade the Selenium version in Nutch, but that's not trivial and requires changes in multiple files.
> 
> Best,
> Sebastian
> 
> 
> On 12/17/18 12:07 PM, Venkata MR wrote:
>> Hi Sebastian,
>>
>> Thanks, it works after removing "protocol-httpclient", but there are some compatibility issues between Nutch 1.x, Selenium v3.4.0 and Geckodriver 0.23.0.
>> Here is the exception:
>> Caused by: java.lang.ClassCastException: org.apache.xerces.jaxp.DocumentBuilderFactoryImpl cannot be cast to javax.xml.parsers.DocumentBuilderFactory
>> 	at javax.xml.parsers.DocumentBuilderFactory.newInstance(Unknown Source)
>> 	at org.openqa.selenium.firefox.internal.FileExtension.readIdFromInstallRdf(FileExtension.java:95)
>>
>> Thanks & Regards
>> Venkata MR
>> +91 98455 77125
>>
>>
>> -----Original Message-----
>> From: Sebastian Nagel <wa...@googlemail.com.INVALID>
>> Sent: 17 December 2018 14:57
>> To: user@nutch.apache.org
>> Subject: Re: Apache Nutch 2.3.1 not able to fetch content rendered by ajax
>>
>> Hi,
>>
>>> protocol-httpclient (as the websites are with https).
>>
>> With Nutch 1.15 protocol-selenium supports https. If 
>> protocol-httpclient is also active, it may be used instead of 
>> protocol-selenium. There is no need to activate it, the description in 
>> nutch-default.xml needs to be fixed, see 
>> https://issues.apache.org/jira/browse/NUTCH-2678
>>
>> Note that protocol-interactiveselenium will support https in 1.16.
>>
>> Best,
>> Sebastian
>>
>> On 12/16/18 1:40 PM, Venkata MR wrote:
>>> Hi Lewis,
>>>
>>> Thanks for your email. I tried all the options described in the links you provided, with no success, before reaching out to you again.
>>>
>>> I am trying to crawl websites whose content is rendered at runtime, in order to extract and parse it.
>>> I downloaded the Nutch provided in the email below and added protocol-interactiveselenium and protocol-selenium along with protocol-httpclient (as the websites use https).
>>> Selenium with Firefox is configured and working properly, and Selenium is running while crawling.
>>>
>>> Still, I am not able to get the rendered content. I have attached the nutch-site.xml for reference, in case some configuration is missing.
>>>
>>> I am wondering whether the issue is that the Tika parsers cannot parse the extracted runtime-rendered content, or that Solr (I am using Apache Solr to index the parsed data) has no indexed field to represent the data (the schema has content, url, title and id).
>>>
>>> Any input to resolve the issue would be really appreciated.
>>>
>>> Environment: CentOS-7
>>> Firefox: 60.3 oesr (64 bit)
>>> Selenium : v3.4.0
>>> Geckodriver: 0.23.0 ( 2018-10-04)
>>> Apache Nutch: 1.x
>>>
>>> Thanks & Regards
>>> Venkata MR
>>> +91 98455 77125
>>>
>>> -----Original Message-----
>>> From: Lewis John McGibbney <le...@apache.org>
>>> Sent: 09 December 2018 02:11
>>> To: user@nutch.apache.org
>>> Subject: Re: Apache Nutch 2.3.1 not able to fetch content rendered by ajax
>>>
>>> Hi Venkata,
>>> This functionality is not available in 2.X at the moment.
>>> The functionality is available in the 1.x primary branch. You can 
>>> learn about the implementation at both
>>>
>>> https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium and
>>>
>>> https://github.com/apache/nutch/tree/master/src/plugin/protocol-interactiveselenium
>>>
>>> Lewis
>>>
>>> On 2018/12/07 07:03:30, Venkata MR <Ve...@hcl.com> wrote: 
>>>> Hi,
>>>>
>>>> I was trying to fetch content rendered by an Ajax call using Apache Nutch 2.3.1.
>>>> It seems that it cannot get the actual rendered content; it only gets the view-source page (as part of the protocol-js plugin).
>>>> Has anyone been able to fetch rendered content from an Ajax call using Nutch 2.3.1, or does anyone have suggestions?
>>>>
>>>> Thanks & Regards
>>>> Venkata MR
>>>> +91 98455 77125
>>>>
>>>>
>>
> 


RE: Apache Nutch 2.3.1 not able to fetch content rendered by ajax

Posted by Venkata MR <Ve...@hcl.com>.
Hi All,

Any input here would be really appreciated. Thanks again.

Thanks & Regards
Venkata MR
+91 98455 77125



RE: Apache Nutch 2.3.1 not able to fetch content rendered by ajax

Posted by Venkata MR <Ve...@hcl.com>.
+user@nutch.apache.org

Thanks & Regards
Venkata MR
+91 98455 77125


-----Original Message-----
From: Venkata MR 
Sent: 18 December 2018 16:05
To: 'Sebastian Nagel' <wa...@googlemail.com>
Subject: RE: Apache Nutch 2.3.1 not able to fetch content rendered by ajax

Hi Sebastian,

I switched to Selenium v2.48.2 and Firefox 31.4.0 as specified, but I get the same ClassCastException.
Please find the log details below.

Caused by: org.openqa.selenium.WebDriverException: java.lang.ClassCastException: org.apache.xerces.jaxp.DocumentBuilderFactoryImpl cannot be cast to javax.xml.parsers.DocumentBuilderFactory
Build info: version: '2.48.2', revision: '41bccdd10cf2c0560f637404c2d96164b67d9d67', time: '2015-10-09 13:08:06'
System info: host: '24labs', ip: '10.0.10.24', os.name: 'Linux', os.arch: 'amd64', os.version: '3.10.0-327.13.1.el7.x86_64', java.version: '1.8.0_191'
Driver info: driver.version: FirefoxDriver
	at org.openqa.selenium.firefox.internal.FileExtension.readIdFromInstallRdf(FileExtension.java:142)
	at org.openqa.selenium.firefox.internal.FileExtension.writeTo(FileExtension.java:61)
	at org.openqa.selenium.firefox.internal.ClasspathExtension.writeTo(ClasspathExtension.java:64)
	at org.openqa.selenium.firefox.FirefoxProfile.installExtensions(FirefoxProfile.java:443)
	at org.openqa.selenium.firefox.FirefoxProfile.layoutOnDisk(FirefoxProfile.java:421)
	at org.openqa.selenium.firefox.internal.NewProfileExtensionConnection.start(NewProfileExtensionConnection.java:95)
	... 12 more
Caused by: java.lang.ClassCastException: org.apache.xerces.jaxp.DocumentBuilderFactoryImpl cannot be cast to javax.xml.parsers.DocumentBuilderFactory
	at javax.xml.parsers.DocumentBuilderFactory.newInstance(Unknown Source)
	at org.openqa.selenium.firefox.internal.FileExtension.readIdFromInstallRdf(FileExtension.java:95)
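
A note on the trace above: a ClassCastException of this shape usually means that two copies of javax.xml.parsers.DocumentBuilderFactory are visible through different class loaders, for example an xml-apis or xerces jar on a plugin classpath in addition to the JDK's own copy. That cause is an assumption here, not something confirmed in this thread. The following minimal Java sketch (the class name XmlFactoryDiagnostics and its helper methods are hypothetical) prints which loader and jar a class comes from, so that the API class and the Xerces implementation can be compared:

```java
import java.security.CodeSource;
import javax.xml.parsers.DocumentBuilderFactory;

public class XmlFactoryDiagnostics {

    /** Describes a class by name, defining class loader and code source (jar or directory). */
    public static String describe(Class<?> c) {
        ClassLoader loader = c.getClassLoader(); // null means the JDK bootstrap loader
        CodeSource source = c.getProtectionDomain().getCodeSource();
        return c.getName()
            + " | loader=" + (loader == null ? "bootstrap" : loader.toString())
            + " | source=" + (source == null ? "JDK runtime" : String.valueOf(source.getLocation()));
    }

    /** True if impl can be cast to the DocumentBuilderFactory class visible to this code. */
    public static boolean isCompatible(Class<?> impl) {
        return DocumentBuilderFactory.class.isAssignableFrom(impl);
    }

    public static void main(String[] args) {
        System.out.println(describe(DocumentBuilderFactory.class));
        // Implementation class name taken from the exception message in the trace.
        String implName = "org.apache.xerces.jaxp.DocumentBuilderFactoryImpl";
        try {
            ClassLoader ctx = Thread.currentThread().getContextClassLoader();
            // Load without initializing; we only want to inspect where it comes from.
            Class<?> impl = Class.forName(implName, false, ctx);
            System.out.println(describe(impl));
            System.out.println(describe(impl.getSuperclass()));
            System.out.println("castable to visible API class: " + isCompatible(impl));
        } catch (ClassNotFoundException e) {
            System.out.println(implName + " is not visible from the context class loader");
        }
    }
}
```

If the superclass of the implementation is reported with a different loader or source than the first line, the two copies clash, and removing the extra XML API jar from the classpath in question is the usual remedy. The check has to run with the same classpath and class loader as the failing code to be meaningful.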

Thanks & Regards
Venkata MR
+91 98455 77125

-----Original Message-----
From: Sebastian Nagel <wa...@googlemail.com>
Sent: 17 December 2018 19:53
To: Venkata MR <Ve...@hcl.com>
Subject: Re: Apache Nutch 2.3.1 not able to fetch content rendered by ajax

Hi,

What happens if you use the same version of Selenium as Nutch 1.15 does (2.48.2)?
Or at least a "close" version?

Alternatively, you can try to upgrade the Selenium version in Nutch, but that's not trivial and requires changes in multiple files.

Best,
Sebastian




RE: Apache Nutch 2.3.1 not able to fetch content rendered by ajax

Posted by Venkata MR <Ve...@hcl.com>.
Hi Sebastian,

Thanks, it works after removing "protocol-httpclient", but there are now some compatibility issues between Nutch 1.x, Selenium v3.4.0 and Geckodriver 0.23.0.
Here is the exception:
Caused by: java.lang.ClassCastException: org.apache.xerces.jaxp.DocumentBuilderFactoryImpl cannot be cast to javax.xml.parsers.DocumentBuilderFactory
	at javax.xml.parsers.DocumentBuilderFactory.newInstance(Unknown Source)
	at org.openqa.selenium.firefox.internal.FileExtension.readIdFromInstallRdf(FileExtension.java:95)

Thanks & Regards
Venkata MR
+91 98455 77125




Re: Apache Nutch 2.3.1 not able to fetch content rendered by ajax

Posted by Sebastian Nagel <wa...@googlemail.com.INVALID>.
Hi,

> protocol-httpclient (as the websites are with https).

With Nutch 1.15, protocol-selenium supports https. If protocol-httpclient
is also active, it may be used instead of protocol-selenium. There is
no need to activate it; the description in nutch-default.xml needs to
be fixed, see https://issues.apache.org/jira/browse/NUTCH-2678

Note that protocol-interactiveselenium will support https in 1.16.
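
To check which protocol plugins a given plugin.includes value activates, a small stand-alone Java sketch like the one below can help. It assumes that the value is a regular expression matched against the whole plugin id. The class name PluginIncludesCheck, the method name and the sample values are made up for illustration and are not part of Nutch:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class PluginIncludesCheck {

    /** Returns the protocol-* plugin ids that fully match the plugin.includes regex. */
    public static List<String> activeProtocolPlugins(String pluginIncludes, List<String> installedIds) {
        Pattern includes = Pattern.compile(pluginIncludes);
        List<String> active = new ArrayList<>();
        for (String id : installedIds) {
            // Assumption: a plugin is activated only if its whole id matches the regex.
            if (id.startsWith("protocol-") && includes.matcher(id).matches()) {
                active.add(id);
            }
        }
        return active;
    }

    public static void main(String[] args) {
        // Sample plugin ids; a real installation has more.
        List<String> installed = Arrays.asList(
            "protocol-http", "protocol-httpclient", "protocol-selenium",
            "protocol-interactiveselenium", "parse-html", "parse-tika", "indexer-solr");

        // Several protocol plugins active: any of them may end up handling https.
        String mixed = "protocol-(httpclient|selenium|interactiveselenium)|parse-(html|tika)|indexer-solr";
        System.out.println(activeProtocolPlugins(mixed, installed));

        // Only protocol-selenium active, as suggested above for Nutch 1.15.
        String seleniumOnly = "protocol-selenium|parse-(html|tika)|indexer-solr";
        System.out.println(activeProtocolPlugins(seleniumOnly, installed));
    }
}
```

With the first value three protocol plugins are listed, with the second only protocol-selenium remains, which is the situation where every fetch is certain to go through Selenium.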

Best,
Sebastian



RE: Apache Nutch 2.3.1 not able to fetch content rendered by ajax

Posted by Venkata MR <Ve...@hcl.com>.
Hi Lewis,

Thanks for your email. Following the links you provided, I tried all the options without success before coming back to you.

I am trying to crawl websites whose content is rendered at runtime, and to extract and parse that content.
I downloaded the Nutch version referred to in the email below and added protocol-interactiveselenium and protocol-selenium along with protocol-httpclient (as the websites use https).
Selenium with Firefox is configured and working properly, and it is running during the crawl.

Still, I am not able to get the rendered content. I have attached my nutch-site.xml for reference, in case some configuration is missing.

I am wondering whether the issue is with the Tika parsers not being able to parse the extracted runtime-rendered content, or with Solr (I am using Apache Solr to index the parsed data) not having an indexed field to hold the data (the schema has content, url, title and id).

Any input to resolve the issue would be really appreciated.

Environment: CentOS-7
Firefox: 60.3 oesr (64 bit)
Selenium : v3.4.0
Geckodriver: 0.23.0 ( 2018-10-04)
Apache Nutch: 1.x

Thanks & Regards
Venkata MR
+91 98455 77125


Re: Apache Nutch 2.3.1 not able to fetch content rendered by ajax

Posted by Lewis John McGibbney <le...@apache.org>.
Hi Venkata,
This functionality is not available in 2.X at the moment.
The functionality is available in the 1.x primary branch. You can learn about the implementation at both

https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium, and

https://github.com/apache/nutch/tree/master/src/plugin/protocol-interactiveselenium

Lewis
