Posted to user@nutch.apache.org by Michael Chen <yi...@u.northwestern.edu> on 2017/08/06 00:29:03 UTC

Best practice for Nutch 2.x on AWS?

Hi,

I'm trying to set up Nutch 2.x on AWS EC2 clusters, and I was wondering 
if anyone knows of a "best setup" for it. The Hadoop and HBase versions 
in current EMR releases don't seem to work with Nutch 2.x. Does it 
sound like a good idea to manually set up Hadoop clusters and then run 
Nutch on them? Will I be able to use S3 as data storage so that I can keep 
the data when the EC2 instances stop?

Any suggestions would be very helpful!

Thanks in advance,

Michael

Re: Best practice for Nutch 2.x on AWS?

Posted by Divjot Singh <di...@gmail.com>.

Glad you found the solution.

On Thu, Aug 17, 2017 at 12:46 PM, Michael Chen <
yiningchen2020@u.northwestern.edu> wrote:

> Fixed the problem... It was most likely a table mismatch: it is
> necessary to specify -crawlId during indexing. Also, the "Total 0 document
> is added" message is probably a bug... The MR input/output record counts
> are more reliable. :)

Re: Best practice for Nutch 2.x on AWS?

Posted by Michael Chen <yi...@u.northwestern.edu>.

Fixed the problem... It was most likely a table mismatch: it is 
necessary to specify -crawlId during indexing. Also, the "Total 0 
document is added" message is probably a bug... The MR input/output 
record counts are more reliable. :)
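
A minimal sketch of why a missing -crawlId can leave the indexing job with zero input records. It assumes that Nutch 2.x prefixes the storage table name with the crawl ID and that the default schema name is "webpage"; the function name and the crawl ID "mycrawl" are made up for illustration.

```python
def webpage_table(crawl_id: str = "", schema: str = "webpage") -> str:
    """Return the table name a Nutch 2.x job is assumed to read from."""
    # Assumption: a non-empty crawl ID is prepended as "<crawlId>_".
    return f"{crawl_id}_{schema}" if crawl_id else schema


crawled_into = webpage_table("mycrawl")  # table the crawl wrote to
indexed_from = webpage_table()           # table an index job without -crawlId reads
print(crawled_into, indexed_from, crawled_into == indexed_from)
```

If the two names differ, the index job scans a table the crawl never wrote to, which would be consistent with MR reporting 0 input and output records.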


On 08/16/2017 11:30 PM, Divjot Singh wrote:
> Hi Michael
>
> I haven't used Solr for indexing, so I won't be able to help you with 
> that one.
>
> Divjot

Re: Best practice for Nutch 2.x on AWS?

Posted by Michael Chen <yi...@u.northwestern.edu>.

Hi Divjot,

You're right. I checked the webapp and rootdir is already defined by 
"hbase-site.xml" outside of Nutch, probably by Cloudera, though it is 
strange that Cloudera didn't take care of the quorum too...
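
A minimal sketch of the Nutch-side hbase-site.xml that this exchange points to: only the ZooKeeper quorum, with no HDFS URI. The property names hbase.zookeeper.quorum and hbase.zookeeper.property.clientPort are standard HBase settings; the host names, the port default, and the helper function are placeholders for illustration, not values from this cluster.

```python
import xml.etree.ElementTree as ET


def hbase_site(quorum_hosts, client_port=2181):
    """Build a minimal hbase-site.xml that only points at a ZooKeeper quorum."""
    root = ET.Element("configuration")
    props = {
        "hbase.zookeeper.quorum": ",".join(quorum_hosts),
        "hbase.zookeeper.property.clientPort": str(client_port),
    }
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    if hasattr(ET, "indent"):  # pretty-print where available (Python 3.9+)
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Hypothetical host names; replace with the cluster's ZooKeeper nodes.
print(hbase_site(["zk1.example.internal", "zk2.example.internal"]))
```

The resulting file would go into Nutch's conf/ directory before "ant runtime", so that it ends up in the job JAR.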

I just set up Solr 6.6.0 for lack of a good guide for the Cloudera Solr 
4.10.3. It's running in standalone mode on HDFS. Everything seems good, but 
IndexJob does not index properly. The HBase data is good, so I assume it's 
only the indexing that went wrong.

Solr-mapping is reflected properly in stdout. However, I noticed MR 
reported 0 input and output records...

Would you have an idea of what might have gone wrong?

Thanks a bunch!

Michael


On 08/16/2017 11:12 PM, Divjot Singh wrote:
> Hi
>
> You just need to add the ZooKeeper quorum of the HBase server you are 
> connecting to in hbase-site.xml; there is no need for the HDFS URI. If 
> your cluster is configured correctly and you are able to create tables 
> in HBase, then Nutch should work fine once it gets the HBase server URL 
> from hbase-site.xml.
>
> Thanks
> Divjot

Re: Best practice for Nutch 2.x on AWS?

Posted by Michael Chen <yi...@u.northwestern.edu>.

Hi Divjot,

Thanks for the reply! I checked the HBase tutorial but am still a bit 
confused. When I set up the standalone build, hbase-site.xml resided in 
HBase's conf/ directory. But it seems that with the fully distributed + 
Nutch deployment, I need to specify configurations in Nutch's 
hbase-site.xml, which gets deployed into the job JAR.

My question is: what should I configure in Nutch's hbase-site.xml? Do I 
also need to include the HDFS URI? Does the Cloudera HBase build override 
any default settings (as it should...)?

Thank you!
Michael


On 08/16/2017 09:14 PM, Divjot Singh wrote:
> Hi Michael
>
> You can use the following tutorial:
> https://wiki.apache.org/nutch/Nutch2Tutorial
>
> Also update hbase-site.xml in the conf folder to add the ZooKeeper 
> quorum if your HBase is on another cluster.
>
> Thanks
> Divjot

Re: Best practice for Nutch 2.x on AWS?

Posted by Michael Chen <yi...@u.northwestern.edu>.

Hi Divjot,

I have a cluster running with Cloudera Manager (Hadoop, HBase, Solr, 
ZooKeeper). Do you know if I need to modify the hbase-site.xml before 
"ant runtime"? What configuration did you have to do manually for Nutch 
(and others)?

Thanks in advance!

Michael


On 08/14/2017 07:29 PM, Divjot Singh wrote:
> Hi Michael
>
> I am using the latest Cloudera release and it's working fine. You can use
> any Linux distro you are comfortable with. CentOS is mostly used for server
> deployments and it's quite stable.
>
> Thanks
> Divjot

Re: Best practice for Nutch 2.x on AWS?

Posted by Michael Chen <yi...@u.northwestern.edu>.

The problem was caused by mismatched JDK versions. I had to use 1.8 to 
compile Ant and Nutch, but Cloudera Manager defaults to JDK 1.6 or 1.7. 
The solution is to override JAVA_HOME in the hosts configuration and 
restart the cluster...
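
A minimal sketch of how the "major.minor version 52.0" in the error maps to a JDK release. The class-file header layout (magic number, minor version, major version) and the version numbers 50, 51 and 52 come from the JVM class-file format; the function name and the synthetic header bytes are for illustration only.

```python
import struct

# Class-file major versions for the JDKs mentioned in this thread.
MAJOR_TO_JDK = {50: "1.6", 51: "1.7", 52: "1.8"}


def class_major_version(header: bytes) -> int:
    """Return the major version from the first 8 bytes of a .class file."""
    magic, _minor, major = struct.unpack(">IHH", header[:8])
    if magic != 0xCAFEBABE:
        raise ValueError("not a Java class file")
    return major


# Synthetic header standing in for a class compiled by JDK 1.8.
sample = bytes.fromhex("cafebabe00000034")
major = class_major_version(sample)
print(major, MAJOR_TO_JDK.get(major, "unknown"))
```

A class with major version 52 cannot be loaded by a 1.6 or 1.7 runtime, which is why the cluster's JAVA_HOME has to point at a 1.8 JDK. For a real file, the first 8 bytes can be read with open(path, "rb").read(8).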


On 08/15/2017 11:51 PM, Michael Chen wrote:
> Hi,
>
> I just figured out how to deploy jobs to Hadoop with the JAR file... 
> But I ran into an error during the first injection step:
>
> java.lang.UnsupportedClassVersionError: 
> org/apache/gora/mapreduce/GoraOutputFormat: Unsupported major.minor 
> version 52.0
>
> I'm using Nutch 2.x, which is configured for Gora 0.7, and I specified 
> hbase-common as 1.2.3 to be consistent with the -client and -protocol 
> libraries.
>
> Cloudera 5.12 (the latest version) runs Hadoop 2.6.0, HBase 1.2.0, 
> ZooKeeper 3.4.5, Solr 4.10.3.
>
> Does anyone know what this error is caused by?
>
> Thanks!
>
> Michael

Re: Best practice for Nutch 2.x on AWS?

Posted by Michael Chen <yi...@u.northwestern.edu>.

Hi,

I just figured out how to deploy jobs to Hadoop with the JAR file... But 
I ran into an error during the first injection step:

java.lang.UnsupportedClassVersionError: 
org/apache/gora/mapreduce/GoraOutputFormat: Unsupported major.minor 
version 52.0

I'm using Nutch 2.x, which is configured for Gora 0.7, and I specified 
hbase-common as 1.2.3 to be consistent with the -client and -protocol 
libraries.

Cloudera 5.12 (the latest version) runs Hadoop 2.6.0, HBase 1.2.0, 
ZooKeeper 3.4.5, Solr 4.10.3.

Does anyone know what this error is caused by?

Thanks!

Michael


On 08/14/2017 07:29 PM, Divjot Singh wrote:
> Hi Michael
>
> I am using the latest Cloudera release and it's working fine. You can use
> any Linux distro you are comfortable with. CentOS is mostly used for server
> deployments and it's quite stable.
>
> Thanks
> Divjot

RE: Best practice for Nutch 2.x on AWS?

Posted by Divjot Singh <di...@gmail.com>.

Hi Michael

I am using the latest Cloudera release and it's working fine. You can use
any Linux distro you are comfortable with. CentOS is mostly used for server
deployments and it's quite stable.

Thanks
Divjot

On 15-Aug-2017 2:09 AM, "Michael Chen" <yi...@u.northwestern.edu>
wrote:

Hi Divjot,

Thanks for the information! I was wondering if there is a specific version
of Cloudera Manager and CDH that works best with Nutch 2.x (HBase 1.2.3,
Hadoop 2.5.2)?

Also, is there a specific reason to use CentOS 7 instead of Amazon Linux or
Red Hat?

I’ll try to get started with the setup. Thanks!

Michael

From: Divjot Singh
Sent: Tuesday, August 8, 2017 04:06
To: user@nutch.apache.org
Subject: Re: Best practice for Nutch 2.x on AWS?

Hi

We have HBase set up on an AWS cluster running CentOS 7. The setup was
done using Cloudera Manager. Nutch can then be run in standalone mode or
over YARN by running the deployment jar in the deploy folder.

I have not tested with S3 directly, but you can always back up the HBase
data to S3 daily.

Hope this helps. Let me know if you have further queries.

Divjot

On Sun, Aug 6, 2017 at 5:59 AM, Michael Chen <
yiningchen2020@u.northwestern.edu> wrote:

> Hi,
>
> I'm trying to set up Nutch 2.x on AWS EC2 clusters, and I was wondering if
> anyone know of a "best set up" for it. The hadoop and hbase version in
> current EMR releases doesn't seem to work with Nutch 2.x. Does it sound
> like a good idea to manually set up Hadoop clusters and then run Nutch on
> it? Will I be able to use S3 as data storage so that I can keep the data
> when EC2 instance stops?
>
> Any suggestions would be very much helpful!
>
> Thanks in advance,
>
> Michael
>
>

RE: Best practice for Nutch 2.x on AWS?

Posted by Michael Chen <yi...@u.northwestern.edu>.

Hi Divjot,

Thanks for the information! I was wondering if there is a specific version of Cloudera Manager and CDH that works best with Nutch 2.x (HBase 1.2.3, Hadoop 2.5.2)?

Also, is there a specific reason to use CentOS 7 instead of Amazon Linux or Red Hat?

I’ll try to get started with the setup. Thanks!

Michael

From: Divjot Singh
Sent: Tuesday, August 8, 2017 04:06
To: user@nutch.apache.org
Subject: Re: Best practice for Nutch 2.x on AWS?

Hi

We have HBase set up on an AWS cluster running CentOS 7. The setup was
done using Cloudera Manager. Nutch can then be run in standalone mode or
over YARN by running the deployment jar in the deploy folder.

I have not tested with S3 directly, but you can always back up the HBase
data to S3 daily.

Hope this helps. Let me know if you have further queries.

Divjot

On Sun, Aug 6, 2017 at 5:59 AM, Michael Chen <
yiningchen2020@u.northwestern.edu> wrote:

> Hi,
>
> I'm trying to set up Nutch 2.x on AWS EC2 clusters, and I was wondering if
> anyone know of a "best set up" for it. The hadoop and hbase version in
> current EMR releases doesn't seem to work with Nutch 2.x. Does it sound
> like a good idea to manually set up Hadoop clusters and then run Nutch on
> it? Will I be able to use S3 as data storage so that I can keep the data
> when EC2 instance stops?
>
> Any suggestions would be very much helpful!
>
> Thanks in advance,
>
> Michael
>
>

Re: Best practice for Nutch 2.x on AWS?

Posted by Divjot Singh <di...@gmail.com>.

Hi

We have HBase set up on an AWS cluster running CentOS 7. The setup was
done using Cloudera Manager. Nutch can then be run in standalone mode or
over YARN by running the deployment jar in the deploy folder.

I have not tested with S3 directly, but you can always back up the HBase
data to S3 daily.
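
One possible way to do such a backup is an HBase snapshot exported with the
ExportSnapshot tool. The following is only a sketch under assumptions: the
function name, the bucket, and the table name are hypothetical, the cluster is
assumed to have S3A credentials configured, and this path has not been tested
with Nutch 2.x. Commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

# run CMD...: print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# backup_table TABLE DEST [SNAPSHOT]
# Take a snapshot of TABLE and export it to DEST (for example an s3a:// URL).
# SNAPSHOT defaults to TABLE plus today's date.
backup_table() {
    table=$1
    dest=$2
    snap=${3:-${table}_$(date +%Y%m%d)}

    # Step 1: create the snapshot through the HBase shell.
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ hbase shell: snapshot '$table', '$snap'"
    else
        echo "snapshot '$table', '$snap'" | hbase shell
    fi

    # Step 2: copy the snapshot files to the destination file system.
    run hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
        -snapshot "$snap" -copy-to "$dest"
}

# Example (hypothetical table and bucket names):
#   backup_table demo_webpage s3a://my-bucket/hbase-backup
```

Called from a daily cron job with DRY_RUN=0, this would leave one dated
snapshot per day in the bucket. The table name depends on the crawl ID used by
Nutch, so check the actual table names in the HBase shell first.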

Hope this helps. Let me know if you have further queries.

Divjot

On Sun, Aug 6, 2017 at 5:59 AM, Michael Chen <
yiningchen2020@u.northwestern.edu> wrote:

> Hi,
>
> I'm trying to set up Nutch 2.x on AWS EC2 clusters, and I was wondering if
> anyone know of a "best set up" for it. The hadoop and hbase version in
> current EMR releases doesn't seem to work with Nutch 2.x. Does it sound
> like a good idea to manually set up Hadoop clusters and then run Nutch on
> it? Will I be able to use S3 as data storage so that I can keep the data
> when EC2 instance stops?
>
> Any suggestions would be very much helpful!
>
> Thanks in advance,
>
> Michael
>
>

Re: Best practice for Nutch 2.x on AWS?

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi Michael,

Except for HBase and Solr, nothing has to be deployed or installed.

On the Hadoop master node:
- build Nutch via
   ant runtime
- point NUTCH_HOME to the directory where the job file is placed
   export NUTCH_HOME=.../runtime/deploy
- run
   $NUTCH_HOME/bin/nutch ....
  The Hadoop job is then launched via "hadoop jar $NUTCH_HOME/*.job ...".
  Of course, the "hadoop" executable must be on your PATH, but that should
  be the case on the master node.
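
The steps above can be sketched as a small shell script. This is only a sketch
under assumptions: NUTCH_SRC (the location of the Nutch source checkout), the
function name, the seed directory, and the crawl ID are hypothetical, and the
inject step is just one example of a bin/nutch command. Commands are only
printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Hypothetical location of the Nutch 2.x source checkout; adjust as needed.
NUTCH_SRC="${NUTCH_SRC:-$HOME/nutch}"
# Dry-run by default: commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

# run CMD...: print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# build_and_inject SEED_DIR CRAWL_ID
# Build Nutch, point NUTCH_HOME at runtime/deploy, and inject the seed URLs.
build_and_inject() {
    seeds=$1
    crawl_id=$2

    # Build the runtime (creates runtime/local and runtime/deploy).
    run ant -f "$NUTCH_SRC/build.xml" runtime

    # Point NUTCH_HOME to the directory where the .job file is placed.
    NUTCH_HOME="$NUTCH_SRC/runtime/deploy"
    export NUTCH_HOME

    # In deploy mode bin/nutch hands the .job file to "hadoop jar",
    # so the "hadoop" executable must be on the PATH.
    run "$NUTCH_HOME/bin/nutch" inject "$seeds" -crawlId "$crawl_id"
}

# Example (hypothetical seed directory and crawl ID):
#   build_and_inject urls demo
```

Everything runs on the master node only. In deploy mode the seed directory is
read from the cluster's default file system, so it is assumed to have been
copied to HDFS beforehand.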

> Is there a good reference for Nutch2 deployment?

I do not know of one, but I haven't searched for one either.
The tutorials in the Nutch wiki need an update, esp. for distributed mode in combination with 2.x.

Best,
Sebastian

On 08/16/2017 02:38 AM, Michael Chen wrote:
> Hi Sebastian,
> 
> Thanks for the reply. I do have to use 2.x for some functionality, so I guess I might have to
> stick to HDFS for now...
> 
> I set up a 5-node Hadoop cluster with HBase and Solr services using Cloudera Manager (and it still
> took me a while...), and I've installed Nutch on all nodes. I'm a bit confused about how to deploy
> the job to the cluster. Do I only interact with the master node, setting configuration and seeds,
> and let Hadoop manage the cluster?
> 
> Is there a good reference for Nutch2 deployment?
> 
> Thank you!
> 
> Michael
> 
> 
> On 08/15/2017 02:49 AM, Sebastian Nagel wrote:
>> Hi Michael,
>>
>>> Will I be able to use S3 as data storage so that I can keep the data when EC2 instance stops?
>> I don't know whether this is easily possible for 2.x and HBase. But Nutch 1.x can read and write
>> data directly from S3 (via the S3A file system [1]). Only operations on the CrawlDb need a small
>> modification: an update moves the "current" folder to "old" and the temp folder to "current", and
>> S3 does not support moves. But this is easily worked around by copying between S3 and HDFS.
>>
>> Best,
>> Sebastian
>>
>> [1] https://wiki.apache.org/hadoop/AmazonS3
>>
>> On 08/06/2017 02:29 AM, Michael Chen wrote:
>>> Hi,
>>>
>>> I'm trying to set up Nutch 2.x on AWS EC2 clusters, and I was wondering if anyone know of a "best
>>> set up" for it. The hadoop and hbase version in current EMR releases doesn't seem to work with Nutch
>>> 2.x. Does it sound like a good idea to manually set up Hadoop clusters and then run Nutch on it?
>>> Will I be able to use S3 as data storage so that I can keep the data when EC2 instance stops?
>>>
>>> Any suggestions would be very much helpful!
>>>
>>> Thanks in advance,
>>>
>>> Michael
>>>
>

Re: Best practice for Nutch 2.x on AWS?

Posted by Michael Chen <yi...@u.northwestern.edu>.

Hi Sebastian,

Thanks for the reply. I do have to use 2.x for some functionality, so
I guess I might have to stick to HDFS for now...

I set up a 5-node Hadoop cluster with HBase and Solr services using
Cloudera Manager (and it still took me a while...), and I've installed
Nutch on all nodes. I'm a bit confused about how to deploy the job to the
cluster. Do I only interact with the master node, setting configuration
and seeds, and let Hadoop manage the cluster?

Is there a good reference for Nutch2 deployment?

Thank you!

Michael

On 08/15/2017 02:49 AM, Sebastian Nagel wrote:
> Hi Michael,
>
>> Will I be able to use S3 as data storage so that I can keep the data when EC2 instance stops?
> I don't know whether this is easily possible for 2.x and HBase. But Nutch 1.x can read and write
> data directly from S3 (via the S3A file system [1]). Only operations on the CrawlDb need a small
> modification: an update moves the "current" folder to "old" and the temp folder to "current", and
> S3 does not support moves. But this is easily worked around by copying between S3 and HDFS.
>
> Best,
> Sebastian
>
> [1] https://wiki.apache.org/hadoop/AmazonS3
>
> On 08/06/2017 02:29 AM, Michael Chen wrote:
>> Hi,
>>
>> I'm trying to set up Nutch 2.x on AWS EC2 clusters, and I was wondering if anyone know of a "best
>> set up" for it. The hadoop and hbase version in current EMR releases doesn't seem to work with Nutch
>> 2.x. Does it sound like a good idea to manually set up Hadoop clusters and then run Nutch on it?
>> Will I be able to use S3 as data storage so that I can keep the data when EC2 instance stops?
>>
>> Any suggestions would be very much helpful!
>>
>> Thanks in advance,
>>
>> Michael
>>

Re: Best practice for Nutch 2.x on AWS?

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi Michael,

> Will I be able to use S3 as data storage so that I can keep the data when EC2 instance stops?

I don't know whether this is easily possible for 2.x and HBase. But Nutch 1.x can read and write
data directly from S3 (via the S3A file system [1]). Only operations on the CrawlDb need a small
modification: an update moves the "current" folder to "old" and the temp folder to "current", and
S3 does not support moves. But this is easily worked around by copying between S3 and HDFS.
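
The copy-based workaround could look roughly like the following for Nutch 1.x.
This is only a sketch under assumptions: the function names, the bucket, and
the paths are hypothetical, and the cluster is assumed to have S3A credentials
configured. The CrawlDb is copied to HDFS, updated there (where moves work),
and copied back. Commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

# run CMD...: print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# sync_dir SRC DST: make DST a copy of the contents of SRC.
# -overwrite always recopies files, -delete removes files missing in SRC.
sync_dir() {
    run hadoop distcp -overwrite -delete "$1" "$2"
}

# update_crawldb_via_hdfs S3_BASE HDFS_BASE SEGMENT
# Update a CrawlDb that is stored on S3 by doing the update on HDFS.
update_crawldb_via_hdfs() {
    s3_base=$1
    hdfs_base=$2
    segment=$3

    # 1. Copy the CrawlDb from S3 to HDFS.
    sync_dir "$s3_base/crawldb" "$hdfs_base/crawldb"

    # 2. Update on HDFS; the segment can still be read from S3.
    run "${NUTCH_HOME:-.}/bin/nutch" updatedb "$hdfs_base/crawldb" "$segment"

    # 3. Copy the updated CrawlDb back to S3.
    sync_dir "$hdfs_base/crawldb" "$s3_base/crawldb"
}

# Example (hypothetical bucket and segment name):
#   update_crawldb_via_hdfs s3a://my-bucket/crawl hdfs:///crawl \
#       s3a://my-bucket/crawl/segments/20170815000000
```

Only the CrawlDb takes the detour through HDFS; segments and other data can
stay on S3.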

Best,
Sebastian

[1] https://wiki.apache.org/hadoop/AmazonS3

On 08/06/2017 02:29 AM, Michael Chen wrote:
> Hi,
> 
> I'm trying to set up Nutch 2.x on AWS EC2 clusters, and I was wondering if anyone know of a "best
> set up" for it. The hadoop and hbase version in current EMR releases doesn't seem to work with Nutch
> 2.x. Does it sound like a good idea to manually set up Hadoop clusters and then run Nutch on it?
> Will I be able to use S3 as data storage so that I can keep the data when EC2 instance stops?
> 
> Any suggestions would be very much helpful!
> 
> Thanks in advance,
> 
> Michael
>