Posted to user@nutch.apache.org by Amit Sela <am...@infolinks.com> on 2013/02/14 15:24:22 UTC

Nutch 2.1 over Hadoop 1.0.3 and HBase 0.94.2

Hi everyone,

I'm new to Nutch and I would appreciate some advice...

I want to use Nutch to crawl URLs and categorize them.

I already have a running Hadoop cluster with Hadoop 1.0.3 and HBase 0.94.2,
and I saw that Nutch 2.1 with Gora supports HBase as a backend.

I would like to start by running a basic crawler with this installation on
a standalone machine and, after I get the hang of it, deploy it on the
cluster or set it up on another cluster.

Does anyone have good advice for installation / setup?

Has anyone used Nutch for website categorization?

Is version 2.1 compatible with HBase 0.94.x (or, rather, is Gora compatible)?

Any help would be greatly appreciated.

Thanks,

Amit.

need help for web categorization

Posted by Divyang <di...@yahoo.com>.
Hello Amit,

I saw from your post that you are doing web categorization, using
Apache Nutch and Gora with HBase.

Could you please explain how you are using Nutch for web categorization?
I am doing the same thing.

Perhaps we can share ideas about how to implement it.

Regards,
Divyang Shah




Re: Nutch 2.1 over Hadoop 1.0.3 and HBase 0.94.2

Posted by Lewis John Mcgibbney <le...@gmail.com>.
It takes little time to get up and running with a gora-hbase backed Nutch
deployment.
If you are happy compiling the code from source then this is the way to go.
1.6 is stable, whereas 2.x is shipped as source only. This is because you
will inevitably wish to recompile the .job files based on changing storage
conditions etc.
There are plans to improve gora-hbase after Gora 0.3 is released. Although
this will not be immediate, it will probably happen in the next development
drive. We are always looking for contributions.
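
As a rough illustration (not taken from this thread), the storage setting that a gora-hbase backed Nutch 2.x build usually needs in conf/nutch-site.xml looks something like the sketch below. The property name and store class are assumptions based on the Nutch 2.x tutorial; please check them against nutch-default.xml in your own checkout.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml (sketch): select HBase as the Gora data store. -->
<configuration>
  <property>
    <!-- Assumed property name; verify it in nutch-default.xml. -->
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
    <description>Default class for storing data</description>
  </property>
</configuration>
```

The gora-hbase dependency in ivy/ivy.xml typically has to be enabled as well before rebuilding, which is one reason the .job file gets recompiled whenever the storage backend changes.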
Lewis

On Sun, Feb 17, 2013 at 1:34 AM, Amit Sela <am...@infolinks.com> wrote:

> So what (stable) version of Nutch and which architecture would best fit my
> cluster ?
>
> Is there a quick (simplified) deployment if I already have a running
> cluster and I don't want to change it's existing data or configuration ?
>
> Thanks.
>
> On Fri, Feb 15, 2013 at 12:42 AM, Lewis John Mcgibbney <
> lewis.mcgibbney@gmail.com> wrote:
>
> > Hi Amit,
> >
> > On Thu, Feb 14, 2013 at 6:24 AM, Amit Sela <am...@infolinks.com> wrote:
> >
> > >
> > > I already have a running Hadoop cluster with Hadoop 1.0.3 and HBase
> > 0.94.2,
> > > and I saw that Nutch 2.1 with Gora supports HBase as backend.
> > >
> >
> > First thing's first. We cannot guarantee that Gora and subsequently Nutch
> > will work with the newer HBase 0.94.X branch.
> > You could try it out and get back to us, but the advice would be that it
> is
> > most likely incompatible.
> >
> >
> > > I would like to start by running a basic crawler with this
> installations
> > on
> > > a standalone machine and after I get the hang of it deploy it on the
> > > cluster / set up on another cluster.
> > >
> > > Anyone has a good advise for installation / setup ?
> > >
> >
> > http://wiki.apache.org/nutch/#Other_Tutorial.28s.29
> >
> >
> > >
> > > Anyone used Nutch for website categorization ?
> > >
> >
> > You can find some info on suggestions from this thread
> > http://www.mail-archive.com/user@nutch.apache.org/msg08066.html
> >
> >
> > >
> > >
> >
>



-- 
*Lewis*

Re: Nutch 2.1 over Hadoop 1.0.3 and HBase 0.94.2

Posted by Amit Sela <am...@infolinks.com>.
So which (stable) version of Nutch and which architecture would best fit my
cluster?

Is there a quick (simplified) deployment path if I already have a running
cluster and I don't want to change its existing data or configuration?

Thanks.

On Fri, Feb 15, 2013 at 12:42 AM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Hi Amit,
>
> On Thu, Feb 14, 2013 at 6:24 AM, Amit Sela <am...@infolinks.com> wrote:
>
> >
> > I already have a running Hadoop cluster with Hadoop 1.0.3 and HBase
> 0.94.2,
> > and I saw that Nutch 2.1 with Gora supports HBase as backend.
> >
>
> First things first. We cannot guarantee that Gora, and subsequently Nutch,
> will work with the newer HBase 0.94.X branch.
> You could try it out and get back to us, but the advice would be that it is
> most likely incompatible.
>
>
> > I would like to start by running a basic crawler with this installations
> on
> > a standalone machine and after I get the hang of it deploy it on the
> > cluster / set up on another cluster.
> >
> > Anyone has a good advise for installation / setup ?
> >
>
> http://wiki.apache.org/nutch/#Other_Tutorial.28s.29
>
>
> >
> > Anyone used Nutch for website categorization ?
> >
>
> You can find some info on suggestions from this thread
> http://www.mail-archive.com/user@nutch.apache.org/msg08066.html
>
>
> >
> >
>

Re: slf4j issue with nutch 2.x over hadoop 1.1.1

Posted by Lewis John Mcgibbney <le...@gmail.com>.
OK.

On Tue, Feb 19, 2013 at 12:53 PM, kaveh minooie <ka...@plutoz.com> wrote:

> Thanks Lewis
>
> it turned out that slf4j 1.6.1 is being pulled by both hbase 0.90.4
> and zookeeper 3.4.5:
>
> Required by
> Organisation          Name       Revision           In Configurations                  Asked Revision
> org.apache.nutch      nutch      working@localhost  default, test, master              1.4.3
> org.apache.hbase      hbase      0.90.4             compile, runtime                   1.6.1
> org.apache.zookeeper  zookeeper  3.4.5              default, compile, runtime, master  1.6.1
>
> Dependencies
> Module                  Revision  Status   Resolver  Default  Licenses                                  Size
> log4j by log4j          1.2.15    release  maven2    false    The Apache Software License, Version 2.0  0 kB
> log4j by log4j          1.2.16    release  maven2    false    The Apache Software License, Version 2.0  470 kB
> slf4j-api by org.slf4j  1.6.1     release  maven2    false                                              25 kB
>
>
> I think I am better off upgrading the slf4j across my hadoop cluster :)
>
>
> thanks,
>
>
> --
> Kaveh Minooie
>
> www.plutoz.com
>



-- 
*Lewis*

Re: slf4j issue with nutch 2.x over hadoop 1.1.1

Posted by kaveh minooie <ka...@plutoz.com>.
Thanks Lewis

it turned out that slf4j 1.6.1 is being pulled by both hbase 0.90.4
and zookeeper 3.4.5:

Required by
Organisation          Name       Revision           In Configurations                  Asked Revision
org.apache.nutch      nutch      working@localhost  default, test, master              1.4.3
org.apache.hbase      hbase      0.90.4             compile, runtime                   1.6.1
org.apache.zookeeper  zookeeper  3.4.5              default, compile, runtime, master  1.6.1

Dependencies
Module                  Revision  Status   Resolver  Default  Licenses                                  Size
log4j by log4j          1.2.15    release  maven2    false    The Apache Software License, Version 2.0  0 kB
log4j by log4j          1.2.16    release  maven2    false    The Apache Software License, Version 2.0  470 kB
slf4j-api by org.slf4j  1.6.1     release  maven2    false                                              25 kB


I think I am better off upgrading the slf4j across my hadoop cluster :)
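
The report hints at why editing the direct dependency had no effect: nutch asks for 1.4.3 while hbase and zookeeper ask for 1.6.1, and Ivy's default conflict manager normally keeps the latest revision. Where upgrading the cluster is not an option, one possible alternative is an Ivy override rule in ivy/ivy.xml. The fragment below is only a sketch under that assumption. It was not tested in this thread, and hbase 0.90.4 or zookeeper 3.4.5 may not actually run against slf4j 1.4.3.

```xml
<!-- Fragment of ivy/ivy.xml (hypothetical edit, untested in this thread). -->
<dependencies>
  <!-- ... existing dependency entries stay unchanged ... -->

  <!-- Force transitive requests for these modules to rev 1.4.3,
       whatever revision hbase or zookeeper ask for. -->
  <override org="org.slf4j" module="slf4j-api" rev="1.4.3"/>
  <override org="org.slf4j" module="slf4j-log4j12" rev="1.4.3"/>
</dependencies>
```

After a change like this it is worth cleaning the build and re-running the dependency report to confirm which slf4j revision ends up in build/lib.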


thanks,
	
On 02/18/2013 10:26 PM, Lewis John Mcgibbney wrote:
> Hi,
> So by the sounds of it the slf4j is being pulled transitively and you need
> to determine from where.
> You can use the ant report task which makes ivy generate a nice dependency
> report for you. You can then see which direct dependency includes slf4j. I
> know for a fact that all gora-core dependency will pull it.
>
> Once you've highlighted where it is coming from you can make the relevant
> exclusion ok.
>
> On Saturday, February 16, 2013, kaveh minooie <ka...@plutoz.com> wrote:
>> So when you say "prune the dependencies", I am not sure what you are
> talking about cause what I could think of is not working.  let me explain
> the situation again. nutch 2.x ivy file ( ivy/ivy.xml ) has this in it:
>>
>>   <dependencies>
>>      <dependency org="org.elasticsearch" name="elasticsearch" rev="0.19.4"
>>                  conf="*->default,sources"/>
>>      <dependency org="org.apache.solr" name="solr-solrj" rev="3.4.0"
>>        conf="*->default" />
>>      <dependency org="org.slf4j" name="slf4j-log4j12" rev="1.6.1"
>>        conf="*->master" />
>>
>> hadoop 1.1.1 ships with slf4j 1.4.3.  these 2 are not compatible. now, i
> rather not mess with my hadoop cluster so I tried to downgrade slf4j in
> nutch. I changed the above lines to :
>>
>>   <dependencies>
>> <!--
>>      <dependency org="org.elasticsearch" name="elasticsearch" rev="0.19.4"
>>                  conf="*->default,sources"/>
>>    -->
>>      <dependency org="org.apache.zookeeper" name="zookeeper" rev="3.4.5"
>>                  conf="*->default"/>
>>      <dependency org="org.apache.solr" name="solr-solrj" rev="3.6.2"
>>        conf="*->default" />
>>      <dependency org="org.slf4j" name="slf4j-log4j12" rev="1.4.3"
>>
>> as you can see I am upgrading the solr and zookeeper and removing the
> elasticsearch, and all of these changes work fine since I can see the
> appropriate files in the build/lib directory after ant is done. but it
> doesn't work for slf4j, and the files copied to build/lib ( and
> subsequently in my job file ) are :
>> kaveh@d1r2n2:/source/nutch/nutch$ ll build/lib/slf*
>> -rw-r--r-- 1 kaveh kaveh 25496 Jul  5  2010 build/lib/slf4j-api-1.6.1.jar
>> -rw-r--r-- 1 kaveh kaveh  9753 Jul  5  2010
> build/lib/slf4j-log4j12-1.6.1.jar
>>
>> since i need the job file i can't just manually change the files in
> build/lib, won't do me any good. now I don't know ant very well, and that
> is mostly why I am asking this from you guys. I have to say that I also
> changed the same thing in pom.xml as well:
>>
>>   <dependency>
>>          <groupId>org.slf4j</groupId>
>>          <artifactId>slf4j-log4j12</artifactId>
>>          <version>1.4.3</version>
>>         <optional>true</optional>
>> </dependency>
>>
>> but I still end up with the 1.6.1 version. I don't know how exactly ant
> and ivy and pom work together, so I am asking if there is any other config
> file that I am missing, or why while it is working fine for solr and
> zookeeper it is not affecting the slf4j?
>>
>> thanks,
>>
>>
>> On 02/16/2013 09:42 AM, Lewis John Mcgibbney wrote:
>>>
>>> A solution would be to manually prune the dependencies which are fetched
>>> via Ivy. If old slf4j dependencies are fetched for Hadoop via Ivy then
>>> maybe we need to make the exclusions explicit within ivy.xml. if you are
>>> able , then please provide a patch which fixes this if it is really a
>>> problem.
>>> It is important to note that pom.xml will most likely be outdated. You
>>> should build nutch with ant + ivy for the time being as this is stable.
>>> Thank you
>>> Lewis
>>>
>>> On Saturday, February 16, 2013, kaveh minooie <ka...@plutoz.com> wrote:
>>>>
>>>> unfortunately your links have been removed from the email that i got so
> i
>>>
>>> am not sure what [0] and [1] are, but this is what i am using :
>>>>
>>>> kaveh@d1r2n2:/source/nutch/nutch.git$ git remote -v
>>>> origin    https://github.com/apache/nutch.git (fetch)
>>>> origin    https://github.com/apache/nutch.git (push)
>>>> kaveh@d1r2n2:/source/nutch/nutch.git$ git branch -v
>>>> * 2.x   f02dcf6 NUTCH-XX remove unused db.max.inlinks from
>>>
>>> nutch-default.xml
>>>>
>>>>     trunk a7a1b41 NUTCH-1521 CrawlDbFilter pass null url to
> urlNormalizers
>>>> kaveh@d1r2n2:/2locos/source/nutch/nutch.git$
>>>>
>>>>
>>>> i am using branch 2.x
>>>>
>>>> On 02/15/2013 06:02 PM, Lewis John Mcgibbney wrote:
>>>>>
>>>>> Hi Kaveh,
>>>>>
>>>>> Two seconds please. First lets set some thing straight.
>>>>> Nutch trunk is from here [0]
>>>>> Nutch 2,x is from here [1]
>>>>> Which one do you use?
>>>>>
>>>>> On Fri, Feb 15, 2013 at 4:53 PM, kaveh minooie <ka...@plutoz.com>
> wrote:
>>>>>
>>>>>> but here is my problem. I tried to build the nutch using ver 1.4.3 of
>>>
>>> the
>>>>>>
>>>>>> slf4j. i changed the version in both ivy.xml and pom.xml and cleaned
> my
>>>
>>> ivy
>>>>>>
>>>>>> cache but ant still fetches the version 1.6.1 when it builds the
>>>
>>> project.
>>>>>>
>>>>>> what am I missing?
>>>>>>
>>>>>>
>>>>> We can progress with the problem once we know what's actually going on.
>>>>> Thanks
>>>>> Lewis
>>>>>
>>>>
>>
>>
>

-- 
Kaveh Minooie

www.plutoz.com

Re: slf4j issue with nutch 2.x over hadoop 1.1.1

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi,
So by the sounds of it slf4j is being pulled in transitively and you need
to determine from where.
You can use the ant report task, which makes ivy generate a nice dependency
report for you. You can then see which direct dependency includes slf4j. I
know for a fact that the gora-core dependency will pull it in.

Once you've identified where it is coming from you can make the relevant
exclusion.
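
As a rough sketch of the report step, an Ant target along these lines asks Ivy to resolve the module and write an HTML dependency report. The project name and output directory are made up for illustration, and the sketch assumes the Ivy jar is available to Ant. The Nutch build already provides its own report target, so this standalone file only shows roughly what such a target does.

```xml
<!-- build.xml (standalone sketch; names and paths are illustrative). -->
<project name="dependency-report" default="report"
         xmlns:ivy="antlib:org.apache.ivy.ant">
  <target name="report">
    <!-- Resolve the dependencies declared in the Ivy file. -->
    <ivy:resolve file="ivy/ivy.xml"/>
    <!-- Write the HTML report; skip the optional graph output. -->
    <ivy:report todir="build/ivy-report" graph="false"/>
  </target>
</project>
```

In the generated HTML, each module has a "Required by" table listing who asked for it and at which revision, which is the place to look for slf4j-api.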

On Saturday, February 16, 2013, kaveh minooie <ka...@plutoz.com> wrote:
> So when you say "prune the dependencies", I am not sure what you are
talking about cause what I could think of is not working.  let me explain
the situation again. nutch 2.x ivy file ( ivy/ivy.xml ) has this in it:
>
>  <dependencies>
>     <dependency org="org.elasticsearch" name="elasticsearch" rev="0.19.4"
>                 conf="*->default,sources"/>
>     <dependency org="org.apache.solr" name="solr-solrj" rev="3.4.0"
>       conf="*->default" />
>     <dependency org="org.slf4j" name="slf4j-log4j12" rev="1.6.1"
>       conf="*->master" />
>
> hadoop 1.1.1 ships with slf4j 1.4.3.  these 2 are not compatible. now, i
rather not mess with my hadoop cluster so I tried to downgrade slf4j in
nutch. I changed the above lines to :
>
>  <dependencies>
> <!--
>     <dependency org="org.elasticsearch" name="elasticsearch" rev="0.19.4"
>                 conf="*->default,sources"/>
>   -->
>     <dependency org="org.apache.zookeeper" name="zookeeper" rev="3.4.5"
>                 conf="*->default"/>
>     <dependency org="org.apache.solr" name="solr-solrj" rev="3.6.2"
>       conf="*->default" />
>     <dependency org="org.slf4j" name="slf4j-log4j12" rev="1.4.3"
>
> as you can see I am upgrading the solr and zookeeper and removing the
elasticsearch, and all of these changes work fine since I can see the
appropriate files in the build/lib directory after ant is done. but it
doesn't work for slf4j, and the files copied to build/lib ( and
subsequently in my job file ) are :
> kaveh@d1r2n2:/source/nutch/nutch$ ll build/lib/slf*
> -rw-r--r-- 1 kaveh kaveh 25496 Jul  5  2010 build/lib/slf4j-api-1.6.1.jar
> -rw-r--r-- 1 kaveh kaveh  9753 Jul  5  2010
build/lib/slf4j-log4j12-1.6.1.jar
>
> since i need the job file i can't just manually change the files in
build/lib, won't do me any good. now I don't know ant very well, and that
is mostly why I am asking this from you guys. I have to say that I also
changed the same thing in pom.xml as well:
>
>  <dependency>
>         <groupId>org.slf4j</groupId>
>         <artifactId>slf4j-log4j12</artifactId>
>         <version>1.4.3</version>
>        <optional>true</optional>
> </dependency>
>
> but I still end up with the 1.6.1 version. I don't know how exactly ant
and ivy and pom work together, so I am asking if there is any other config
file that I am missing, or why while it is working fine for solr and
zookeeper it is not affecting the slf4j?
>
> thanks,
>
>
> On 02/16/2013 09:42 AM, Lewis John Mcgibbney wrote:
>>
>> A solution would be to manually prune the dependencies which are fetched
>> via Ivy. If old slf4j dependencies are fetched for Hadoop via Ivy then
>> maybe we need to make the exclusions explicit within ivy.xml. if you are
>> able , then please provide a patch which fixes this if it is really a
>> problem.
>> It is important to note that pom.xml will most likely be outdated. You
>> should build nutch with ant + ivy for the time being as this is stable.
>> Thank you
>> Lewis
>>
>> On Saturday, February 16, 2013, kaveh minooie <ka...@plutoz.com> wrote:
>>>
>>> unfortunately your links have been removed from the email that i got so
i
>>
>> am not sure what [0] and [1] are, but this is what i am using :
>>>
>>> kaveh@d1r2n2:/source/nutch/nutch.git$ git remote -v
>>> origin    https://github.com/apache/nutch.git (fetch)
>>> origin    https://github.com/apache/nutch.git (push)
>>> kaveh@d1r2n2:/source/nutch/nutch.git$ git branch -v
>>> * 2.x   f02dcf6 NUTCH-XX remove unused db.max.inlinks from
>>
>> nutch-default.xml
>>>
>>>    trunk a7a1b41 NUTCH-1521 CrawlDbFilter pass null url to
urlNormalizers
>>> kaveh@d1r2n2:/2locos/source/nutch/nutch.git$
>>>
>>>
>>> i am using branch 2.x
>>>
>>> On 02/15/2013 06:02 PM, Lewis John Mcgibbney wrote:
>>>>
>>>> Hi Kaveh,
>>>>
>>>> Two seconds please. First lets set some thing straight.
>>>> Nutch trunk is from here [0]
>>>> Nutch 2,x is from here [1]
>>>> Which one do you use?
>>>>
>>>> On Fri, Feb 15, 2013 at 4:53 PM, kaveh minooie <ka...@plutoz.com>
wrote:
>>>>
>>>>> but here is my problem. I tried to build the nutch using ver 1.4.3 of
>>
>> the
>>>>>
>>>>> slf4j. i changed the version in both ivy.xml and pom.xml and cleaned
my
>>
>> ivy
>>>>>
>>>>> cache but ant still fetches the version 1.6.1 when it builds the
>>
>> project.
>>>>>
>>>>> what am I missing?
>>>>>
>>>>>
>>>> We can progress with the problem once we know what's actually going on.
>>>> Thanks
>>>> Lewis
>>>>
>>>
>
>

-- 
*Lewis*

Re: slf4j issue with nutch 2.x over hadoop 1.1.1

Posted by kaveh minooie <ka...@plutoz.com>.
So when you say "prune the dependencies", I am not sure what you are
talking about, because what I could think of is not working. Let me
explain the situation again. The nutch 2.x ivy file ( ivy/ivy.xml ) has this
in it:

  <dependencies>
     <dependency org="org.elasticsearch" name="elasticsearch" rev="0.19.4"
                 conf="*->default,sources"/>
     <dependency org="org.apache.solr" name="solr-solrj" rev="3.4.0"
       conf="*->default" />
     <dependency org="org.slf4j" name="slf4j-log4j12" rev="1.6.1"
       conf="*->master" />

hadoop 1.1.1 ships with slf4j 1.4.3. These two are not compatible. Now, I
would rather not mess with my hadoop cluster, so I tried to downgrade slf4j
in nutch. I changed the above lines to:

  <dependencies>
<!--
     <dependency org="org.elasticsearch" name="elasticsearch" rev="0.19.4"
                 conf="*->default,sources"/>
   -->
     <dependency org="org.apache.zookeeper" name="zookeeper" rev="3.4.5"
                 conf="*->default"/>
     <dependency org="org.apache.solr" name="solr-solrj" rev="3.6.2"
       conf="*->default" />
     <dependency org="org.slf4j" name="slf4j-log4j12" rev="1.4.3"

As you can see, I am upgrading solr and zookeeper and removing
elasticsearch, and all of these changes work fine, since I can see the
appropriate files in the build/lib directory after ant is done. But it
doesn't work for slf4j, and the files copied to build/lib (and
subsequently into my job file) are:
kaveh@d1r2n2:/source/nutch/nutch$ ll build/lib/slf*
-rw-r--r-- 1 kaveh kaveh 25496 Jul  5  2010 build/lib/slf4j-api-1.6.1.jar
-rw-r--r-- 1 kaveh kaveh  9753 Jul  5  2010 build/lib/slf4j-log4j12-1.6.1.jar

Since I need the job file, manually changing the files in build/lib
won't do me any good. I don't know ant very well, and that is mostly
why I am asking you guys. I have to say that I also changed the same
thing in pom.xml:

  <dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-log4j12</artifactId>
    <version>1.4.3</version>
    <optional>true</optional>
  </dependency>

But I still end up with the 1.6.1 version. I don't know exactly how ant,
ivy and the pom work together, so I am asking: is there another config
file that I am missing? And why does the change work for solr and
zookeeper but not affect slf4j?

thanks,


On 02/16/2013 09:42 AM, Lewis John Mcgibbney wrote:
> A solution would be to manually prune the dependencies which are fetched
> via Ivy. If old slf4j dependencies are fetched for Hadoop via Ivy then
> maybe we need to make the exclusions explicit within ivy.xml. if you are
> able , then please provide a patch which fixes this if it is really a
> problem.
> It is important to note that pom.xml will most likely be outdated. You
> should build nutch with ant + ivy for the time being as this is stable.
> Thank you
> Lewis
>
> On Saturday, February 16, 2013, kaveh minooie <ka...@plutoz.com> wrote:
>> unfortunately your links have been removed from the email that i got so i
> am not sure what [0] and [1] are, but this is what i am using :
>> kaveh@d1r2n2:/source/nutch/nutch.git$ git remote -v
>> origin    https://github.com/apache/nutch.git (fetch)
>> origin    https://github.com/apache/nutch.git (push)
>> kaveh@d1r2n2:/source/nutch/nutch.git$ git branch -v
>> * 2.x   f02dcf6 NUTCH-XX remove unused db.max.inlinks from
> nutch-default.xml
>>    trunk a7a1b41 NUTCH-1521 CrawlDbFilter pass null url to urlNormalizers
>> kaveh@d1r2n2:/2locos/source/nutch/nutch.git$
>>
>>
>> i am using branch 2.x
>>
>> On 02/15/2013 06:02 PM, Lewis John Mcgibbney wrote:
>>> Hi Kaveh,
>>>
>>> Two seconds please. First lets set some thing straight.
>>> Nutch trunk is from here [0]
>>> Nutch 2,x is from here [1]
>>> Which one do you use?
>>>
>>> On Fri, Feb 15, 2013 at 4:53 PM, kaveh minooie <ka...@plutoz.com> wrote:
>>>
>>>> but here is my problem. I tried to build the nutch using ver 1.4.3 of
> the
>>>> slf4j. i changed the version in both ivy.xml and pom.xml and cleaned my
> ivy
>>>> cache but ant still fetches the version 1.6.1 when it builds the
> project.
>>>> what am I missing?
>>>>
>>>>
>>> We can progress with the problem once we know what's actually going on.
>>> Thanks
>>> Lewis
>>>
>>


Re: slf4j issue with nutch 2.x over hadoop 1.1.1

Posted by Lewis John Mcgibbney <le...@gmail.com>.
A solution would be to manually prune the dependencies which are fetched
via Ivy. If old slf4j dependencies are fetched for Hadoop via Ivy then
maybe we need to make the exclusions explicit within ivy.xml. If you are
able, then please provide a patch which fixes this if it is really a
problem.
It is important to note that pom.xml will most likely be outdated. You
should build nutch with ant + ivy for the time being as this is stable.
Thank you
Lewis
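
The manual pruning suggested above could start from a list of the slf4j jars
that do not match the version Hadoop expects. The following is only a sketch:
the function name stale_slf4j_jars and the directory passed to it are
assumptions for illustration, not part of Nutch or Ivy. It only prints
candidates and deletes nothing.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): print every slf4j jar under DIR
# whose file name does not end in the version we want to keep.
# Usage: stale_slf4j_jars DIR KEEP_VERSION
stale_slf4j_jars() {
    # -name 'slf4j-*.jar' selects all slf4j artifacts (api, log4j12, ...);
    # the negated -name drops the ones that already carry KEEP_VERSION.
    find "$1" -type f -name 'slf4j-*.jar' ! -name "slf4j-*-$2.jar" |
        LC_ALL=C sort
}
```

For example, `stale_slf4j_jars runtime/local/lib 1.4.3` (the path is an
assumption about where the build puts its jars) would print the jars to
review. A longer-term fix would still be an explicit exclusion in ivy.xml,
as described above.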

On Saturday, February 16, 2013, kaveh minooie <ka...@plutoz.com> wrote:
> unfortunately your links have been removed from the email that i got so i am not sure what [0] and [1] are, but this is what i am using :
>
> kaveh@d1r2n2:/source/nutch/nutch.git$ git remote -v
> origin    https://github.com/apache/nutch.git (fetch)
> origin    https://github.com/apache/nutch.git (push)
> kaveh@d1r2n2:/source/nutch/nutch.git$ git branch -v
> * 2.x   f02dcf6 NUTCH-XX remove unused db.max.inlinks from nutch-default.xml
>   trunk a7a1b41 NUTCH-1521 CrawlDbFilter pass null url to urlNormalizers
> kaveh@d1r2n2:/2locos/source/nutch/nutch.git$
>
>
> i am using branch 2.x
>
> On 02/15/2013 06:02 PM, Lewis John Mcgibbney wrote:
>>
>> Hi Kaveh,
>>
>> Two seconds please. First lets set some thing straight.
>> Nutch trunk is from here [0]
>> Nutch 2,x is from here [1]
>> Which one do you use?
>>
>> On Fri, Feb 15, 2013 at 4:53 PM, kaveh minooie <ka...@plutoz.com> wrote:
>>
>>> but here is my problem. I tried to build the nutch using ver 1.4.3 of the
>>> slf4j. i changed the version in both ivy.xml and pom.xml and cleaned my ivy
>>> cache but ant still fetches the version 1.6.1 when it builds the project.
>>> what am I missing?
>>>
>>>
>> We can progress with the problem once we know what's actually going on.
>> Thanks
>> Lewis
>>
>
>

-- 
*Lewis*

Re: slf4j issue with nutch 2.x over hadoop 1.1.1

Posted by kaveh minooie <ka...@plutoz.com>.
Unfortunately, your links were removed from the email that I received, so
I am not sure what [0] and [1] are, but this is what I am using:

kaveh@d1r2n2:/source/nutch/nutch.git$ git remote -v
origin    https://github.com/apache/nutch.git (fetch)
origin    https://github.com/apache/nutch.git (push)
kaveh@d1r2n2:/source/nutch/nutch.git$ git branch -v
* 2.x   f02dcf6 NUTCH-XX remove unused db.max.inlinks from nutch-default.xml
   trunk a7a1b41 NUTCH-1521 CrawlDbFilter pass null url to urlNormalizers
kaveh@d1r2n2:/2locos/source/nutch/nutch.git$


I am using branch 2.x.

On 02/15/2013 06:02 PM, Lewis John Mcgibbney wrote:
> Hi Kaveh,
>
> Two seconds please. First lets set some thing straight.
> Nutch trunk is from here [0]
> Nutch 2,x is from here [1]
> Which one do you use?
>
> On Fri, Feb 15, 2013 at 4:53 PM, kaveh minooie <ka...@plutoz.com> wrote:
>
>> but here is my problem. I tried to build the nutch using ver 1.4.3 of the
>> slf4j. i changed the version in both ivy.xml and pom.xml and cleaned my ivy
>> cache but ant still fetches the version 1.6.1 when it builds the project.
>> what am I missing?
>>
>>
> We can progress with the problem once we know what's actually going on.
> Thanks
> Lewis
>


Re: slf4j issue with nutch 2.x over hadoop 1.1.1

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Kaveh,

Two seconds please. First, let's set something straight.
Nutch trunk is from here [0]
Nutch 2.x is from here [1]
Which one do you use?

On Fri, Feb 15, 2013 at 4:53 PM, kaveh minooie <ka...@plutoz.com> wrote:

> but here is my problem. I tried to build the nutch using ver 1.4.3 of the
> slf4j. i changed the version in both ivy.xml and pom.xml and cleaned my ivy
> cache but ant still fetches the version 1.6.1 when it builds the project.
> what am I missing?
>
>
We can progress with the problem once we know what's actually going on.
Thanks
Lewis

slf4j issue with nutch 2.x over hadoop 1.1.1

Posted by kaveh minooie <ka...@plutoz.com>.
Hi everyone,
    I recently built Nutch 2.x from the trunk, but it crashes almost
immediately at run time. It seems that there is a version
incompatibility between the slf4j in Hadoop (1.4.3) and the one
in Nutch (1.6.1). (More precisely, it is between versions 1.6 and above and those below 1.6.)

$ PATH="$(pwd)/bin:$PATH" bin/nutch inject /temp/urls/
Error: Could not find or load main class org.apache.hadoop.util.PlatformName
13/02/15 15:47:15 INFO crawl.InjectorJob: InjectorJob: starting at 2013-02-15 15:47:15
13/02/15 15:47:15 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir: /temp/urls
Exception in thread "main" java.lang.NoSuchMethodError: org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V
	at org.apache.commons.logging.impl.SLF4JLocationAwareLog.debug(SLF4JLocationAwareLog.java:133)
	at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:139)
	at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:205)
	at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:184)
	at org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:236)
	at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:477)
	at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:463)
	at org.apache.hadoop.mapreduce.JobContext.<init>(JobContext.java:80)
	at org.apache.hadoop.mapreduce.Job.<init>(Job.java:50)
	at org.apache.hadoop.mapreduce.Job.<init>(Job.java:54)
	at org.apache.nutch.util.NutchJob.<init>(NutchJob.java:37)
	at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:214)
	at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:251)
	at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:273)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:282)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.lang.reflect.Method.invoke(Unknown Source)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:156)



But here is my problem: I tried to build Nutch using version 1.4.3 of
slf4j. I changed the version in both ivy.xml and pom.xml and cleaned
my Ivy cache, but Ant still fetches version 1.6.1 when it builds the
project. What am I missing?

thanks,
-- 
Kaveh Minooie

www.plutoz.com
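
Before changing ivy.xml again, it may help to confirm which slf4j-api
versions actually end up next to the build output. One possible explanation,
stated here as an assumption rather than a diagnosis, is that another
dependency pulls in the newer slf4j transitively. A minimal sketch follows;
the function name slf4j_versions is hypothetical and the directory is
whatever the build writes its jars to.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): print the distinct slf4j-api
# versions found anywhere under a directory tree, one per line.
# Usage: slf4j_versions DIR
slf4j_versions() {
    find "$1" -type f -name 'slf4j-api-*.jar' 2>/dev/null |
        # Strip the leading path and "slf4j-api-" prefix, then the ".jar" suffix.
        sed -e 's|.*/slf4j-api-||' -e 's|\.jar$||' |
        LC_ALL=C sort -u
}
```

Running it once against the Nutch build directory and once against the
Hadoop lib directory would show whether both 1.4.3 and 1.6.1 are present
at run time.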

Re: Nutch 2.1 over Hadoop 1.0.3 and HBase 0.94.2

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Amit,

On Thu, Feb 14, 2013 at 6:24 AM, Amit Sela <am...@infolinks.com> wrote:

>
> I already have a running Hadoop cluster with Hadoop 1.0.3 and HBase 0.94.2,
> and I saw that Nutch 2.1 with Gora supports HBase as backend.
>

First things first: we cannot guarantee that Gora, and subsequently Nutch,
will work with the newer HBase 0.94.x branch.
You could try it out and get back to us, but my expectation is that it is
most likely incompatible.


> I would like to start by running a basic crawler with this installations on
> a standalone machine and after I get the hang of it deploy it on the
> cluster / set up on another cluster.
>
> Anyone has a good advise for installation / setup ?
>

http://wiki.apache.org/nutch/#Other_Tutorial.28s.29


>
> Anyone used Nutch for website categorization ?
>

You can find some info on suggestions from this thread
http://www.mail-archive.com/user@nutch.apache.org/msg08066.html


>
>