Posted to user@nutch.apache.org by Thomas COUDERC <TC...@mediametrie.fr> on 2013/10/10 14:10:24 UTC

Nutch 2.2.1 with Map Reduce

Hi everybody,

I'm new to mailing lists so excuse me if I made a mistake. Also, I'm a new
dev contributor for Sauce Labs and DynamoDB subjects.

I read all the tutorials for Nutch 2.x and I got Nutch 2.2.1 working with
Cassandra 1.2.8 using Gora 0.3.
I read that Nutch 2.2.1 (and previous versions) can be run on a Hadoop
cluster.
I also know that Gora manages some MapReduce operations for the backend.

I have two questions :

1/ If Nutch is deployed on a MapReduce cluster and, for example, HBase is
used as the datastore, where are the MapReduce tasks distributed? On the
Nutch Hadoop cluster or on HBase (via Gora)?
2/ In my case, I use standalone Cassandra. If I deploy Nutch 2.2.1 on a
Hadoop cluster, how many of Nutch can fetch URLs? (1 or all?)

Thank you for helping me, and excuse my poor English.

Thomas
Nous vous rappelons que les résultats de Médiamétrie sont et demeurent sa propriété : ils sont protégés au double 
titre du droit d'auteur et de la protection des bases de données.
Ce message est confidentiel et établi à 
l'intention de ses destinataires.
Tout message électronique étant susceptible d'altération,
la société Médiamétrie 
décline toute responsabilité s'il a été altéré, déformé ou falsifié.


We remind you that the results produced by Médiamétrie are and remain its sole property covered by both copyright 
and databases protection.
This message is confidential and intended solely for the adressees.
E-mails are susceptible 
to alteration.
Neither Médiamétrie company shall be liable for the message if altered, changed or falsified.


Réf : Re: Réf : Re: Nutch 2.2.1 with Map Reduce

Posted by Thomas COUDERC <TC...@mediametrie.fr>.
Julien,

Thank you very much for the explanations, it makes sense to me now.

I bought the book you recommended and I will also look at the
links you provided.

For those who would like to have an overview of what "inputs" means, go
here.

By the way, your full name also sounds like a French one.

Bye!


Re: Réf : Re: Nutch 2.2.1 with Map Reduce

Posted by Julien Nioche <li...@gmail.com>.
Hi Thomas

Well your name + email address is a good clue, isn't it?

Have you looked at the presentations listed in
http://wiki.apache.org/nutch/Presentations? They should help you understand
some of the concepts. There is also a very good chapter on Nutch in Tom
White's book on Hadoop.
There is also relevant stuff on the Gora site.

I don't have much time to answer in detail but :

> - For Nutch 2.2.1, when is MapReduce used (jobs/content retrieval/...)
> and by whom (Nutch/Gora/datastore/...)?

Every task in Nutch (1 and 2) is one or more MapReduce jobs. Nutch is
basically that: a collection of MapReduce jobs called sequentially (+ a
few other things of course). Nutch 2 uses GORA to provide the inputs for
the MapReduce jobs from various datastores.
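
To make "a collection of MapReduce jobs called sequentially" concrete, here is a toy sketch in plain Python. It is not Nutch or Gora code: `run_mapreduce`, `store`, `generate_mapper`, `fetch_mapper` and `keep_first` are hypothetical names, and a dictionary stands in for the datastore that GORA would read from.

```python
# Toy model only: plain Python standing in for Hadoop, Gora and the datastore.
# None of these names are Nutch or Gora APIs.
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run one map/shuffle/reduce pass over (key, value) records."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical datastore: one row per URL, as a NoSQL backend might hold it.
store = {
    "http://a.example/": {"status": "unfetched"},
    "http://b.example/": {"status": "fetched"},
}


def generate_mapper(url, row):
    # Select only the rows that still need fetching.
    if row["status"] == "unfetched":
        yield url, row


def fetch_mapper(url, row):
    # A real fetcher would download the page here; we only flip the status.
    yield url, {**row, "status": "fetched"}


def keep_first(url, rows):
    # Trivial reducer: each URL appears once, so keep its single row.
    return rows[0]


# The "crawl" is just jobs called one after another, each reading its
# input from the store (the role Gora plays in Nutch 2).
selected = run_mapreduce(store.items(), generate_mapper, keep_first)
fetched = run_mapreduce(selected.items(), fetch_mapper, keep_first)
store.update(fetched)
```

Each call to `run_mapreduce` plays the part of one job; the real steps (inject, generate, fetch, ...) would each be one or more such jobs submitted to Hadoop.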

> - Does it make sense to use a Nutch 2.2.1 cluster (on top of Hadoop) that
> uses NoSQL datastore (like Hbase using also Hadoop)? And why?

Let's put it this way: "use Nutch 2.2.1 on a Hadoop cluster", and as I
said GORA provides the input from NoSQL stores. Why? Because it can
simplify the architecture (no more segments), allow atomic operations
(reads/writes), which HDFS data structures can't do, and give more options on
how to do certain things. For instance, the update step in Nutch 1 is costly:
if you want to modify just a few URLs, you still need to read and write the
whole crawldb.
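
The cost difference for the update step can be illustrated with a small sketch. It only counts record writes in two toy structures and is not how Nutch stores anything: `update_flat_file` and `update_keyed_store` are hypothetical helpers, a list stands in for a crawldb that has to be rewritten as a whole, and a dictionary stands in for a keyed NoSQL table.

```python
# Toy comparison; neither function is Nutch code.

def update_flat_file(crawldb, changes):
    """Whole-file update: every record is read and written again."""
    rewritten = []
    for url, status in crawldb:
        # Unchanged records are copied too, so they still cost a write.
        rewritten.append((url, changes.get(url, status)))
    return rewritten, len(rewritten)


def update_keyed_store(store, changes):
    """Keyed-store update: only the changed rows are written."""
    for url, status in changes.items():
        store[url] = status
    return len(changes)


urls = [f"http://site{i}.example/" for i in range(1000)]
changes = {urls[3]: "fetched", urls[42]: "fetched"}

flat = [(u, "unfetched") for u in urls]
new_flat, flat_writes = update_flat_file(flat, changes)

keyed = {u: "unfetched" for u in urls}
keyed_writes = update_keyed_store(keyed, changes)
```

With 1000 URLs and 2 changes, `flat_writes` is 1000 while `keyed_writes` is 2, and both structures end up holding the same data.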

At the moment the performance of Nutch 2 is not on par with Nutch 1 and
hopefully some of it will be addressed in GORA at some point.

If you don't have a specific reason to use Nutch 2 then Nutch 1 would be a
good starting point and would help to get familiar with the main concepts.

Julien


On 10 October 2013 19:21, Thomas COUDERC <TC...@mediametrie.fr> wrote:

>
> Hi Julien,
>
> Thank you very much for your answer !
> I see you noticed I was french ;)
> I added my answers below as you did before :
>
>
> > Bonjour Thomas
>
> > answers below
>
>
> > On 10 October 2013 13:10, Thomas COUDERC <TC...@mediametrie.fr>
> wrote:
>
> >> Hi everybody,
> >>
> >> I'm new to mailing lists so excuse me if I made a mistake. Also, I'm a
> new
> >> dev contributor for Sauce Labs and DynamoDB subjects.
> >>
> >> I read all tutorials for Nutch 2.x and I made Nutch 2.2.1 working with
> >> cassandra 1.2.8 using gora 0.3.
> >> I read that Nutch 2.2.1 (and previous versions) can be run on a Hadoop
> >> cluster.
> >> I also know that gora manage some map reduce operations for backend.
>
> > GORA wraps the content from the backends into inputs for Mapreduce.
>
> For which mapreduce task? Any task (inject, generate, ...) ?
> Does Gora wraps the content from backend not using any Mapreduce?
>
> >> I have two questions :
> >>
> >> 1/ If Nutch is deployed on a Map Reduce cluster, and for example Hbase
> is
> >> used as datastore,  where are the Map Reduce tasks distributed? Nutch
> >> hadoop cluster or HBase (via Gora).
>
> > not clear what you mean by distributed.
> > Nutch uses Gora internally to pull the content from the backends and This
> > happens on the Hadoop side so to speak, not within the backends.
>
> I don't really understand what you mean. I think I am a bit confused with
> the fact that a datastore can work on top of some MapReduce system (HBase,
> cassandra also maybe, ...) and the fact that Nutch can also be deployed on
> top of a such system. In that case with which one does GORA deals?
>
> >> 2/ In my case, I use Cassandra standalone. If I deploy Nutch 2.2.1 on an
> >> Hadoop Cluster, how many of Nutch can fetch URLs? (1 or all?)
>
> > I don't understand what you mean by 'how many of Nutch'. The number of
> > mappers used for the fetching depends on your configuration, the
> > distribution of URLs and the configuration of the Hadoop cluster.
>
> I thought that in a Nutch cluster there were as many Nutchs as the number
> of machines. For example with a 5 machines cluster, I thought that there
> were 5 Nutchs available, but I think I'm totally wrong. I don't really
> understand how the Nutch .job (in deploy folder) are working and what it
> means. I cannot find some information for that point.
> In fact the question was : can the mappers used for fetching be located on
> each machine of the cluster so that it is possible to see incoming network
> trafic on each machine?
>
>
> Maybe I get really confused on these 2 points :
>  - For Nutch 2.2.1, when is MapReduce used (jobs/content retrievemnt/...)
> and by whom (Nutch/Gora/datastore/...) ?
>  - Does it make sense to use a Nutch 2.2.1 cluster (on top of Hadoop) that
> uses NoSQL datastore (like Hbase using also Hadoop)? And why?
>
> I will try to find these informations into the source code or in the
> internet during the next days . If you have some links it would really help
> me.
>
> Maybe, I could synthetize these informations into graphical diagrams for
> the wiki.
>
>
> Again, Thank you very much for your help Julien.
>
>
> HTH
>
> Julien
>
>
>
> >
> > Thank you for helping me., and excuse me for my poor English.
> >
> > Thomas
> >
> >
>
>
> --
> *
> *Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble




-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Réf : Re: Nutch 2.2.1 with Map Reduce

Posted by Thomas COUDERC <TC...@mediametrie.fr>.
Hi Julien,

Thank you very much for your answer!
I see you noticed I was French ;)
I added my answers below as you did before:


> Bonjour Thomas

> answers below


> On 10 October 2013 13:10, Thomas COUDERC <TC...@mediametrie.fr> wrote:

>> Hi everybody,
>>
>> I'm new to mailing lists so excuse me if I made a mistake. Also, I'm a new
>> dev contributor for Sauce Labs and DynamoDB subjects.
>>
>> I read all tutorials for Nutch 2.x and I made Nutch 2.2.1 working with
>> cassandra 1.2.8 using gora 0.3.
>> I read that Nutch 2.2.1 (and previous versions) can be run on a Hadoop
>> cluster.
>> I also know that gora manage some map reduce operations for backend.

> GORA wraps the content from the backends into inputs for Mapreduce.

For which MapReduce task? Any task (inject, generate, ...)?
Does Gora wrap the content from the backend without using any MapReduce?

>> I have two questions :
>>
>> 1/ If Nutch is deployed on a Map Reduce cluster, and for example Hbase is
>> used as datastore,  where are the Map Reduce tasks distributed? Nutch
>> hadoop cluster or HBase (via Gora).

> not clear what you mean by distributed.
> Nutch uses Gora internally to pull the content from the backends and This
> happens on the Hadoop side so to speak, not within the backends.

I don't really understand what you mean. I think I am a bit confused by
the fact that a datastore can work on top of some MapReduce system (HBase,
maybe Cassandra too, ...) and the fact that Nutch can also be deployed on
top of such a system. In that case, which one does GORA deal with?

>> 2/ In my case, I use Cassandra standalone. If I deploy Nutch 2.2.1 on an
>> Hadoop Cluster, how many of Nutch can fetch URLs? (1 or all?)

> I don't understand what you mean by 'how many of Nutch'. The number of
> mappers used for the fetching depends on your configuration, the
> distribution of URLs and the configuration of the Hadoop cluster.

I thought that in a Nutch cluster there were as many Nutch instances as
machines. For example, with a 5-machine cluster, I thought that there
were 5 Nutch instances available, but I think I'm totally wrong. I don't really
understand how the Nutch .job file (in the deploy folder) works and what it
means. I cannot find any information on that point.
In fact the question was: can the mappers used for fetching be located on
each machine of the cluster, so that it is possible to see incoming network
traffic on each machine?


Maybe I am really confused about these 2 points:
 - For Nutch 2.2.1, when is MapReduce used (jobs/content retrieval/...)
and by whom (Nutch/Gora/datastore/...)?
 - Does it make sense to use a Nutch 2.2.1 cluster (on top of Hadoop) that
uses a NoSQL datastore (like HBase, which also uses Hadoop)? And why?

I will try to find this information in the source code or on the
internet over the next few days. If you have some links, it would really help
me.

Maybe I could synthesize this information into diagrams for
the wiki.


Again, thank you very much for your help, Julien.


> HTH
>
> Julien



>
> Thank you for helping me., and excuse me for my poor English.
>
> Thomas
>
>



Re: Nutch 2.2.1 with Map Reduce

Posted by Julien Nioche <li...@gmail.com>.
Bonjour Thomas

answers below


On 10 October 2013 13:10, Thomas COUDERC <TC...@mediametrie.fr> wrote:

>
> Hi everybody,
>
> I'm new to mailing lists so excuse me if I made a mistake. Also, I'm a new
> dev contributor for Sauce Labs and DynamoDB subjects.
>
> I read all tutorials for Nutch 2.x and I made Nutch 2.2.1 working with
> cassandra 1.2.8 using gora 0.3.
> I read that Nutch 2.2.1 (and previous versions) can be run on a Hadoop
> cluster.
> I also know that gora manage some map reduce operations for backend.
>

GORA wraps the content from the backends into inputs for MapReduce.



>
> I have two questions :
>
> 1/ If Nutch is deployed on a Map Reduce cluster, and for example Hbase is
> used as datastore,  where are the Map Reduce tasks distributed? Nutch
> hadoop cluster or HBase (via Gora).
>

It's not clear what you mean by distributed.
Nutch uses Gora internally to pull the content from the backends, and this
happens on the Hadoop side so to speak, not within the backends.



> 2/ In my case, I use Cassandra standalone. If I deploy Nutch 2.2.1 on an
> Hadoop Cluster, how many of Nutch can fetch URLs? (1 or all?)
>

I don't understand what you mean by 'how many of Nutch'. The number of
mappers used for the fetching depends on your configuration, the
distribution of URLs and the configuration of the Hadoop cluster.
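
As a rough illustration of why the distribution of URLs matters, the sketch below assigns URLs to fetch tasks by hashing their host. Grouping by host is an assumption made for this example and is not stated above; `partition_by_host` is a hypothetical helper, and the hash is `zlib.crc32`, not whatever Hadoop or Nutch actually use.

```python
import zlib
from collections import defaultdict
from urllib.parse import urlparse


def partition_by_host(urls, num_tasks):
    """Assign each URL to a fetch task based on a hash of its host."""
    tasks = defaultdict(list)
    for url in urls:
        host = urlparse(url).hostname
        # Same host always gives the same task id (assumed grouping rule).
        task_id = zlib.crc32(host.encode("utf-8")) % num_tasks
        tasks[task_id].append(url)
    return dict(tasks)


# 20 URLs on a single host versus 20 URLs spread over 20 hosts.
single_host = [f"http://one.example/page{i}" for i in range(20)]
many_hosts = [f"http://site{i}.example/" for i in range(20)]

busy_single = partition_by_host(single_host, num_tasks=5)
busy_many = partition_by_host(many_hosts, num_tasks=5)
```

Under this assumption, a crawl of one host keeps a single task busy however many tasks are configured, while URLs from many hosts can spread over up to `num_tasks` tasks, and so over several machines if the cluster schedules them that way.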

HTH

Julien



>
> Thank you for helping me., and excuse me for my poor English.
>
> Thomas
>
>


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble