Posted to user@nutch.apache.org by Thomas COUDERC <TC...@mediametrie.fr> on 2013/10/11 11:50:28 UTC

Réf : Re: Réf : Re: Nutch 2.2.1 with Map Reduce

Julien,

Thank you very much for the explanations, it makes sense to me now.

I bought the book you told me about so I can carry on, and I will also look
into the links you provided.

For those who would like to have an overview of what "inputs" means, go
here.

By the way, your full name also sounds like a French one.

Bye!

Hi Thomas

Well your name + email address is a good clue, isn't it?

Have you looked at the presentations listed in
http://wiki.apache.org/nutch/Presentations? They should help you understand
some of the concepts. There is also a very good chapter on Nutch in Tom
White's book on Hadoop.
There is also relevant stuff on the Gora site.

I don't have much time to answer in detail but :

* - For Nutch 2.2.1, when is MapReduce used (jobs/content retrieval/...)
and by whom (Nutch/Gora/datastore/...) ?*

Every task in Nutch (1 and 2) is one or more MapReduce jobs. Nutch is
basically that : a collection of MapReduce jobs called sequentially (+ a
few other things of course). Nutch 2 uses GORA to provide the inputs for
the MapReduce jobs from various datastores.
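
To make "a collection of MapReduce jobs called sequentially" more concrete,
here is a minimal in-memory sketch. It is not Nutch or Gora code: the
function names (run_mapreduce, inject, generate, fetch, update, crawl_once)
and the status strings are hypothetical, a plain dict stands in for the
datastore that Gora would expose as job input, the parse step is left out,
and nothing is actually downloaded.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run one toy MapReduce job over (key, value) records, in memory."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return {key: reducer(key, values) for key, values in groups.items()}


def inject(store, seed_urls):
    # Job 1: add seed URLs that the store does not know yet.
    out = run_mapreduce(
        [(url, None) for url in seed_urls],
        mapper=lambda url, _: [(url, "unfetched")],
        reducer=lambda url, statuses: statuses[0],
    )
    for url, status in out.items():
        store.setdefault(url, status)


def generate(store):
    # Job 2: read the store and select the URLs that still need fetching.
    return run_mapreduce(
        list(store.items()),
        mapper=lambda url, status: [(url, status)] if status == "unfetched" else [],
        reducer=lambda url, statuses: "generated",
    )


def fetch(batch):
    # Job 3: pretend to download each selected URL (no network access).
    return run_mapreduce(
        list(batch.items()),
        mapper=lambda url, _: [(url, "fetched")],
        reducer=lambda url, statuses: statuses[0],
    )


def update(store, fetched):
    # Job 4: write the new statuses back to the store.
    store.update(fetched)


def crawl_once(store, seed_urls):
    # One crawl cycle: the jobs are simply called one after the other.
    inject(store, seed_urls)
    batch = generate(store)
    fetched = fetch(batch)
    update(store, fetched)
    return store
```

Each step only reads from and writes to the shared store, which is roughly
the role the datastore plays between the jobs in this simplified picture.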

* - Does it make sense to use a Nutch 2.2.1 cluster (on top of Hadoop) that
uses a NoSQL datastore (like HBase, which also uses Hadoop)? And why?*

Let's put it this way : "use Nutch 2.2.1 on a Hadoop cluster", and as I
said GORA provides the input from NoSQL stores. Why? Because it can
simplify the architecture (no more segments), allow atomic operations
(reads/writes) which HDFS data structures can't do, and give more options on
how to do certain things. For instance the update step in Nutch 1 is costly:
if you want to modify just a few URLs then you need to read and write the
whole crawldb anyway.
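
The cost difference can be illustrated with a small sketch. This is a
simplified model, not the actual Nutch 1 or Gora code: update_flat and
update_keyed are hypothetical helpers, a list of (url, status) pairs stands
in for a sequential crawldb file, and a dict stands in for a keyed NoSQL
store. The returned counts are the number of records each approach touches.

```python
def update_flat(crawldb, changes):
    """Sequential-file style: every record is read and rewritten."""
    rewritten = [(url, changes.get(url, status)) for url, status in crawldb]
    return rewritten, len(crawldb)


def update_keyed(store, changes):
    """Keyed-store style: only the changed records are written."""
    for url, status in changes.items():
        store[url] = status
    return len(changes)


# 1000 known URLs, of which only 2 need a new status.
crawldb = [("http://site.example/%d" % i, "unfetched") for i in range(1000)]
changes = {
    "http://site.example/3": "fetched",
    "http://site.example/7": "gone",
}

rewritten, touched_flat = update_flat(crawldb, changes)

store = dict(crawldb)
touched_keyed = update_keyed(store, changes)
```

Both approaches end with the same contents, but in this model the flat
version touches all 1000 records while the keyed version touches 2.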

At the moment the performance of Nutch 2 is not on par with Nutch 1 and
hopefully some of it will be addressed in GORA at some point.

If you don't have a specific reason to use Nutch 2 then Nutch 1 would be a
good starting point and would help to get familiar with the main concepts.

Julien


On 10 October 2013 19:21, Thomas COUDERC <TC...@mediametrie.fr> wrote:

>
> Hi Julien,
>
> Thank you very much for your answer !
> I see you noticed I was French ;)
> I added my answers below as you did before :
>
>
> > Bonjour Thomas
>
> > answers below
>
>
> > On 10 October 2013 13:10, Thomas COUDERC <TC...@mediametrie.fr>
> wrote:
>
> >> Hi everybody,
> >>
> >> I'm new to mailing lists so excuse me if I made a mistake. Also, I'm a
> >> new dev contributor for Sauce Labs and DynamoDB subjects.
> >>
> >> I read all the tutorials for Nutch 2.x and I got Nutch 2.2.1 working with
> >> Cassandra 1.2.8 using Gora 0.3.
> >> I read that Nutch 2.2.1 (and previous versions) can be run on a Hadoop
> >> cluster.
> >> I also know that Gora manages some MapReduce operations for the backend.
>
> > GORA wraps the content from the backends into inputs for Mapreduce.
>
> For which MapReduce task? Any task (inject, generate, ...) ?
> Does Gora wrap the content from the backend without using any MapReduce?
>
> >> I have two questions :
> >>
> >> 1/ If Nutch is deployed on a Map Reduce cluster, and for example HBase is
> >> used as datastore, where are the Map Reduce tasks distributed? Nutch
> >> Hadoop cluster or HBase (via Gora)?
>
> > not clear what you mean by distributed.
> > Nutch uses Gora internally to pull the content from the backends and this
> > happens on the Hadoop side so to speak, not within the backends.
>
> I don't really understand what you mean. I think I am a bit confused by
> the fact that a datastore can work on top of some MapReduce system (HBase,
> maybe Cassandra too, ...) and the fact that Nutch can also be deployed on
> top of such a system. In that case, which one does GORA deal with?
>
> >> 2/ In my case, I use Cassandra standalone. If I deploy Nutch 2.2.1 on a
> >> Hadoop cluster, how many of Nutch can fetch URLs? (1 or all?)
>
> > I don't understand what you mean by 'how many of Nutch'. The number of
> > mappers used for the fetching depends on your configuration, the
> > distribution of URLs and the configuration of the Hadoop cluster.
>
> I thought that in a Nutch cluster there were as many Nutch instances as
> machines. For example with a 5-machine cluster, I thought that there
> were 5 Nutch instances available, but I think I'm totally wrong. I don't
> really understand how the Nutch .job file (in the deploy folder) works and
> what it means. I cannot find any information on that point.
> In fact the question was : can the mappers used for fetching be located on
> each machine of the cluster so that it is possible to see incoming network
> traffic on each machine?
>
>
> Maybe I get really confused on these 2 points :
>  - For Nutch 2.2.1, when is MapReduce used (jobs/content retrieval/...)
> and by whom (Nutch/Gora/datastore/...) ?
>  - Does it make sense to use a Nutch 2.2.1 cluster (on top of Hadoop) that
> uses a NoSQL datastore (like HBase, which also uses Hadoop)? And why?
>
> I will try to find this information in the source code or on the
> internet over the next few days. If you have some links it would really
> help me.
>
> Maybe I could summarize this information in diagrams for
> the wiki.
>
>
> Again, thank you very much for your help Julien.
>
>
> HTH
>
> Julien
>
>
>
> >
> > Thank you for helping me, and excuse me for my poor English.
> >
> > Thomas
>
>
> --
> *
> *Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble



