Posted to user@nutch.apache.org by Thiago Galery <tg...@gmail.com> on 2016/04/06 14:48:59 UTC

Best Practices for Plugin Dev and Deployment

Dear list,
I'm a new Nutch Developer and I have a few questions to ask you.

1 - Are there any general guidelines for plugin development (in addition to
the ones specified in the wiki guide)?
I looked around GitHub and it seems that many plugins are developed as a
monolithic piece of code that is attached to / forked from the main Nutch
repo. I take it that, ideally, plugins should be developed as their own
separate repositories, so they can be versioned and tested against
different versions of Nutch. Is there a recommended way to do this? I'm
considering using git submodules to add plugin repos as Nutch dependencies,
or else creating symlinks from the plugins folder to the right plugin
repositories.

2 - As a specific use case for point (1), I have developed a plugin that
reads some Machine Learning models from a directory. Ideally, I'd like to
keep the files in the same repository as the plugin, set up so that
it can be tested, versioned and developed as an independent repo.
At the moment, I can only make it work by specifying the path to these
models in nutch-site.xml, but I wonder whether that directory could be
made accessible to the plugin in some other way (either by some classes in
the Plugin system or by ivy/ant). Any thoughts?

3 - Is there any tooling developed by the community to deploy and monitor
Nutch applications? At the moment, we have a script that deploys Nutch, but
it is not robust enough. I see that there's a Dockerfile. I'm just wondering
if it could be used (possibly together with some other tooling) to provision
a Hadoop cluster for the app to run on. Another tool to run the crawling
steps (fetch, parse, index) and provide some form of monitoring would be
great. I hear that this is somehow present in Nutch 2, but I was more
interested in Nutch 1 (since v2 is not production ready yet, is it?). I was
wondering if there are any community recipes for Chef/Puppet/Ansible/Salt
or some work using Kubernetes or Mesos. If anyone has experience with this
and could give me some pointers, I would greatly appreciate it.

4 - At the moment we collect some websites from which we extract some
metadata, but we don't need to make the results available in a search server
like Solr or Elasticsearch. Is there any queue- or streaming-based plugin
for Nutch, so that 'indexing' can be regarded as sending to a queue? I know
that Nutch 2 has Gora as an abstraction layer, so maybe this could be a
Gora plugin, but I'm mainly interested in something for Nutch 1 (or else
good reasons for moving to Nutch 2).

All the best,
Thiago Galery

Re: Best Practices for Plugin Dev and Deployment

Posted by Thiago Galery <tg...@gmail.com>.

Thanks for the pointers, Chris.


Re: Best Practices for Plugin Dev and Deployment

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.

Hi Thiago,

Sorry for the top post:

1. Yes, you could use conf/models or an HDFS URL; either one works.
The conf directory is packaged up when you create a *.job file
for Hadoop by running ant job. That said, if your job jar includes
100 MB to 1 GB of model files, that's how big your *.job will be. A better
way would probably be to pre-stage the models on HDFS and reference
them via an HDFS URL.

2. Yes, Memex Explorer is on hiatus right now. It was a proof of
feasibility that we used in the DARPA MEMEX program, and it already
takes care of a lot of the stuff you were talking about with Salt
and, e.g., Docker/Vagrant and Nutch. That's why I pointed you there:
it's certainly something to build on rather than reinvent.
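
To make the two options in point 1 a little more concrete, here is a minimal
sketch of how a plugin might tell them apart. Everything in it is an
assumption made for illustration: the class ModelLocationSketch, its method
names, the /opt/nutch install directory and the namenode address do not come
from Nutch. It only classifies the configured location; actually reading an
hdfs:// location would go through Hadoop's FileSystem API, which is left out
so the example has no dependencies.

```java
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ModelLocationSketch {

    enum Kind { HDFS, LOCAL }

    // Classify a configured location by its URI scheme (hypothetical helper).
    // Only plain paths and hdfs:// URLs are handled in this sketch.
    static Kind kindOf(String location) {
        String scheme = URI.create(location).getScheme();
        return "hdfs".equalsIgnoreCase(scheme) ? Kind.HDFS : Kind.LOCAL;
    }

    // Local mode: resolve a relative location such as "conf/models" against
    // the Nutch install directory. Absolute paths are returned unchanged.
    static Path resolveLocal(Path nutchHome, String location) {
        return nutchHome.resolve(location).normalize();
    }

    public static void main(String[] args) {
        // Both values are made-up examples of what the property might hold.
        String[] examples = { "conf/models", "hdfs://namenode:8020/nutch/models" };
        for (String location : examples) {
            if (kindOf(location) == Kind.HDFS) {
                System.out.println(location + " -> read via the Hadoop FileSystem API");
            } else {
                Path resolved = resolveLocal(Paths.get("/opt/nutch"), location);
                System.out.println(location + " -> " + resolved);
            }
        }
    }
}
```

Keeping the decision in one place means the same property can hold a
conf-relative path during local development and an HDFS URL on the cluster.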

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Re: Best Practices for Plugin Dev and Deployment

Posted by Thiago Galery <tg...@gmail.com>.

Hi Chris, thanks for the response. Here are some elaborations of my initial
questions on the basis of your reply.

On Wed, Apr 6, 2016 at 2:12 PM, Mattmann, Chris A (3980) <
chris.a.mattmann@jpl.nasa.gov> wrote:

>
> Use a Nutch property defined in either $NUTCH/conf/nutch-{default|site}.xml,
> then read the property in your plugin via
> NutchConfiguration.create().get("name").
>
> If the property references a model file, add a property that gives the file
> path (as a relative path), and then read the property, assuming that your
> Nutch *.job or jar code (depending on whether you are running on Hadoop or
> locally) has access to $NUTCH/conf.
>
>


Could you elaborate on this a bit more? At the moment I'm specifying the
full path of the models. This works well in local mode, but might cause
problems when running on a Hadoop cluster. I understand that the path
should be specified relatively, but I'm not sure relative to what. That is,
if the job file has access to the conf folder, should I put the models
inside conf and just add the property models.folder = conf/models? I imagine
that another option is to use an HDFS URL for the models location. Would
that work?



>
> We have been working on a project called Memex Explorer:
> http://github.com/memex-explorer/memex-explorer
>


Memex Explorer seems to be really interesting! However, I had some
issues (tests not passing, Redis not running, some screens unavailable).
On the GitHub page, it says that the project is not maintained. I'd be
happy to fix bugs and contribute, but if the project is just going to be
ditched, then I'd be less inclined to do so.
Does anyone know what the plans for Memex are?



Re: Best Practices for Plugin Dev and Deployment

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.

Hi Thiago,

Welcome! 

First thing to check out:

http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer


I would follow that by checking out info on how to use our
Source Code repo:

http://wiki.apache.org/nutch/UsingGit


OK now on to your specific questions:




On 4/6/16, 8:48 AM, "Thiago Galery" <tg...@gmail.com> wrote:

>Dear list,
>I'm a new Nutch Developer and I have a few questions to ask you.
>
>1 - Are there any general guidelines for plugin development (in addition to
>the ones specified in the wiki guide)?
>I looked around GitHub and it seems that many plugins are developed as a
>monolithic piece of code that is attached to / forked from the main Nutch
>repo. I take it that, ideally, plugins should be developed as their own
>separate repositories, so they can be versioned and tested against
>different versions of Nutch. Is there a recommended way to do this? I'm
>considering using git submodules to add plugin repos as Nutch dependencies,
>or else creating symlinks from the plugins folder to the right plugin
>repositories.

I would recommend that plugin development be done against the master branch
of Nutch, a cloned copy of which you can find here:

http://github.com/apache/nutch/tree/master

You can follow this process to submit pull requests to add plugins:

http://github.com/apache/nutch/#contributing

>
>2 - As a specific use case for point (1), I have developed a plugin that
>reads some Machine Learning models from a directory. Ideally, I'd like to
>keep the files in the same repository as the plugin, set up so that
>it can be tested, versioned and developed as an independent repo.

Use a Nutch property defined in either $NUTCH/conf/nutch-{default|site}.xml,
then read the property in your plugin via NutchConfiguration.create().get("name").

If the property references a model file, add a property that gives the file
path (as a relative path), and then read the property, assuming that your
Nutch *.job or jar code (depending on whether you are running on Hadoop or
locally) has access to $NUTCH/conf.
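
As a rough illustration of the advice above, the sketch below shows what
such a property looks like in the <configuration>/<property> layout used by
nutch-site.xml, and how a lookup by name behaves. It is a stand-in, not the
Nutch API: the class SitePropertySketch and its get() helper are hypothetical
and parse the XML with the JDK only so the example runs without Hadoop on the
classpath. In a real plugin the lookup would be the
NutchConfiguration.create().get(...) call mentioned above. The property name
models.folder and the value conf/models are the ones floated elsewhere in
this thread, not names that Nutch defines.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class SitePropertySketch {

    // Example content in the same layout as nutch-site.xml.
    // The property name and value are illustrative only.
    static final String SITE_XML =
          "<configuration>\n"
        + "  <property>\n"
        + "    <name>models.folder</name>\n"
        + "    <value>conf/models</value>\n"
        + "  </property>\n"
        + "</configuration>\n";

    // Stand-in for a configuration lookup: returns the value of the named
    // property, or null when no such property is defined.
    static String get(String xml, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element prop = (Element) props.item(i);
                String propName =
                    prop.getElementsByTagName("name").item(0).getTextContent().trim();
                if (propName.equals(name)) {
                    return prop.getElementsByTagName("value").item(0).getTextContent().trim();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse configuration XML", e);
        }
    }

    public static void main(String[] args) {
        // Look up the configured model folder by property name.
        System.out.println(get(SITE_XML, "models.folder"));
    }
}
```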

>At the moment, I can only make it work by specifying the path to these
>models in nutch-site.xml, but I wonder whether that directory could be
>made accessible to the plugin in some other way (either by some classes in
>the Plugin system or by ivy/ant). Any thoughts?

See above.

>
>3 - Is there any tooling developed by the community to deploy and monitor
>Nutch applications? At the moment, we have a script that deploys Nutch, but
>it is not robust enough. I see that there's a Dockerfile. I'm just wondering
>if it could be used (possibly together with some other tooling) to provision
>a Hadoop cluster for the app to run on. Another tool to run the crawling
>steps (fetch, parse, index) and provide some form of monitoring would be
>great.

We have been working on a project called Memex Explorer:
http://github.com/memex-explorer/memex-explorer

that provides these types of capabilities. Have a look.

>I hear that this is somehow present in Nutch 2, but I was more
>interested in Nutch 1 (since v2 is not production ready yet, is it?). I was
>wondering if there are any community recipes for Chef/Puppet/Ansible/Salt
>or some work using Kubernetes or Mesos. If anyone has experience with this
>and could give me some pointers, I would greatly appreciate it.

FYI above.

>
>4 - At the moment we collect some websites from which we extract some
>metadata, but we don't need to make the results available in a search server
>like Solr or Elasticsearch. Is there any queue- or streaming-based plugin
>for Nutch, so that 'indexing' can be regarded as sending to a queue? I know
>that Nutch 2 has Gora as an abstraction layer, so maybe this could be a
>Gora plugin, but I'm mainly interested in something for Nutch 1 (or else
>good reasons for moving to Nutch 2).

Lots of people are interested in this, and there is Storm Crawler,
which sort of does this and involves some of the Nutch PMC and
committers.

Within Nutch there is also work done by my USC master's student,
Nutch PMC member and committer Sujen Shah, who added a publisher
using ActiveMQ Artemis that publishes Nutch events so we can display
what's up in D3 and JSON. You can see the work here; I intend to commit
it soon:

https://issues.apache.org/jira/browse/NUTCH-2132
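
For readers wondering what "indexing as sending to a queue" could look like
in code, here is a small sketch of the general idea. It is not the publisher
from NUTCH-2132 and not a Nutch interface: DocSink, QueueSink and toJson are
hypothetical names, an in-memory BlockingQueue stands in for a real message
broker, and the hand-rolled JSON only escapes quotes and backslashes (a real
implementation would use a JSON library).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class QueueIndexingSketch {

    // Hypothetical sink abstraction: "indexing" a document means handing
    // its URL and extracted metadata to some destination.
    interface DocSink {
        void write(String url, Map<String, String> metadata);
    }

    // A sink that serializes each document to JSON and enqueues it.
    static final class QueueSink implements DocSink {
        private final BlockingQueue<String> queue;

        QueueSink(BlockingQueue<String> queue) {
            this.queue = queue;
        }

        @Override
        public void write(String url, Map<String, String> metadata) {
            queue.add(toJson(url, metadata));
        }
    }

    // Minimal JSON encoding, sufficient only for this sketch.
    static String toJson(String url, Map<String, String> metadata) {
        StringBuilder sb = new StringBuilder("{\"url\":\"").append(escape(url)).append('"');
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            sb.append(",\"").append(escape(e.getKey())).append("\":\"")
              .append(escape(e.getValue())).append('"');
        }
        return sb.append('}').toString();
    }

    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        DocSink sink = new QueueSink(queue);

        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("title", "Example");
        sink.write("http://example.org/", metadata);

        // A consumer elsewhere would take messages off the queue.
        System.out.println(queue.poll());
    }
}
```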


Cheers,
Chris
