Posted to user@nutch.apache.org by Chris Mielke <cm...@marinsoftware.com> on 2014/06/18 23:54:34 UTC

Elasticsearch & customized indices

Hey all,

I'm pretty new to Nutch and to integrating it with Elasticsearch, and I've
finally managed to get it working. Ideally, I'd like to have a separate
Elasticsearch index for each site that is crawled, or a separate
Elasticsearch index type for each site.

For example:
Site abc.com ends up in the index "abc" in Elasticsearch
Site xyz.com ends up in the index "xyz" in Elasticsearch

Is there a way to do this?

Thanks!

..Chris

Chris Mielke
Web Developer

Re: Elasticsearch & customized indices

Posted by Chris Mielke <cm...@marinsoftware.com>.
Thanks Jake! I looked at hacking it, but wasn't exactly sure what to change.

I added the following to the write() method and it's now indexing exactly
how I want it to.

defaultIndex = (String) doc.getFieldValue("host");
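
A slightly more defensive variant of the same idea, as a minimal sketch under stated assumptions. The helper below is hypothetical and is not part of Nutch or the indexer-elastic plugin. It assumes that the value returned by doc.getFieldValue("host") may be null or may not be a String, and it picks the index per document instead of overwriting the configured default.

```java
// Hypothetical helper for illustration; not part of Nutch or indexer-elastic.
final class PerDocumentIndex {

  private PerDocumentIndex() {}

  /**
   * Chooses the index for a single document.
   *
   * @param hostFieldValue stand-in for doc.getFieldValue("host"); may be null
   * @param defaultIndex   the index name configured for the writer
   * @return the trimmed host if it is a non-blank String, otherwise defaultIndex
   */
  static String resolve(Object hostFieldValue, String defaultIndex) {
    if (hostFieldValue instanceof String) {
      String host = ((String) hostFieldValue).trim();
      if (!host.isEmpty()) {
        return host;
      }
    }
    // Missing, empty, or non-String host: fall back to the configured index.
    return defaultIndex;
  }
}
```

Inside write(), the result would presumably be kept in a local variable and passed to client.prepareIndex(...) in place of defaultIndex, so the field keeps its configured value for documents that have no host.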

Thanks again!

..Chris

Chris Mielke
Web Developer



Re: Elasticsearch & customized indices

Posted by Jake Dodd <ja...@ontopic.io>.
Hi Chris,

You should be able to do this quick-and-dirty with a relatively simple modification to Nutch’s integrated Elasticsearch indexer plugin (called indexer-elastic). Within the org.apache.nutch.indexwriter.elastic.ElasticIndexWriter.write() method, try changing the index name (specifically the line IndexRequestBuilder request = client.prepareIndex(defaultIndex, type, id);) from defaultIndex to the domain name of the document that you’re indexing.
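
One way the domain-to-index mapping from the original question (abc.com to "abc") could look, as a rough sketch rather than anything Nutch provides. The class and method names are hypothetical. The approach is deliberately naive: it keeps only the first label after an optional "www." prefix, and it assumes that Elasticsearch wants lowercase index names that do not start with '-' or '_'.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not part of Nutch or indexer-elastic.
final class HostIndexName {

  private HostIndexName() {}

  /**
   * Maps a host such as "www.abc.com" to a short index name such as "abc".
   * Naive by design: "news.example.co.uk" becomes "news".
   *
   * @param host     the document's host; may be null
   * @param fallback index name to use when no usable name can be derived
   */
  static String fromHost(String host, String fallback) {
    if (host == null) {
      return fallback;
    }
    String name = host.trim().toLowerCase(Locale.ROOT);
    if (name.startsWith("www.")) {
      name = name.substring("www.".length());
    }
    // Keep only the first label, i.e. everything before the first dot.
    int dot = name.indexOf('.');
    if (dot >= 0) {
      name = name.substring(0, dot);
    }
    // Drop every character other than a-z, 0-9, '-' and '_'.
    name = name.replaceAll("[^a-z0-9_-]", "");
    // Strip leading '-' and '_' so the name does not start with them.
    name = name.replaceAll("^[-_]+", "");
    return name.isEmpty() ? fallback : name;
  }
}
```

With something like this, the first argument to client.prepareIndex(...) would be HostIndexName.fromHost(host, defaultIndex) rather than defaultIndex. Sites that share a first label (abc.com and abc.org) would land in the same index, so a real setup may prefer the full host.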

And to answer Markus’s question, I think that the ElasticIndexWriter opens a single ES client connection, so you shouldn’t have to worry about a separate connection for each host. But maybe somebody with more know-how can give you a more affirmative answer.

Cheers

Jake
