Posted to user@nutch.apache.org by co...@complexityintelligence.com on 2012/02/07 15:12:59 UTC

Dump into Cassandra using Nutch 1.x

Hello,

   We're developing an infrastructure that uses Nutch, but when flushing
data into Solr, we also want to store that data in our Cassandra cluster
for further Hadoop-based analysis.

   Now, I know Nutch 2.x has Cassandra support as a storage backend, and
this looks great, but we're using Nutch 1.4, and our target is slightly
different.

   How can we achieve that? By writing a plug-in that, for each
SolrIndexing action, stores the same fields in Cassandra? This is, in my
mind, the easiest approach, but which extension point should we use? How
do we catch all the fields?

   Or, I thought about adding a new command, like 'dumpCassandra', that
acts like SolrIndexing but, of course, pushes data into Cassandra. I
think this is not feasible by means of a plug-in, and that I would have
to change some internal code, even if I can start from SolrIndexing and
use Cassandra storage as output.

   Any ideas?

Alessio


Re: Dump into Cassandra using Nutch 1.x

Posted by Julien Nioche <li...@gmail.com>.
Hi Alessio

We are planning to add pluggable backends (
https://issues.apache.org/jira/browse/NUTCH-1047) in the short term. What
you are describing would fit within that.

The easiest way to get started on this would be to piggyback on the
Solr indexing commands, e.g. have an export command that would call the
indexing filters and then call plugins implementing a new endpoint (export
plugins?). There could be one such plugin for Cassandra, or maybe one using
GORA to leverage the backends it already supports. We could then move the
Solr-based commands back into that framework.
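
To make the shape of that idea concrete, here is a minimal sketch in plain
Java. It is not Nutch code: ExportBackend, InMemoryBackend and Exporter are
hypothetical names invented for illustration, and the in-memory map only
stands in for a real Cassandra or GORA writer. It assumes the export command
has already run the indexing filters and holds each document as a URL plus a
map of field names to values.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ExportSketch {

    /** Hypothetical endpoint; NOT an existing Nutch extension point. */
    interface ExportBackend {
        void write(String url, Map<String, List<String>> fields);
        void close();
    }

    /** Stand-in for a Cassandra (or GORA) writer: keeps rows in memory. */
    static class InMemoryBackend implements ExportBackend {
        final Map<String, Map<String, List<String>>> rows = new LinkedHashMap<>();
        boolean closed = false;

        @Override
        public void write(String url, Map<String, List<String>> fields) {
            // Copy the fields so later changes by the caller do not alter the stored row.
            Map<String, List<String>> copy = new LinkedHashMap<>();
            for (Map.Entry<String, List<String>> e : fields.entrySet()) {
                copy.put(e.getKey(), new ArrayList<>(e.getValue()));
            }
            rows.put(url, copy);
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Fans each document out to every registered backend. */
    static class Exporter {
        private final List<ExportBackend> backends;

        Exporter(List<ExportBackend> backends) {
            this.backends = new ArrayList<>(backends);
        }

        void export(String url, Map<String, List<String>> fields) {
            for (ExportBackend backend : backends) {
                backend.write(url, fields);
            }
        }

        void close() {
            for (ExportBackend backend : backends) {
                backend.close();
            }
        }
    }

    public static void main(String[] args) {
        // Two backends standing in for "Solr" and "Cassandra" receiving the same fields.
        InMemoryBackend searchIndex = new InMemoryBackend();
        InMemoryBackend analyticsStore = new InMemoryBackend();
        List<ExportBackend> backends = new ArrayList<>();
        backends.add(searchIndex);
        backends.add(analyticsStore);

        Exporter exporter = new Exporter(backends);
        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("title", new ArrayList<>(List.of("Example page")));
        fields.put("anchor", new ArrayList<>(List.of("home", "index")));
        exporter.export("http://example.com/", fields);
        exporter.close();

        System.out.println(analyticsStore.rows);
    }
}
```

With this shape, the Solr writer would become just one more backend next to
the Cassandra one, which is roughly what moving the Solr commands back into
the framework would amount to.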

Feel free to comment on the JIRA issue above.

Thanks

Julien






-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble