Posted to user@nutch.apache.org by og...@yahoo.com on 2007/06/26 14:42:31 UTC

Re: [Nutch-general] Integrate nutch crawler with Solr index server

Is this actually planned (addition of SolrIndexer to Nutch)?
A search for SolrIndexer in JIRA got no hits.

Otis

----- Original Message ----
From: Brian Whitman <br...@variogr.am>
To: nutch-user@lucene.apache.org
Sent: Saturday, June 23, 2007 4:13:02 PM
Subject: Re: [Nutch-general] Integrate nutch crawler with Solr index server

On Jun 23, 2007, at 8:37 AM, David Xiao wrote:
> As the title says, I am having some difficulty integrating them. I
> tried to follow the instructions at
> http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html
> but I don't really understand the Java code in it, and the article
> doesn't go into detail about configuring Solr. I have downloaded
> solr-client.zip, but what do I do with Nutch?

It's my understanding that the code Sami posted will no longer work  
with recent versions of Solr / solrj.

However, the Solr client (SOLR-20) was recently added to trunk:
http://issues.apache.org/jira/browse/SOLR-20#action_12505314
I sent Sami a patch for his posted code, and hopefully we'll see
SolrIndexer get into Nutch trunk sometime soon.

As for configuring Solr, that post does a good job of explaining it.
There's not much to it: just use the schema he posted and start Solr
normally.

_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general

Re: [Nutch-general] Integrate nutch crawler with Solr index server

Posted by Andrzej Bialecki <ab...@getopt.org>.

Brian Whitman wrote:

> The Solr installations I know with many millions of docs don't have 
> hundreds of KB of text per doc. The "special" thing I'm doing is storing 
> the parse text from the nutch crawls (and other sources), which we need 
> for various reasons. We have an extraordinary amount of unique tokens, 
> which turns Solr/Lucene into a disk seek speed test. Full text search is 

This thread is already slightly off-topic ... but regarding the number
of unique terms: when I'm faced with an explosion of unique terms due to
the nature of the data or the tokenization method, I use, where possible,
one or both of the following methods: splitting and combining. An example
of splitting is dates: if you split year, month and day into separate
fields, then even if you have to store many unique dates, the total
number of unique terms in these fields will be smaller than if the dates
(at the same resolution) were stored in a single field. The other method,
combining, is already in use in Nutch and is implemented in CommonGrams.
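
To put rough numbers on the splitting idea, here is a small sketch. It is
plain Python and does not touch Lucene or Nutch; the function names and
the ten-year date range are assumptions made up for illustration. It only
counts how many distinct terms each layout would produce.

```python
from datetime import date, timedelta


def unique_terms_single_field(dates):
    """One field holding the full date: every distinct date is its own term."""
    return len({d.isoformat() for d in dates})


def unique_terms_split_fields(dates):
    """Separate year, month and day fields: count distinct terms per field."""
    years = {d.year for d in dates}
    months = {d.month for d in dates}
    days = {d.day for d in dates}
    return len(years) + len(months) + len(days)


# Every day from 2000-01-01 through 2009-12-31 (an arbitrary example range).
start = date(2000, 1, 1)
end = date(2009, 12, 31)
dates = [start + timedelta(days=i) for i in range((end - start).days + 1)]

single = unique_terms_single_field(dates)
split = unique_terms_split_fields(dates)
print("single field: %d unique terms" % single)
print("split fields: %d unique terms" % split)
```

The price of the split layout is that matching one exact date becomes a
conjunction over three fields instead of a single term lookup.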

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: [Nutch-general] Integrate nutch crawler with Solr index server

Posted by Brian Whitman <br...@variogr.am>.

On Jun 26, 2007, at 3:22 PM, rubdabadub wrote:
>>
>> I currently use Sami's SolrIndexer with the trunk solrj, and we have
>> a single Solr index of about 5m pages on a single 4GB machine, with
>> stored content. Although the indexing is fast and stable, complicated
>> full text queries are too slow for comfort (forget about MLT/faceting
>> etc.) We are currently looking into ways of partitioning this and we
>> may be of service in the future here.
>
> Brian, just wondering: wouldn't searching be more of a Solr issue?
> I know some Solr sites have more than 5m docs, no? Are you
> doing something special? I am very curious to know. We are
> looking into implementing Solr in production, and so far so good.
> However, we are only dealing with 10 fields and 3 million Lucene docs.
>

The Solr installations I know with many millions of docs don't have
hundreds of KB of text per doc. The "special" thing I'm doing is
storing the parse text from the Nutch crawls (and other sources),
which we need for various reasons. We have an extraordinary amount of
unique tokens, which turns Solr/Lucene into a disk seek speed test.
Full text search is certainly possible, even with stored content, but
I have seen query performance drop off since we crossed the 2-3m
document mark: QTime (the milliseconds Solr takes to process and return
a query) is currently ~200-1000ms or so for uncached single-term
queries on a very nice server with lots of heap. That is not tenable
for a real-time use case (but we don't use it in this manner).

Re: [Nutch-general] Integrate nutch crawler with Solr index server

Posted by rubdabadub <ru...@gmail.com>.

On 6/26/07, Brian Whitman <br...@variogr.am> wrote:
>
> On Jun 26, 2007, at 10:46 AM, Doğacan Güney wrote:
> >>
> >> I think that the distributed online Index part should be done
> >> outside of
> >> Nutch (or if done here do it with extreme caution:) so it does not
> >> get
> >> tied to Nutch.
> >
> > I am not sure I understand you here. If I have 10 machines I am using
> > for serving indexes (I am assuming I have a Solr instance running on
> > each one), IndexerSolr should be able to partition my index to 10
> > machines.
> >
>
> It may be that Solr handles this with a master server to send to
> distributed Solr indexes.
>
> I currently use Sami's SolrIndexer with the trunk solrj, and we have
> a single Solr index of about 5m pages on a single 4GB machine, with
> stored content. Although the indexing is fast and stable, complicated
> full text queries are too slow for comfort (forget about MLT/faceting
> etc.) We are currently looking into ways of partitioning this and we
> may be of service in the future here.

Brian, just wondering: wouldn't searching be more of a Solr issue?
I know some Solr sites have more than 5m docs, no? Are you
doing something special? I am very curious to know. We are
looking into implementing Solr in production, and so far so good.
However, we are only dealing with 10 fields and 3 million Lucene docs.

@Sami: I am looking forward to NUTCH-442 :-) Cool! I should add that I am
a regular search engine operator, so simplifying things makes a lot
of sense to me.

Regards
Rajesh

Re: [Nutch-general] Integrate nutch crawler with Solr index server

Posted by Brian Whitman <br...@variogr.am>.

On Jun 26, 2007, at 10:46 AM, Doğacan Güney wrote:
>>
>> I think that the distributed online Index part should be done  
>> outside of
>> Nutch (or if done here do it with extreme caution:) so it does not  
>> get
>> tied to Nutch.
>
> I am not sure I understand you here. If I have 10 machines I am using
> for serving indexes (I am assuming I have a Solr instance running on
> each one), IndexerSolr should be able to partition my index to 10
> machines.
>

It may be that Solr handles this with a master server to send to  
distributed Solr indexes.

I currently use Sami's SolrIndexer with the trunk solrj, and we have  
a single Solr index of about 5m pages on a single 4GB machine, with  
stored content. Although the indexing is fast and stable, complicated  
full text queries are too slow for comfort (forget about MLT/faceting  
etc.) We are currently looking into ways of partitioning this and we  
may be of service in the future here.

Re: [Nutch-general] Integrate nutch crawler with Solr index server

Posted by Sami Siren <ss...@gmail.com>.

>> I think that the distributed online Index part should be done outside of
>> Nutch (or if done here do it with extreme caution:) so it does not get
>> tied to Nutch.
> 
> I am not sure I understand you here. If I have 10 machines I am using
> for serving indexes (I am assuming I have a Solr instance running on
> each one), IndexerSolr should be able to partition my index to 10
> machines.

There are more dimensions to distribution (or scaling), and the case you
describe is a very basic one. Of course we could support such special
setups inside Nutch too, but keep in mind that once it starts to look
like a "thing" that can manage large online indexes, it would probably
do the most good if it were not tied to Nutch.

-- 
 Sami Siren

Re: [Nutch-general] Integrate nutch crawler with Solr index server

Posted by Sami Siren <ss...@gmail.com>.

Doğacan Güney wrote:
>>
>> I actually think that the endless adding of configuration options does
>> not do any good to anyone, we should instead start to write reusable
>> pieces of code and/or bring the number of different options down
>> (<imo>The massive number of already available configuration/runtime
>> options and the fact that most of nutch is not designed to be extended
>> by coding is harmful for advanced users. On the other hand, I think that
>> things are already too complicated for novice users</imo>)
> 
> OK, adding new configuration options all the time is probably not a
> great idea. But I strongly believe that indexing to different targets
> should be done in Indexer.OutputFormat (OutputFormat outputs to
> different targets, makes sense to me :). For example, I would love the
> ability to index to solr but I would also need to store the original
> lucene index in DFS (so that if solr machine dies, I don't lose my
> index). I shouldn't have to run Indexer twice to achieve this.

In one application I added an extension point for different indexing
backends; that way, by implementing a composite index backend, you could
achieve the same thing.
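
A composite backend of that kind might look roughly like the sketch
below. It is written in Python rather than Java to keep it short, and
every class and method name in it is hypothetical: it is not the
extension point from that application and not an existing Nutch or Solr
API. The in-memory backend merely stands in for real targets such as a
Solr server or a Lucene index in DFS.

```python
class IndexBackend:
    """Hypothetical backend interface: receive documents, then finish."""

    def write(self, doc):
        raise NotImplementedError

    def close(self):
        raise NotImplementedError


class InMemoryBackend(IndexBackend):
    """Stand-in for a real target; it only records what it was given."""

    def __init__(self, name):
        self.name = name
        self.docs = []
        self.closed = False

    def write(self, doc):
        self.docs.append(doc)

    def close(self):
        self.closed = True


class CompositeBackend(IndexBackend):
    """Forwards every call to each wrapped backend, in order."""

    def __init__(self, backends):
        self.backends = list(backends)

    def write(self, doc):
        for backend in self.backends:
            backend.write(doc)

    def close(self):
        for backend in self.backends:
            backend.close()


# One indexing pass feeds both targets.
solr = InMemoryBackend("solr")
lucene = InMemoryBackend("lucene-in-dfs")
composite = CompositeBackend([solr, lucene])
composite.write({"url": "http://example.com/", "title": "Example"})
composite.close()
print(len(solr.docs), len(lucene.docs))
```

The indexing job then talks to a single backend object and never needs to
know how many targets sit behind it. This sketch does not address what
should happen when one target fails partway through a run.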

The code shown in the blog post was written mainly with simplicity in
mind; the other motivation was to do it without touching the Nutch source code.

--
 Sami Siren

Re: [Nutch-general] Integrate nutch crawler with Solr index server

Posted by Doğacan Güney <do...@gmail.com>.

On 6/26/07, Sami Siren <ss...@gmail.com> wrote:
> Doğacan Güney wrote:
> > Hi,
> >
> > On 6/26/07, ogjunk-nutch@yahoo.com <og...@yahoo.com> wrote:
> >> Is this actually planned (addition of SolrIndexer to Nutch)?
> >> A search for SolrIndexer in JIRA got no hits.
> >
> > There is NUTCH-442 (one of the most popular issues). But, after Sami's
> > work, there have been no further developments.
> >
> > I think Sami Siren's original patch no longer works with Solr, I am
> > not sure if it still applies to nutch. So, if anyone wants to tackle
> > this, here are a couple of items off the top of my head:
>
> It still applies to nutch (actually there were just two additional
> classes) and works with the original client (don't know if it's still
> available).
>
> I am currently working on something around solr-nutch integration and
> hoping that I can give out something within the next few weeks.

Excellent, nice to see you working on this :)

>
> >
> > 1) Bring Sami's patch up-to-date (both with solr and with nutch). I
> > think a separate Indexer job is unnecessary; we should just change
> > Indexer.OutputFormat to check for a parameter, and if it's true,
> > OutputFormat should also send documents to Solr (besides writing it to
> > lucene index in DFS).
>
> I actually think that the endless adding of configuration options does
> not do any good to anyone, we should instead start to write reusable
> pieces of code and/or bring the number of different options down
> (<imo>The massive number of already available configuration/runtime
> options and the fact that most of nutch is not designed to be extended
> by coding is harmful for advanced users. On the other hand, I think that
> things are already too complicated for novice users</imo>)

OK, adding new configuration options all the time is probably not a
great idea. But I strongly believe that indexing to different targets
should be done in Indexer.OutputFormat (OutputFormat outputs to
different targets, makes sense to me :). For example, I would love the
ability to index to solr but I would also need to store the original
lucene index in DFS (so that if solr machine dies, I don't lose my
index). I shouldn't have to run Indexer twice to achieve this.

>
> > 2) Make it work in distributed setups (i.e. with more than 1 index
> > server). Sami Siren also makes a note of this, but I don't believe
> > that a simple hash-the-url approach is appropriate for nutch. It would
> > be nice to guarantee that a url always goes to the same indexing
> > server, even if we add or remove index servers (if we just take the
> > hash of url, then adding a new machine would cause pretty much all
> > urls to be distributed to different servers).
>
> I think that the distributed online Index part should be done outside of
> Nutch (or if done here do it with extreme caution:) so it does not get
> tied to Nutch.

I am not sure I understand you here. If I have 10 machines I am using
for serving indexes (I am assuming I have a Solr instance running on
each one), IndexerSolr should be able to partition my index to 10
machines.

>
> --
>  Sami Siren
>


-- 
Doğacan Güney

Re: [Nutch-general] Integrate nutch crawler with Solr index server

Posted by Sami Siren <ss...@gmail.com>.

Doğacan Güney wrote:
> Hi,
> 
> On 6/26/07, ogjunk-nutch@yahoo.com <og...@yahoo.com> wrote:
>> Is this actually planned (addition of SolrIndexer to Nutch)?
>> A search for SolrIndexer in JIRA got no hits.
> 
> There is NUTCH-442 (one of the most popular issues). But, after Sami's
> work, there have been no further developments.
> 
> I think Sami Siren's original patch no longer works with Solr, I am
> not sure if it still applies to nutch. So, if anyone wants to tackle
> this, here are a couple of items off the top of my head:

It still applies to nutch (actually there were just two additional
classes) and works with the original client (don't know if it's still
available).

I am currently working on something around solr-nutch integration and
hoping that I can give out something within the next few weeks.

> 
> 1) Bring Sami's patch up-to-date (both with solr and with nutch). I
> think a separate Indexer job is unnecessary; we should just change
> Indexer.OutputFormat to check for a parameter, and if it's true,
> OutputFormat should also send documents to Solr (besides writing it to
> lucene index in DFS).

I actually think that the endless adding of configuration options does
not do any good to anyone, we should instead start to write reusable
pieces of code and/or bring the number of different options down
(<imo>The massive number of already available configuration/runtime
options and the fact that most of nutch is not designed to be extended
by coding is harmful for advanced users. On the other hand, I think that
things are already too complicated for novice users</imo>)

> 2) Make it work in distributed setups (i.e. with more than 1 index
> server). Sami Siren also makes a note of this, but I don't believe
> that a simple hash-the-url approach is appropriate for nutch. It would
> be nice to guarantee that a url always goes to the same indexing
> server, even if we add or remove index servers (if we just take the
> hash of url, then adding a new machine would cause pretty much all
> urls to be distributed to different servers).

I think that the distributed online Index part should be done outside of
Nutch (or if done here do it with extreme caution:) so it does not get
tied to Nutch.

-- 
 Sami Siren

Re: [Nutch-general] Integrate nutch crawler with Solr index server

Posted by Doğacan Güney <do...@gmail.com>.

Hi,

On 6/26/07, ogjunk-nutch@yahoo.com <og...@yahoo.com> wrote:
> Is this actually planned (addition of SolrIndexer to Nutch)?
> A search for SolrIndexer in JIRA got no hits.

There is NUTCH-442 (one of the most popular issues). But, after Sami's
work, there have been no further developments.

I think Sami Siren's original patch no longer works with Solr, I am
not sure if it still applies to nutch. So, if anyone wants to tackle
this, here are a couple of items off the top of my head:

1) Bring Sami's patch up-to-date (both with solr and with nutch). I
think a separate Indexer job is unnecessary; we should just change
Indexer.OutputFormat to check for a parameter, and if it's true,
OutputFormat should also send documents to Solr (besides writing it to
lucene index in DFS).

2) Make it work in distributed setups (i.e. with more than 1 index
server). Sami Siren also makes a note of this, but I don't believe
that a simple hash-the-url approach is appropriate for nutch. It would
be nice to guarantee that a url always goes to the same indexing
server, even if we add or remove index servers (if we just take the
hash of url, then adding a new machine would cause pretty much all
urls to be distributed to different servers).

3) We need to code a SolrSearcher similar to o.a.n.s.IndexSearcher (so
that Solr is a drop-in replacement for IndexSearcher). This class
should handle stuff like generating summaries, etc. This one is easy
(if a bit boring:).
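
On item 2, one well-known way to limit how many URLs change servers is
consistent hashing. The sketch below is only an illustration of that
technique, not something proposed in the patch or in this thread: it is
plain Python, the server names and the replica count are made up, and it
does not talk to Solr. Note that it does not keep every URL in place when
the server set changes; it only keeps the moved fraction small.

```python
import bisect
import hashlib


def stable_hash(key):
    """Hash a string to an integer that is stable across runs and machines."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def modulo_server(url, servers):
    """The simple hash-the-url approach: hash modulo the server count."""
    return servers[stable_hash(url) % len(servers)]


class HashRing:
    """Consistent hashing: each server owns arcs on a ring of hash values."""

    def __init__(self, servers, replicas=100):
        self.replicas = replicas  # virtual points per server
        self.points = []          # sorted list of (hash, server)
        for server in servers:
            self.add(server)

    def add(self, server):
        for i in range(self.replicas):
            point = (stable_hash("%s#%d" % (server, i)), server)
            bisect.insort(self.points, point)

    def remove(self, server):
        self.points = [p for p in self.points if p[1] != server]

    def server_for(self, url):
        # First point at or after the URL's hash, wrapping around the ring.
        i = bisect.bisect_left(self.points, (stable_hash(url), ""))
        return self.points[i % len(self.points)][1]


urls = ["http://example.com/page%d" % i for i in range(5000)]
old_servers = ["index%02d" % i for i in range(10)]  # hypothetical names
new_servers = old_servers + ["index10"]

moved_modulo = sum(
    1 for u in urls
    if modulo_server(u, old_servers) != modulo_server(u, new_servers)
)

ring = HashRing(old_servers)
before = {u: ring.server_for(u) for u in urls}
ring.add("index10")
after = {u: ring.server_for(u) for u in urls}
moved_ring = [u for u in urls if before[u] != after[u]]

print("modulo: %d of %d URLs moved" % (moved_modulo, len(urls)))
print("ring:   %d of %d URLs moved" % (len(moved_ring), len(urls)))
```

Going from 10 to 11 servers, the modulo scheme is expected to move about
10/11 of the URLs, while the ring is expected to move about 1/11 of them,
all of them to the new server. Removing that server again sends exactly
those URLs back to where they were.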

If anyone is interested, I would be glad to help him/her with the
nutch side of things. I also would like to work on it, but I don't
have time right now.

>
> Otis
>
>
> ----- Original Message ----
> From: Brian Whitman <br...@variogr.am>
> To: nutch-user@lucene.apache.org
> Sent: Saturday, June 23, 2007 4:13:02 PM
> Subject: Re: [Nutch-general] Integrate nutch crawler with Solr index server
>
>
> On Jun 23, 2007, at 8:37 AM, David Xiao wrote:
> > As the title says, I am having some difficulty integrating them. I
> > tried to follow the instructions at
> > http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html
> > but I don't really understand the Java code in it, and the article
> > doesn't go into detail about configuring Solr. I have downloaded
> > solr-client.zip, but what do I do with Nutch?
>
>
> It's my understanding that the code Sami posted will no longer work
> with recent versions of Solr / solrj.
>
> However, the Solr client (SOLR-20) was recently added to trunk:
> http://issues.apache.org/jira/browse/SOLR-20#action_12505314
> I sent Sami a patch for his posted code, and hopefully we'll see
> SolrIndexer get into Nutch trunk sometime soon.
>
> As for configuring Solr, that post does a good job of explaining it.
> There's not much to it: just use the schema he posted and start Solr
> normally.
>

-- 
Doğacan Güney