Posted to user@nutch.apache.org by Marek Bachmann <m....@uni-kassel.de> on 2011/07/11 15:46:37 UTC

The solrindex command

Hello there,

where can I find information about the solr document structure which 
the solrindex command sends to solr for indexing?

As far as I know, you add data to the solr index by sending a document 
with specific fields to the engine.

I would like to know how nutch creates these documents and which fields 
these documents contain.

In other words, what kind of information about a website is transferred 
to solr?

Thank you very much.




Re: The solrindex command

Posted by Marek Bachmann <m....@uni-kassel.de>.
On 11.07.2011 16:15, Markus Jelsma wrote:
>
>
> On Monday 11 July 2011 16:11:47 Marek Bachmann wrote:
>> Thank you very much
>>
>> On 11.07.2011 15:48, Markus Jelsma wrote:
>>> Hi,
>>>
>>> Using the brand-new IndexingFiltersChecker in 1.4-dev you can see exactly
>>> what Nutch is going to send. It comes down to the plugins you have
>>> defined. See the schema config for a list of fields per plug-in:
>>>
>>> http://svn.apache.org/viewvc/nutch/branches/branch-1.4/conf/schema.xml?view=markup
>>>
>>> Cheers
>>
>> So, as there is no "score" field in the schema.xml I guess the score for
>> a webpage in the crawl db has no effect in solr by default, am I right? :)
>
> There is no score field indeed but there is a boost field. This contains the
> score. Nutch will also set the Lucene document boost and field boost weights
> with this value.
>

Ahh! This is really important information for me! :-) Thanks!

>>
>>> On Monday 11 July 2011 15:46:37 Marek Bachmann wrote:
>>>> Hello there,
>>>>
>>>> where can I find information about the solr document structure which
>>>> the solrindex command sends to solr for indexing?
>>>>
>>>> As far as I know, you add data to the solr index by sending a document
>>>> with specific fields to the engine.
>>>>
>>>> I would like to know how nutch creates these documents and which fields
>>>> these documents contain.
>>>>
>>>> In other words, what kind of information about a website is transferred
>>>> to solr?
>>>>
>>>> Thank you very much.
>


Re: The solrindex command

Posted by Markus Jelsma <ma...@openindex.io>.

On Monday 11 July 2011 16:11:47 Marek Bachmann wrote:
> Thank you very much
> 
> On 11.07.2011 15:48, Markus Jelsma wrote:
> > Hi,
> > 
> > Using the brand-new IndexingFiltersChecker in 1.4-dev you can see exactly
> > what Nutch is going to send. It comes down to the plugins you have
> > defined. See the schema config for a list of fields per plug-in:
> > 
> > http://svn.apache.org/viewvc/nutch/branches/branch-1.4/conf/schema.xml?view=markup
> > 
> > Cheers
> 
> So, as there is no "score" field in the schema.xml I guess the score for
> a webpage in the crawl db has no effect in solr by default, am I right? :)

There is no score field indeed but there is a boost field. This contains the 
score. Nutch will also set the Lucene document boost and field boost weights 
with this value.
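
To illustrate the point about the boost field, here is a minimal sketch in
Python of a Solr XML add message in which one score value is used both as
the document boost and as the content of a "boost" field. It is not how
Nutch builds its documents (Nutch is written in Java and talks to Solr
through a client library). The function name build_add_message and the
sample field values are assumptions made for this example, and field-level
boosts are left out.

```python
import xml.etree.ElementTree as ET


def build_add_message(fields, score):
    """Build a Solr XML <add> message for one document.

    `fields` maps field names to values (hypothetical sample data).
    `score` stands in for the page score; it is written twice:
    as the document-level boost attribute and as a "boost" field.
    """
    add = ET.Element("add")
    # Document boost: the boost attribute on <doc>.
    doc = ET.SubElement(add, "doc", boost=str(score))
    for name, value in fields.items():
        ET.SubElement(doc, "field", name=name).text = str(value)
    # The same value stored as an ordinary field named "boost".
    ET.SubElement(doc, "field", name="boost").text = str(score)
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    sample = {"id": "http://example.org/", "title": "Example page"}
    print(build_add_message(sample, score=2.5))
```

Storing the value as a field as well makes it visible in query results,
while the document boost is only applied at index time.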

> 
> > On Monday 11 July 2011 15:46:37 Marek Bachmann wrote:
> >> Hello there,
> >> 
> >> where can I find information about the solr document structure which
> >> the solrindex command sends to solr for indexing?
> >> 
> >> As far as I know, you add data to the solr index by sending a document
> >> with specific fields to the engine.
> >> 
> >> I would like to know how nutch creates these documents and which fields
> >> these documents contain.
> >> 
> >> In other words, what kind of information about a website is transferred
> >> to solr?
> >> 
> >> Thank you very much.

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: The solrindex command

Posted by Marek Bachmann <m....@uni-kassel.de>.
Thank you very much

On 11.07.2011 15:48, Markus Jelsma wrote:
> Hi,
>
> Using the brand-new IndexingFiltersChecker in 1.4-dev you can see exactly what
> Nutch is going to send. It comes down to the plugins you have defined. See the
> schema config for a list of fields per plug-in:
>
> http://svn.apache.org/viewvc/nutch/branches/branch-1.4/conf/schema.xml?view=markup
>
> Cheers

So, as there is no "score" field in the schema.xml I guess the score for 
a webpage in the crawl db has no effect in solr by default, am I right? :)

>
> On Monday 11 July 2011 15:46:37 Marek Bachmann wrote:
>> Hello there,
>>
>> where can I find information about the solr document structure which
>> the solrindex command sends to solr for indexing?
>>
>> As far as I know, you add data to the solr index by sending a document
>> with specific fields to the engine.
>>
>> I would like to know how nutch creates these documents and which fields
>> these documents contain.
>>
>> In other words, what kind of information about a website is transferred
>> to solr?
>>
>> Thank you very much.
>


Re: The solrindex command

Posted by Markus Jelsma <ma...@openindex.io>.
Hi,

Using the brand-new IndexingFiltersChecker in 1.4-dev you can see exactly what 
Nutch is going to send. It comes down to the plugins you have defined. See the 
schema config for a list of fields per plug-in:

http://svn.apache.org/viewvc/nutch/branches/branch-1.4/conf/schema.xml?view=markup
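
As a rough illustration of "it comes down to the plugins you have defined",
the following Python sketch derives the set of possible document fields from
a list of enabled indexing plugins. The plugin-to-field table is an
assumption recalled from the comments in a Nutch 1.x schema.xml and may not
match branch-1.4 exactly, so check it against the file linked above. The
names FIELDS_BY_PLUGIN and expected_fields are made up for this example.

```python
# Assumed plugin -> field mapping (verify against conf/schema.xml).
FIELDS_BY_PLUGIN = {
    "core": ["id", "digest", "boost", "segment"],
    "index-basic": ["host", "site", "url", "content", "title", "cache", "tstamp"],
    "index-anchor": ["anchor"],
    "index-more": ["type", "contentLength", "lastModified", "date"],
}


def expected_fields(enabled_plugins):
    """Return the field names a document may carry for the given plugins.

    Core fields are always included; unknown plugin names contribute nothing.
    """
    fields = list(FIELDS_BY_PLUGIN["core"])
    for plugin in enabled_plugins:
        for name in FIELDS_BY_PLUGIN.get(plugin, []):
            if name not in fields:
                fields.append(name)
    return fields


if __name__ == "__main__":
    print(expected_fields(["index-basic", "index-anchor"]))
```

A field only shows up in a document if the plugin that produces it is
enabled and the page actually has a value for it, so treat the result as an
upper bound rather than a guaranteed list.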

Cheers

On Monday 11 July 2011 15:46:37 Marek Bachmann wrote:
> Hello there,
> 
> where can I find information about the solr document structure which
> the solrindex command sends to solr for indexing?
> 
> As far as I know, you add data to the solr index by sending a document
> with specific fields to the engine.
> 
> I would like to know how nutch creates these documents and which fields
> these documents contain.
> 
> In other words, what kind of information about a website is transferred
> to solr?
> 
> Thank you very much.

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350