Posted to dev@nutch.apache.org by Lajos <la...@protulae.com> on 2014/01/14 14:07:00 UTC

Proposal for SolrIndexWriter

Hi all,

I've been working with Nutch/Solr integration for several enterprise 
search projects for clients (as well as my forthcoming Solr book). I 
think there are some real issues with the paradigm, and I'd like to 
propose a slightly modified approach which I've had to take myself.

I think it's backwards to base the mapping of the NutchDocument to the 
SolrDocument on the fields in the former. There are several problems:

1) this requires Solr to support all Nutch fields, which might not be 
the case (like segment). That is an unreasonable requirement.
2) you can map a Nutch field to at most 2 Solr fields (i.e. one via a 
<field> and one via a <copy> tag, because the source attribute is the Map 
key and therefore you can only have one entry per source in each map)
3) there is no support for any transformations, literals, etc., like 
there is in, say, the Solr data import handler
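
To make point 2 concrete, here is a minimal sketch of what happens when a mapping is keyed by source field. It is a simplified model, not the actual Nutch code: the class name, the register helper and the Solr field names (title_exact, title_sort) are all hypothetical. It only shows that a map keyed by source can hold one destination per source, so a third destination displaces an earlier one.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a source-keyed mapping; not the actual Nutch code.
final class SourceKeyedMappingDemo {

    // Registers source -> dest and returns the destination that was displaced,
    // or null if the source had no entry yet.
    static String register(Map<String, String> map, String source, String dest) {
        return map.put(source, dest);
    }

    public static void main(String[] args) {
        Map<String, String> fieldMap = new HashMap<>(); // models <field> entries
        Map<String, String> copyMap = new HashMap<>();  // models <copy> entries

        register(fieldMap, "title", "title");
        register(copyMap, "title", "title_exact");

        // A third destination for the same source replaces the earlier copy target,
        // because the source name is the key.
        String displaced = register(copyMap, "title", "title_sort");

        System.out.println("displaced copy target: " + displaced);
        System.out.println("fieldMap=" + fieldMap + " copyMap=" + copyMap);
    }
}
```

With one map for <field> and one for <copy>, a single source can therefore reach at most two destinations.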

For example, I've built an enterprise search tool that aggregates lots 
of different data sources together and uses Nutch to crawl the intranet. 
The schema doesn't match everything Nutch sends. I have some literals 
that need to be set and I need transformations.

My approach was to reverse the building of the SolrDocument, and 
populate the doc based on the Solr destination fields as defined in 
solrindex-mapping.xml, i.e., it populates the doc based on what the 
target Solr wants to receive, not just what Nutch wants to send.

The map of fields in solrindex-mapping.xml is now keyed by dest, i.e. 
the Solr field name, not source. That way, I can map a source to 
multiple destinations if I want. I further add a mapping type attribute 
(defaults to just a simple copy from Nutch to Solr) that supports 
literals and (shortly) transformations.
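
As a rough illustration of the idea (not the code from my plugin), the sketch below builds the outgoing document by walking a map keyed by destination field. Everything in it is an assumption made for the example: the Rule and MappingType names, the use of plain Maps in place of NutchDocument and SolrInputDocument, and the field names. It supports only a copy type and a literal type.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: names and types are assumptions, not the Nutch or SolrJ API.
final class DestKeyedMappingDemo {

    enum MappingType { COPY, LITERAL }

    // One rule per Solr destination field. For COPY, value names the Nutch
    // source field; for LITERAL, value is the constant to index.
    static final class Rule {
        final MappingType type;
        final String value;

        Rule(MappingType type, String value) {
            this.type = type;
            this.value = value;
        }
    }

    // Populates the outgoing document from the destination-keyed rules.
    // Nutch fields that no rule refers to are simply never sent.
    static Map<String, Object> buildSolrDoc(Map<String, Rule> rulesByDest,
                                            Map<String, List<Object>> nutchDoc) {
        Map<String, Object> solrDoc = new LinkedHashMap<>();
        for (Map.Entry<String, Rule> entry : rulesByDest.entrySet()) {
            Rule rule = entry.getValue();
            if (rule.type == MappingType.LITERAL) {
                solrDoc.put(entry.getKey(), rule.value);
            } else {
                List<Object> values = nutchDoc.get(rule.value);
                if (values != null && !values.isEmpty()) {
                    solrDoc.put(entry.getKey(), values);
                }
            }
        }
        return solrDoc;
    }

    public static void main(String[] args) {
        Map<String, Rule> rules = new LinkedHashMap<>();
        // The same source can feed any number of destinations.
        rules.put("title", new Rule(MappingType.COPY, "title"));
        rules.put("title_exact", new Rule(MappingType.COPY, "title"));
        rules.put("title_sort", new Rule(MappingType.COPY, "title"));
        rules.put("source_system", new Rule(MappingType.LITERAL, "intranet"));

        Map<String, List<Object>> nutchDoc = Map.of(
            "title", List.of("Quarterly report"),
            "segment", List.of("20140114140700"));

        System.out.println(buildSolrDoc(rules, nutchDoc));
    }
}
```

Because the destination is the key, one source can feed three Solr fields, and the segment field is left out without Solr having to define it.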

The change is easy, works well and, I think, fits better with the Solr 
paradigm. I've made this change in the 1.x plugin but obviously it can 
easily be ported to 2.x.

If you see some merit in this approach, I can open a JIRA and submit 
the changes. I also have an apache.org account somewhere (from my 
OpenEJB days) and would be happy to help implement it if you'd 
like. I think adding in transformations would be a further benefit.

Let me know.

Thanks,

Lajos

Re: Proposal for SolrIndexWriter

Posted by Lajos <la...@protulae.com>.

Hi Markus,

Sorry for the delay, I've been swamped.

Sure, adding additional Nutch fields to my Solr schema isn't a big deal. 
But it's not always so simple.

For one thing, I just happen to need to map a Nutch field to 3 Solr 
fields (ok, the use case could be changed, but it illustrates the 
point). The current mapping won't allow it, because the maps are keyed 
by source field and you can therefore only have one entry in each map 
(keyMap and copyMap). I feel that bit in the SolrMappingReader was just 
thrown in there and not completely thought through.

As well, I ideally need to do some extra things to the fields as they 
are being mapped, à la Solr DIH.

And finally, it isn't a big effort - the work of rewriting the necessary 
methods took only an hour, and it has been ticking along perfectly ever 
since. I've been working on the transformation as I go along. I took the 
route of just making a new plugin, although I'm not sure how that would 
work best with 2.x.

At any rate, I'm not going to push the issue, but the work is available.

Cheers,

L


On 17/01/2014 12:43, Markus Jelsma wrote:
> Hi - I've been thinking about your mail, but I cannot see any reason to make a big effort to support something that, from my point of view, is not worth it. Whether you call it Nutch- or Solr-centric, in the end it comes down to some arbitrary fields being populated. If for some reason a user really needs to have their fields named something else, they can map them very easily. It is also trivial to create a custom indexing filter. That is much more flexible than some automatic thing that is supposed to work for everyone.
>
> You write that you had to integrate Nutch's fields into your Solr schema. That doesn't sound like a task you'd spend more than one hour on.
>
> I may be totally misunderstanding your ideas, but so far I don't see a big problem that needs to be fixed.
>
> Cheers,
> Markus

RE: Proposal for SolrIndexWriter

Posted by Markus Jelsma <ma...@openindex.io>.

Hi - I've been thinking about your mail, but I cannot see any reason to make a big effort to support something that, from my point of view, is not worth it. Whether you call it Nutch- or Solr-centric, in the end it comes down to some arbitrary fields being populated. If for some reason a user really needs to have their fields named something else, they can map them very easily. It is also trivial to create a custom indexing filter. That is much more flexible than some automatic thing that is supposed to work for everyone.

You write that you had to integrate Nutch's fields into your Solr schema. That doesn't sound like a task you'd spend more than one hour on.

I may be totally misunderstanding your ideas, but so far I don't see a big problem that needs to be fixed.

Cheers,
Markus
 
-----Original message-----
> From:Lajos <la...@protulae.com>
> Sent: Tuesday 14th January 2014 23:08
> To: dev@nutch.apache.org
> Subject: Re: Proposal for SolrIndexWriter
> 
> I realise I should have made myself clearer on one point.
> 
> I understand that the current design comes from a Nutch-centric 
> paradigm, in which Solr is used to hold the indexing data from Nutch. In 
> this paradigm, I suppose the Nutch data needs to be fully mapped to Solr.
> 
> But I'm interested in a Solr-centric paradigm where Nutch is feeding 
> data to Solr for a Solr-based application to use. I don't have any idea 
> which is more popular, but all my own uses of Nutch have required me to 
> integrate it with existing Solr schemas, and for that I need a 
> different and much more flexible approach.
> 
> So maybe what I'm suggesting would be a parallel set of components for 
> the second scenario, given that the first would still need to be 
> supported. Possibly the existing set of components could support both 
> paradigms, but that would be messy.
> 
> L

Re: Proposal for SolrIndexWriter

Posted by Lajos <la...@protulae.com>.

I realise I should have made myself clearer on one point.

I understand that the current design comes from a Nutch-centric 
paradigm, in which Solr is used to hold the indexing data from Nutch. In 
this paradigm, I suppose the Nutch data needs to be fully mapped to Solr.

But I'm interested in a Solr-centric paradigm where Nutch is feeding 
data to Solr for a Solr-based application to use. I don't have any idea 
which is more popular, but all my own uses of Nutch have required me to 
integrate it with existing Solr schemas, and for that I need a 
different and much more flexible approach.

So maybe what I'm suggesting would be a parallel set of components for 
the second scenario, given that the first would still need to be 
supported. Possibly the existing set of components could support both 
paradigms, but that would be messy.

L

