Posted to solr-user@lucene.apache.org by Pulkit Singhal <pu...@gmail.com> on 2011/10/09 21:47:46 UTC

Interesting DIH challenge

Hello Folks,

I'm a big DIH fan but I'm fairly sure that now I've run into a scenario
where it can't help me anymore ... but before I give up and roll my own
solution, I just wanted to check with everyone else.

The scenario:
- already have 1M+ documents indexed
- the schema.xml needs to have one more field added to it ... is that a
problem or doable? Do I remove all the old data, or do the update per doc
(add/delete)?
- I need to populate data from a file that has a key and a value per line:
use the key to find the doc to update, then add the value to the new
schema field

Any ideas?

Re: Interesting DIH challenge

Posted by Chantal Ackermann <ch...@btelligent.de>.
Hi Gora,

sure, glad to be of help.
If you find any problems with the xslt it would be great if you could
notify me.

I remember having problems with empty fields (see SOLR-1790:
https://issues.apache.org/jira/browse/SOLR-1790 ).

I think my solution was to make sure that the response of the source
core does not include any empty fields because it seems not to be
possible to handle that in XSLT (at least from what I found out
searching the i-net).
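A minimal sketch of a different workaround than the one described above: instead of keeping empty fields out of the source core's response, it strips empty <field> elements from the generated <add> document before it is posted. The function name is hypothetical, and it assumes the document fits in memory.

```python
import xml.etree.ElementTree as ET

def drop_empty_fields(add_xml):
    """Remove <field> elements whose text is missing or whitespace-only."""
    root = ET.fromstring(add_xml)
    for doc in root.iter("doc"):
        # Copy the list first so removing children does not disturb iteration.
        for field in list(doc.findall("field")):
            if not (field.text or "").strip():
                doc.remove(field)
    return ET.tostring(root, encoding="unicode")
```

Usage: pass the transformed response through drop_empty_fields() before sending it to the target core's update handler.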

Chantal

On Mon, 2011-10-10 at 16:07 +0200, Gora Mohanty wrote:
> On Mon, Oct 10, 2011 at 4:40 PM, Chantal Ackermann
> <ch...@btelligent.de> wrote:
> [...]
> 
> > (2) response-to-update.xsl (this goes into
> > $SOLR_HOME/sourceCore/conf/xslt/):
> [...]
> 
> Thanks for sharing that, and will check it out when possible.
> 
> This looks like it would be very useful, as we once rolled our
> own half-baked, non-Solr solution for a similar problem (I had
> somehow missed XsltResponseWriter back then). Maybe this
> should go into contributed code for Solr.
> 
> Regards,
> Gora


Re: Interesting DIH challenge

Posted by Gora Mohanty <go...@mimirtech.com>.
On Mon, Oct 10, 2011 at 4:40 PM, Chantal Ackermann
<ch...@btelligent.de> wrote:
[...]

> (2) response-to-update.xsl (this goes into
> $SOLR_HOME/sourceCore/conf/xslt/):
[...]

Thanks for sharing that, and will check it out when possible.

This looks like it would be very useful, as we once rolled our
own half-baked, non-Solr solution for a similar problem (I had
somehow missed XsltResponseWriter back then). Maybe this
should go into contributed code for Solr.

Regards,
Gora

Re: Interesting DIH challenge

Posted by Chantal Ackermann <ch...@btelligent.de>.
Hi there,

I have been using cores to build up new cores (for various reasons).
(I am not using SOLR as data storage; the cores are re-indexed
frequently.)

This solution works for releases 1.4 and 3 as it does not use the
SolrEntityProcessor.

To load data from another SOLR core and populate part of the new
document I use:

(1) in the target data-config.xml:
<entity name="content" dataSource="sourceCore"
        url="solr/gmaContent/select?q=contentid:${targetDoc.ID}&amp;wt=xslt&amp;tr=response-to-update.xsl"
        processor="my.custom.handler.dataimport.CachingXPathEntityProcessor"
        cacheKey="${targetDoc.ID}" useSolrAddSchema="true">
</entity>

(2) sourceCore's solrconfig.xml needs an entry (uncomment) for the xslt
response writer:

  <!-- XSLT response writer transforms the XML output by any xslt file found
       in Solr's conf/xslt directory.  Changes to xslt files are checked for
       every xsltCacheLifetimeSeconds.
    -->
  <queryResponseWriter name="xslt" class="solr.XSLTResponseWriter">
    <int name="xsltCacheLifetimeSeconds">6000</int>
  </queryResponseWriter>


(2) response-to-update.xsl (this goes into
$SOLR_HOME/sourceCore/conf/xslt/):


<?xml version='1.0' encoding='UTF-8'?>
<xsl:stylesheet version='1.0'
	xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
	
	<xsl:output method="xml" media-type="text/xml;charset=utf-8"
		indent="yes" encoding="UTF-8" omit-xml-declaration="no" />
	
	<xsl:template match='/'>
		<add>
			<xsl:apply-templates select="/response/result/doc" />
		</add>
	</xsl:template>

	<xsl:template match="doc">
	<doc>
		<xsl:choose>
		<xsl:when test="doc/*[name()='arr']">
			<xsl:apply-templates select="//arr" />
		</xsl:when>
		<xsl:otherwise>
			<xsl:apply-templates select="child::node()" />
		</xsl:otherwise>
		</xsl:choose>
	</doc>
	</xsl:template>

	<xsl:template match="//arr">
		<xsl:for-each select="child::node()">
			<xsl:element name="field">
				<xsl:attribute name="name"><xsl:value-of select="../@name" /></xsl:attribute>
				<xsl:value-of select="." />
			</xsl:element>
		</xsl:for-each>
	</xsl:template>

	<xsl:template match="child::node()">
		<xsl:element name="field">
			<xsl:attribute name="name"><xsl:value-of select="@name" /></xsl:attribute>
			<xsl:value-of select="." />
		</xsl:element>
	</xsl:template>

</xsl:stylesheet>

Cheers,
Chantal




On Mon, 2011-10-10 at 06:26 +0200, Gora Mohanty wrote:
> On Mon, Oct 10, 2011 at 6:30 AM, Pulkit Singhal <pu...@gmail.com> wrote:
> > @Gora Thank You!
> >
> > I know that Solr accepts xml with Solr specific elements that are commands
> > that only it understands ... such as <add/>, <commit/> etc.
> >
> > Question: Is there some way to ask Solr to dump out whatever it has in its
> > index already ... as a Solr xml document?
> 
> As far as I know, there is no way to do that out of the box. One would get
> the contents of each record with a normal Solr query, massage that into
> a Solr XML document, and use that to rebuild the index. Have not tried
> this, but it should be possible to get the desired output format with the
> XsltResponseWriter: http://wiki.apache.org/solr/XsltResponseWriter .
> 
> All in all, it seems easier to me to just reindex from the base source, unless
> that is not possible for some reason.
> 
> > Plan: I intend to massage that xml dump (add the field + value that I need
> > in every doc's xml element) and then I should be able to push this dump back
> > to Solr to get data indexed again, I hope.
> 
> Yes, that should be the general idea.
> 
> Regards,
> Gora


Re: Interesting DIH challenge

Posted by Gora Mohanty <go...@mimirtech.com>.
On Mon, Oct 10, 2011 at 6:30 AM, Pulkit Singhal <pu...@gmail.com> wrote:
> @Gora Thank You!
>
> I know that Solr accepts xml with Solr specific elements that are commands
> that only it understands ... such as <add/>, <commit/> etc.
>
> Question: Is there some way to ask Solr to dump out whatever it has in its
> index already ... as a Solr xml document?

As far as I know, there is no way to do that out of the box. One would get
the contents of each record with a normal Solr query, massage that into
a Solr XML document, and use that to rebuild the index. Have not tried
this, but it should be possible to get the desired output format with the
XsltResponseWriter: http://wiki.apache.org/solr/XsltResponseWriter .
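A rough sketch of how such an export could be requested page by page. The base URL, the page size, and the match-all query are assumptions; wt=xslt and tr=response-to-update.xsl are the parameters used elsewhere in this thread, and start/rows are the standard Solr paging parameters.

```python
from urllib.parse import urlencode

def export_urls(base_url, num_found, rows=1000, stylesheet="response-to-update.xsl"):
    """Yield one select URL per page of `rows` documents."""
    for start in range(0, num_found, rows):
        params = {
            "q": "*:*",        # match every document (assumed query)
            "start": start,
            "rows": rows,
            "wt": "xslt",      # route the response through the XSLT response writer
            "tr": stylesheet,  # stylesheet located in the core's conf/xslt directory
        }
        yield base_url + "/select?" + urlencode(params)
```

Each URL returns one page already transformed by the stylesheet. Paging with large start values gets slow on big indexes, so this is only a starting point.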

All in all, it seems easier to me to just reindex from the base source, unless
that is not possible for some reason.

> Plan: I intend to massage that xml dump (add the field + value that I need
> in every doc's xml element) and then I should be able to push this dump back
> to Solr to get data indexed again, I hope.

Yes, that should be the general idea.

Regards,
Gora

Re: Interesting DIH challenge

Posted by Ahmet Arslan <io...@yahoo.com>.
> Oh also: Does DIH have any
> experimental way for folks to be reading data
> from one solr core and then massaging it and importing it
> into another core?

SolrEntityProcessor can do that.
https://issues.apache.org/jira/browse/SOLR-1499

Re: Interesting DIH challenge

Posted by Pulkit Singhal <pu...@gmail.com>.
Oh also: Does DIH have any experimental way for folks to be reading data
from one solr core and then massaging it and importing it into another core?
If not, then would that be a good addition or just a waste of time for some
architectural reason?

On Sun, Oct 9, 2011 at 8:00 PM, Pulkit Singhal <pu...@gmail.com> wrote:

> @Gora Thank You!
>
> I know that Solr accepts xml with Solr specific elements that are commands
> that only it understands ... such as <add/>, <commit/> etc.
>
> Question: Is there some way to ask Solr to dump out whatever it has in its
> index already ... as a Solr xml document?
>
> Plan: I intend to massage that xml dump (add the field + value that I need
> in every doc's xml element) and then I should be able to push this dump back
> to Solr to get data indexed again, I hope.
>
> Thanks!
> - Pulkit
>
>
> On Sun, Oct 9, 2011 at 2:57 PM, Gora Mohanty <go...@mimirtech.com> wrote:
>
>> On Mon, Oct 10, 2011 at 1:17 AM, Pulkit Singhal <pu...@gmail.com> wrote:
>> > Hello Folks,
>> >
>> > I'm a big DIH fan but I'm fairly sure that now I've run into a scenario
>> > where it can't help me anymore ... but before I give up and roll my own
>> > solution, I just wanted to check with everyone else.
>> >
>> > The scenario:
>> > - already have 1M+ documents indexed
>> > - the schema.xml needs to have one more field added to it ...
>> > problem/do-able? yes? no? remove all the old data? or do the update per doc
>> > (add/delete)?
>>
>> This is independent of DIH. If you want to add a new field to the schema,
>> you should reindex. 1M documents should not take that long.
>>
>> > - need to populate data from a file that has a key and value per line and i
>> > need to use the key to find the doc to update and then add the value to the
>> > new schema field
>>
>> It is best just to reindex, but it should be possible to write a script to pull
>> the doc from the existing Solr index, massage the return format into
>> Solr's XML format, adding a value for the new field in the process, and
>> then posting the new file to Solr for indexing.
>>
>> Regards,
>> Gora
>>
>
>

Re: Interesting DIH challenge

Posted by Shawn Heisey <so...@elyograg.org>.
On 10/9/2011 7:00 PM, Pulkit Singhal wrote:
> I know that Solr accepts xml with Solr specific elements that are commands
> that only it understands ... such as <add/>, <commit/> etc.
>
> Question: Is there some way to ask Solr to dump out whatever it has in its
> index already ... as a Solr xml document?
>
> Plan: I intend to massage that xml dump (add the field + value that I need
> in every doc's xml element) and then I should be able to push this dump back
> to Solr to get data indexed again, I hope.

I don't know whether Solr will dump a format with the add tags, but I am 
guessing that it won't.

Although it is possible to use Solr as a data storage mechanism, it's 
not really designed for that role.  If you have not set all your fields 
to stored=true in schema.xml, you won't be able to do what you are 
thinking about at all.  Most solr installations do not store every 
field, because it makes the index huge.
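One quick way to check this before attempting a dump is to list the fields that schema.xml explicitly marks as not stored. This is only a sketch: it looks at the stored attribute on <field> elements and ignores defaults inherited from the field type as well as <dynamicField> entries, so an empty result does not prove that everything is stored. The function name is made up for illustration.

```python
import xml.etree.ElementTree as ET

def unstored_fields(schema_xml):
    """Return names of <field> entries explicitly declared with stored="false"."""
    root = ET.fromstring(schema_xml)
    return [f.get("name") for f in root.iter("field") if f.get("stored") == "false"]
```

Any field this reports cannot be recovered from the index, and would have to come from the original data source.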

For best results, you should be prepared at any time to rebuild your 
index from the original data source.  Have you incorporated the extra 
field into your original data source and normal DIH mechanism? If you 
have, simply run a full-import and you're in business.  Hopefully you've 
got a robust installation with multiple copies of the index, and can 
take one copy offline to do the rebuild.
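For reference, a full-import is started with a plain HTTP request to the DIH request handler. The sketch below only builds that URL. The host, the core name, and the /dataimport handler path are assumptions (the path is whatever solrconfig.xml registers); command, clean, and commit are standard DIH request parameters.

```python
from urllib.parse import urlencode

def full_import_url(core_url, handler="/dataimport", clean=True):
    """Build the URL that asks DIH to run a full-import on one core."""
    params = {
        "command": "full-import",
        "clean": str(clean).lower(),  # "true" wipes the index before importing
        "commit": "true",             # commit once the import finishes
    }
    return core_url.rstrip("/") + handler + "?" + urlencode(params)
```

Pointing this at the offline copy of the index keeps the rebuild away from live queries.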

At the Lucene Revolution conference in Boston last year, I saw a
presentation where one company was getting data from multiple sources 
into one or more staging Solr instances, and using that as a data source 
for their real index.  I believe it was the Hathi Trust, but I may be 
wrong there.  Whoever it was, I don't know if they have released source 
for this mechanism or not.

Thanks,
Shawn


Re: Interesting DIH challenge

Posted by Pulkit Singhal <pu...@gmail.com>.
@Gora Thank You!

I know that Solr accepts xml with Solr specific elements that are commands
that only it understands ... such as <add/>, <commit/> etc.

Question: Is there some way to ask Solr to dump out whatever it has in its
index already ... as a Solr xml document?

Plan: I intend to massage that xml dump (add the field + value that I need
in every doc's xml element) and then I should be able to push this dump back
to Solr to get data indexed again, I hope.

Thanks!
- Pulkit

On Sun, Oct 9, 2011 at 2:57 PM, Gora Mohanty <go...@mimirtech.com> wrote:

> On Mon, Oct 10, 2011 at 1:17 AM, Pulkit Singhal <pu...@gmail.com> wrote:
> > Hello Folks,
> >
> > I'm a big DIH fan but I'm fairly sure that now I've run into a scenario
> > where it can't help me anymore ... but before I give up and roll my own
> > solution, I just wanted to check with everyone else.
> >
> > The scenario:
> > - already have 1M+ documents indexed
> > - the schema.xml needs to have one more field added to it ...
> > problem/do-able? yes? no? remove all the old data? or do the update per doc
> > (add/delete)?
>
> This is independent of DIH. If you want to add a new field to the schema,
> you should reindex. 1M documents should not take that long.
>
> > - need to populate data from a file that has a key and value per line and i
> > need to use the key to find the doc to update and then add the value to the
> > new schema field
>
> It is best just to reindex, but it should be possible to write a script to pull
> the doc from the existing Solr index, massage the return format into
> Solr's XML format, adding a value for the new field in the process, and
> then posting the new file to Solr for indexing.
>
> Regards,
> Gora
>

Re: Interesting DIH challenge

Posted by Gora Mohanty <go...@mimirtech.com>.
On Mon, Oct 10, 2011 at 1:17 AM, Pulkit Singhal <pu...@gmail.com> wrote:
> Hello Folks,
>
> I'm a big DIH fan but I'm fairly sure that now I've run into a scenario
> where it can't help me anymore ... but before I give up and roll my own
> solution, I just wanted to check with everyone else.
>
> The scenario:
> - already have 1M+ documents indexed
> - the schema.xml needs to have one more field added to it ...
> problem/do-able? yes? no? remove all the old data? or do the update per doc
> (add/delete)?

This is independent of DIH. If you want to add a new field to the schema,
you should reindex. 1M documents should not take that long.

> - need to populate data from a file that has a key and value per line and i
> need to use the key to find the doc to update and then add the value to the
> new schema field

It is best just to reindex, but it should be possible to write a script to pull
the doc from the existing Solr index, massage the return format into
Solr's XML format, adding a value for the new field in the process, and
then posting the new file to Solr for indexing.
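A minimal sketch of the script approach described above, assuming the documents have already been fetched as a standard Solr XML response and that every field is stored. The unique key name "id", the new field name "new_field", and the tab-separated key/value file format are placeholders, not details from this thread.

```python
import xml.etree.ElementTree as ET

def load_key_values(lines, sep="\t"):
    """Parse 'key<sep>value' lines into a dict, skipping blank lines."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        key, value = line.split(sep, 1)
        mapping[key] = value
    return mapping

def response_to_add(response_xml, values, key_field="id", new_field="new_field"):
    """Turn a Solr XML query response into an <add> document, adding new_field."""
    response = ET.fromstring(response_xml)
    add = ET.Element("add")
    for src_doc in response.findall("./result/doc"):
        doc = ET.SubElement(add, "doc")
        doc_key = None
        for node in src_doc:
            name = node.get("name")
            # Multi-valued fields arrive as <arr>; emit one <field> per value.
            items = list(node) if node.tag == "arr" else [node]
            for item in items:
                if not item.text:
                    continue  # skip empty values rather than posting empty fields
                ET.SubElement(doc, "field", name=name).text = item.text
                if name == key_field:
                    doc_key = item.text
        # Attach the value from the key/value file when this doc has one.
        if doc_key in values:
            ET.SubElement(doc, "field", name=new_field).text = values[doc_key]
    return ET.tostring(add, encoding="unicode")
```

Usage: build the mapping with load_key_values(open("values.tsv")), call response_to_add() on each page of results, and post the returned XML to the core's update handler followed by a commit. The new field has to exist in schema.xml before the documents are posted.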

Regards,
Gora