Posted to solr-user@lucene.apache.org by Tom Evans <te...@googlemail.com> on 2016/01/12 14:05:12 UTC

SolrCloud, DIH, and XPathEntityProcessor

Hi all, trying to move our Solr 4 setup to SolrCloud (5.4). Having
some problems with a DIH config that attempts to load an XML file and
iterate through the nodes in that file: it tries to load the file from
disk instead of from zookeeper.

<entity
    dataSource="lookup_conf"
    rootEntity="false"
    name="lookups"
    processor="XPathEntityProcessor"
    url="lookup_conf.xml"
    forEach="/lookups/lookup">

The file exists in zookeeper, adjacent to the data_import.conf in the
lookups_config conf folder.

The exception:

2016-01-12 12:59:47.852 ERROR (Thread-44) [c:lookups s:shard1
r:core_node6 x:lookups_shard1_replica2] o.a.s.h.d.DataImporter Full
Import failed:java.lang.RuntimeException: java.lang.RuntimeException:
org.apache.solr.handler.dataimport.DataImportHandlerException:
java.lang.RuntimeException: java.io.FileNotFoundException: Could not
find file: lookup_conf.xml (resolved to:
/mnt/solr/server/lookups_shard1_replica2/conf/lookup_conf.xml
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:271)
        at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:417)
        at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:481)
        at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:462)
Caused by: java.lang.RuntimeException:
org.apache.solr.handler.dataimport.DataImportHandlerException:
java.lang.RuntimeException: java.io.FileNotFoundException: Could not
find file: lookup_conf.xml (resolved to:
/mnt/solr/server/lookups_shard1_replica2/conf/lookup_conf.xml
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:417)
        at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:233)
        ... 3 more
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException:
java.lang.RuntimeException: java.io.FileNotFoundException: Could not
find file: lookup_conf.xml (resolved to:
/mnt/solr/server/lookups_shard1_replica2/conf/lookup_conf.xml
        at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:62)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:287)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:225)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:202)
        at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:244)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
        ... 5 more
Caused by: java.lang.RuntimeException: java.io.FileNotFoundException:
Could not find file: lookup_conf.xml (resolved to:
/mnt/solr/server/lookups_shard1_replica2/conf/lookup_conf.xml
        at org.apache.solr.handler.dataimport.FileDataSource.getFile(FileDataSource.java:127)
        at org.apache.solr.handler.dataimport.FileDataSource.getData(FileDataSource.java:86)
        at org.apache.solr.handler.dataimport.FileDataSource.getData(FileDataSource.java:48)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:284)
        ... 10 more
Caused by: java.io.FileNotFoundException: Could not find file:
lookup_conf.xml (resolved to:
/mnt/solr/server/lookups_shard1_replica2/conf/lookup_conf.xml
        at org.apache.solr.handler.dataimport.FileDataSource.getFile(FileDataSource.java:123)
        ... 13 more


Any hints gratefully accepted

Cheers

Tom

Re: SolrCloud, DIH, and XPathEntityProcessor

Posted by Erick Erickson <er...@gmail.com>.
Yeah, that's essentially the nature of open source, someone
gets frustrated enough with current behavior and fixes it ;)...

There's never any harm in opening a JIRA, all you need to do
is register. It's not a bad idea to open one as you _start_ writing
the code, even providing very early versions of your patch for
people to comment on or to discuss approaches. And early
comments may save you a lot of work! No guarantees of course.

If you do put up a preliminary patch, just mention the current
state in the comments.

If you haven't seen it already, here's a primer:
https://wiki.apache.org/solr/HowToContribute

Best,
Erick

On Tue, Jan 12, 2016 at 7:16 AM, Tom Evans <te...@googlemail.com> wrote:
> On Tue, Jan 12, 2016 at 3:00 PM, Shawn Heisey <ap...@elyograg.org> wrote:
>> On 1/12/2016 7:45 AM, Tom Evans wrote:
>>> That makes no sense whatsoever. DIH loads the data_import.conf from ZK
>>> just fine, or is that provided to DIH from another module that does
>>> know about ZK?
>>
>> This is accomplished indirectly through a resource loader in the
>> SolrCore object that is responsible for config files.  Also, the
>> dataimport handler is created by the main Solr code which then hands the
>> configuration to the dataimport module.  DIH itself does not know about
>> zookeeper.
>
> ZkPropertiesWriter seems to know a little..
>
>>
>>> Either way, it is entirely sub-optimal to have SolrCloud store "all"
>>> its configuration in ZK, but still require manually storing and
>>> updating files on specific nodes in order to influence DIH. If a
>>> server is mistakenly not updated, or manually modified locally on
>>> disk, that node would start indexing documents differently than other
>>> replicas, which sounds dangerous and scary!
>>
>> The entity processor you are using accesses files through a Java
>> interface for mounted filesystems.  As already mentioned, it does not
>> know about zookeeper.
>>
>>> If there is not a ZkFileDataSource, it shouldn't be too tricky to add
>>> one... I'll see how much I dislike having config files on the host...
>>
>> Creating your own DIH class would be the only solution available right now.
>>
>> I don't know how useful this would be in practice.  Without special
>> config in multiple places, Zookeeper limits the size of the files it
>> contains to 1MB.  It is not designed to deal with a large amount of data
>> at once.
>
> This is not a large amount of data; it is a 5 KB XML file containing
> configuration of which tables to query for which fields and how to map
> them into the document.
>
>>
>> You could submit a feature request in Jira, but unless you supply a
>> complete patch that survives the review process, I do not know how
>> likely an implementation would be.
>
> We've already started an implementation, based on FileDataSource and
> using SolrZkClient, which we will deploy as an additional library
> while that process is ongoing, or if the patch doesn't survive it.
>
> Cheers
>
> Tom

Re: SolrCloud, DIH, and XPathEntityProcessor

Posted by Tom Evans <te...@googlemail.com>.
On Tue, Jan 12, 2016 at 3:00 PM, Shawn Heisey <ap...@elyograg.org> wrote:
> On 1/12/2016 7:45 AM, Tom Evans wrote:
>> That makes no sense whatsoever. DIH loads the data_import.conf from ZK
>> just fine, or is that provided to DIH from another module that does
>> know about ZK?
>
> This is accomplished indirectly through a resource loader in the
> SolrCore object that is responsible for config files.  Also, the
> dataimport handler is created by the main Solr code which then hands the
> configuration to the dataimport module.  DIH itself does not know about
> zookeeper.

ZkPropertiesWriter seems to know a little..

>
>> Either way, it is entirely sub-optimal to have SolrCloud store "all"
>> its configuration in ZK, but still require manually storing and
>> updating files on specific nodes in order to influence DIH. If a
>> server is mistakenly not updated, or manually modified locally on
>> disk, that node would start indexing documents differently than other
>> replicas, which sounds dangerous and scary!
>
> The entity processor you are using accesses files through a Java
> interface for mounted filesystems.  As already mentioned, it does not
> know about zookeeper.
>
>> If there is not a ZkFileDataSource, it shouldn't be too tricky to add
>> one... I'll see how much I dislike having config files on the host...
>
> Creating your own DIH class would be the only solution available right now.
>
> I don't know how useful this would be in practice.  Without special
> config in multiple places, Zookeeper limits the size of the files it
> contains to 1MB.  It is not designed to deal with a large amount of data
> at once.

This is not a large amount of data; it is a 5 KB XML file containing
configuration of which tables to query for which fields and how to map
them into the document.

>
> You could submit a feature request in Jira, but unless you supply a
> complete patch that survives the review process, I do not know how
> likely an implementation would be.

We've already started an implementation, based on FileDataSource and
using SolrZkClient, which we will deploy as an additional library
while that process is ongoing, or if the patch doesn't survive it.
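
For illustration only, the general shape of such a data source might
look like the sketch below. It is not the implementation mentioned
above and it does not use the real Solr DataSource or SolrZkClient
APIs: the class name ZkLikeDataSource, the fetch function standing in
for a ZooKeeper read, and the /configs/lookups_config base node are all
assumptions made for the example. It only shows the idea of resolving a
relative name against a config node instead of a directory on disk.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: mirrors the "query in, Reader out" shape of a file
// data source, but reads from a key/value store instead of the filesystem.
final class ZkLikeDataSource {
    private final String baseNode;                // assumed layout, e.g. "/configs/lookups_config"
    private final Function<String, byte[]> fetch; // stand-in for a ZooKeeper client read

    ZkLikeDataSource(String baseNode, Function<String, byte[]> fetch) {
        this.baseNode = baseNode;
        this.fetch = fetch;
    }

    // Relative names resolve against the base node; absolute paths pass through.
    String resolve(String query) {
        if (query.startsWith("/")) {
            return query;
        }
        return baseNode.endsWith("/") ? baseNode + query : baseNode + "/" + query;
    }

    // Returns the node content as a Reader, or fails with the resolved path in the message.
    Reader getData(String query) {
        String path = resolve(query);
        byte[] data = fetch.apply(path);
        if (data == null) {
            throw new IllegalArgumentException(
                "Could not find node: " + query + " (resolved to: " + path + ")");
        }
        return new StringReader(new String(data, StandardCharsets.UTF_8));
    }

    // Helper that drains a Reader into a String and closes it.
    static String readAll(Reader reader) {
        try (Reader r = reader) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // An in-memory map plays the role of ZooKeeper in this demo.
        Map<String, byte[]> nodes = new HashMap<>();
        nodes.put("/configs/lookups_config/lookup_conf.xml",
                  "<lookups/>".getBytes(StandardCharsets.UTF_8));
        ZkLikeDataSource ds = new ZkLikeDataSource("/configs/lookups_config", nodes::get);
        System.out.println(readAll(ds.getData("lookup_conf.xml")));
    }
}
```

In a real DIH plugin the fetch function would be replaced by a call
into whatever ZooKeeper client the core exposes, and the class would
extend the DIH data source base class; both details are left out here
because they depend on the Solr version.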

Cheers

Tom

Re: SolrCloud, DIH, and XPathEntityProcessor

Posted by Shawn Heisey <ap...@elyograg.org>.
On 1/12/2016 7:45 AM, Tom Evans wrote:
> That makes no sense whatsoever. DIH loads the data_import.conf from ZK
> just fine, or is that provided to DIH from another module that does
> know about ZK?

This is accomplished indirectly through a resource loader in the
SolrCore object that is responsible for config files.  Also, the
dataimport handler is created by the main Solr code which then hands the
configuration to the dataimport module.  DIH itself does not know about
zookeeper.
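
As a rough illustration of that indirection (a sketch with made-up
names, not Solr's actual classes): the core chooses a loader, reads the
config through it, and hands only the resulting text to the import
module. ConfigLoader, DiskConfigLoader, MapConfigLoader, ImportModule
and LoaderDemo are all hypothetical, and a plain map stands in for
zookeeper.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.Map;

// Hypothetical stand-in for the resource loader owned by the core.
interface ConfigLoader {
    String load(String name);
}

// Loader that reads config files from a directory on local disk.
final class DiskConfigLoader implements ConfigLoader {
    private final Path confDir;

    DiskConfigLoader(Path confDir) {
        this.confDir = confDir;
    }

    @Override
    public String load(String name) {
        try {
            return new String(Files.readAllBytes(confDir.resolve(name)), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

// Loader backed by an in-memory map that plays the role of zookeeper here.
final class MapConfigLoader implements ConfigLoader {
    private final Map<String, String> nodes;

    MapConfigLoader(Map<String, String> nodes) {
        this.nodes = nodes;
    }

    @Override
    public String load(String name) {
        String data = nodes.get(name);
        if (data == null) {
            throw new IllegalArgumentException("No such node: " + name);
        }
        return data;
    }
}

// Hypothetical import module: it receives config text and never sees the loader.
final class ImportModule {
    private final String config;

    ImportModule(String config) {
        this.config = config;
    }

    String config() {
        return config;
    }
}

final class LoaderDemo {
    // The "core" side: pick a loader, read the config, pass on only the text.
    static ImportModule createImporter(ConfigLoader loader, String configName) {
        return new ImportModule(loader.load(configName));
    }

    public static void main(String[] args) {
        ConfigLoader zkLike = new MapConfigLoader(
            Collections.singletonMap("data_import.conf", "<dataConfig/>"));
        System.out.println(createImporter(zkLike, "data_import.conf").config());
    }
}
```

Because ImportModule only ever holds a string, anything it opens later
by name goes through whatever file access it implements itself, which
is the situation described here for DIH.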

> Either way, it is entirely sub-optimal to have SolrCloud store "all"
> its configuration in ZK, but still require manually storing and
> updating files on specific nodes in order to influence DIH. If a
> server is mistakenly not updated, or manually modified locally on
> disk, that node would start indexing documents differently than other
> replicas, which sounds dangerous and scary!

The entity processor you are using accesses files through a Java
interface for mounted filesystems.  As already mentioned, it does not
know about zookeeper.

> If there is not a ZkFileDataSource, it shouldn't be too tricky to add
> one... I'll see how much I dislike having config files on the host...

Creating your own DIH class would be the only solution available right now.

I don't know how useful this would be in practice.  Without special
config in multiple places, Zookeeper limits the size of the files it
contains to 1MB.  It is not designed to deal with a large amount of data
at once.
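
To make the size limit concrete, here is a small sketch of a pre-upload
check. The class and method names are hypothetical. The constant is the
default documented for ZooKeeper's jute.maxbuffer setting (0xfffff
bytes, just under 1MB); since that limit applies to the serialized
request and not only to the file content, treat the check as an
approximation.

```java
// Hypothetical helper: checks a payload size against the znode size limit.
final class ZnodeSizeCheck {
    // Documented default of ZooKeeper's jute.maxbuffer: 0xfffff bytes.
    static final int DEFAULT_JUTE_MAX_BUFFER = 0xfffff;

    // True if a payload of the given size is within the configured limit.
    static boolean fitsInZnode(int payloadBytes, int maxBufferBytes) {
        if (payloadBytes < 0 || maxBufferBytes < 0) {
            throw new IllegalArgumentException("sizes must be non-negative");
        }
        return payloadBytes <= maxBufferBytes;
    }

    public static void main(String[] args) {
        int smallConfig = 5 * 1024;        // a config file of about 5 KB
        int largeDump = 2 * 1024 * 1024;   // a 2 MB data file
        System.out.println(fitsInZnode(smallConfig, DEFAULT_JUTE_MAX_BUFFER));
        System.out.println(fitsInZnode(largeDump, DEFAULT_JUTE_MAX_BUFFER));
    }
}
```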

You could submit a feature request in Jira, but unless you supply a
complete patch that survives the review process, I do not know how
likely an implementation would be.

Thanks,
Shawn


Re: SolrCloud, DIH, and XPathEntityProcessor

Posted by Tom Evans <te...@googlemail.com>.
On Tue, Jan 12, 2016 at 2:32 PM, Shawn Heisey <ap...@elyograg.org> wrote:
> On 1/12/2016 6:05 AM, Tom Evans wrote:
>> Hi all, trying to move our Solr 4 setup to SolrCloud (5.4). Having
>> some problems with a DIH config that attempts to load an XML file and
>> iterate through the nodes in that file: it tries to load the file from
>> disk instead of from zookeeper.
>>
>> <entity
>>     dataSource="lookup_conf"
>>     rootEntity="false"
>>     name="lookups"
>>     processor="XPathEntityProcessor"
>>     url="lookup_conf.xml"
>>     forEach="/lookups/lookup">
>>
>> The file exists in zookeeper, adjacent to the data_import.conf in the
>> lookups_config conf folder.
>
> SolrCloud puts all the *config* for Solr into zookeeper, and adds a new
> abstraction for indexes (the collection), but other parts of Solr like
> DIH are not really affected.  The entity processors in DIH cannot
> retrieve data from zookeeper.  They do not know how.

That makes no sense whatsoever. DIH loads the data_import.conf from ZK
just fine, or is that provided to DIH from another module that does
know about ZK?

Either way, it is entirely sub-optimal to have SolrCloud store "all"
its configuration in ZK, but still require manually storing and
updating files on specific nodes in order to influence DIH. If a
server is mistakenly not updated, or manually modified locally on
disk, that node would start indexing documents differently than other
replicas, which sounds dangerous and scary!

If there is not a ZkFileDataSource, it shouldn't be too tricky to add
one... I'll see how much I dislike having config files on the host...

Cheers

Tom

Re: SolrCloud, DIH, and XPathEntityProcessor

Posted by Shawn Heisey <ap...@elyograg.org>.
On 1/12/2016 6:05 AM, Tom Evans wrote:
> Hi all, trying to move our Solr 4 setup to SolrCloud (5.4). Having
> some problems with a DIH config that attempts to load an XML file and
> iterate through the nodes in that file: it tries to load the file from
> disk instead of from zookeeper.
> 
> <entity
>     dataSource="lookup_conf"
>     rootEntity="false"
>     name="lookups"
>     processor="XPathEntityProcessor"
>     url="lookup_conf.xml"
>     forEach="/lookups/lookup">
> 
> The file exists in zookeeper, adjacent to the data_import.conf in the
> lookups_config conf folder.

SolrCloud puts all the *config* for Solr into zookeeper, and adds a new
abstraction for indexes (the collection), but other parts of Solr like
DIH are not really affected.  The entity processors in DIH cannot
retrieve data from zookeeper.  They do not know how.

Thanks,
Shawn