Posted to solr-user@lucene.apache.org by Dario Rigolin <da...@comperio.it> on 2010/10/25 16:12:16 UTC

DataImporter using pure solr XML

Looking at the DataImporter, I'm not sure whether it's possible to import
using a standard <add><doc>... XML document representing a document add
operation. Generating <add><doc> is quite expensive in my application, so I
have cached all those documents in a text column in a MySQL database.
It would be easier for me to "push" all updated documents directly from the
database instead of passing them via multiple XML files posted in "stream"
mode to Solr.

Thank you.

Dario.

Re: DataImporter using pure solr XML

Posted by Ken Stanley <do...@gmail.com>.

Dario,

Technically, nothing is stopping you from using the DIH to import your XML
document(s). However, note that the <add><doc></doc></add> structure is not
required. In fact, you can make up your own structure for the documents, so
long as you configure the DIH to recognize it. At minimum, you should be
able to use something to the effect of:

    <dataConfig>
        <!-- Default data source; sets the encoding used to read the files -->
        <dataSource type="FileDataSource" encoding="UTF-8" />

        <document>
            <!-- Outer entity: builds the list of files to import -->
            <entity
                name="some_unique_name_for_the_entity"
                rootEntity="false"
                dataSource="null"
                processor="FileListEntityProcessor"
                fileName="some_regex_matching_your_files.*\.xml$"
                baseDir="/path/to/xml/files"
                newerThan="${dataimporter.some_unique_name_for_the_entity.last_index_time}"
            >
                <!-- Inner entity: parses each listed file into documents -->
                <entity
                    name="another_unique_entity_name"
                    dataSource="some_unique_name_for_the_entity"
                    processor="XPathEntityProcessor"
                    url="${some_unique_name_for_the_entity.fileAbsolutePath}"
                    forEach="/XMLROOT/CHILD_NODE"
                    stream="true"
                >
                    <!-- An optional list of <field /> definitions if your
                         XML schema does not match that of SOLR -->
                </entity>
            </entity>
        </document>
    </dataConfig>

The breakdown is as follows:

The <dataSource /> defines the document encoding that SOLR should use for
your XML files.

The top-level <entity /> builds the list of files to parse (which is why the
fileName attribute accepts a regular expression). The dataSource attribute
needs to be set to null here (I'm using 1.4.1, and AFAIK this is the same in
1.3 as well). The rootEntity="false" setting is important: it tells SOLR not
to create documents from this entity's rows, only from the nested entity.

The second-level <entity /> is where the files found by the file list are
parsed. The dataSource attribute needs to be the name of the top-level
<entity />. The url attribute is set to the absolute path of each file
produced by the top-level entity. The forEach attribute is the key component
here; it is the minimum XPath needed to iterate over your document
structure. For example, with forEach="/XMLROOT/CHILD_NODE", each
<CHILD_NODE> in a file like this becomes one document:

<XMLROOT>
    <CHILD_NODE>
         <field1>data</field1>
         <field2>more data</field2>
         ...
    </CHILD_NODE>
</XMLROOT>
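
If the element names in your XML do not match your SOLR field names, the
optional <field /> list mentioned in the config comment could look roughly
like the sketch below for the sample file above. The column values
(solr_field1, solr_field2) are made-up schema field names, not anything from
your setup, so substitute your own:

    <!-- Goes inside the second-level <entity>; column names are hypothetical -->
    <field column="solr_field1" xpath="/XMLROOT/CHILD_NODE/field1" />
    <field column="solr_field2" xpath="/XMLROOT/CHILD_NODE/field2" />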

Also note that, in my experience, the XPath expressions are case-sensitive,
so they must match the element names in your files exactly.

I hope this helps!

- Ken Stanley