Posted to solr-user@lucene.apache.org by Gunaranjan Chandraraju <ch...@apple.com> on 2009/01/21 00:45:35 UTC

Newbie Design Questions

Hi All
We are considering Solr for a large database of XMLs.  I have some
newbie questions - if there is a place I can go read about them, do let
me know and I will go read up :)

1. Currently we are able to pull the XMLs from a file system using
FileDataSource.  The DIH is convenient since I can map my XML fields
using the XPathEntityProcessor. This works for an initial load.  However,
after the initial load we would like to post changed XMLs to Solr
whenever the XML is updated in a separate system.  I know we can post
XMLs with 'add', but I was not sure how to do that and still use the
DIH mapping from my data-config.xml.  I don't want to save the file to
disk and then call the DIH - I would prefer to post it directly.
Do I need to use SolrJ for this?

2.  If my Solr schema.xml changes, do I HAVE to reindex all the
old documents?  Suppose in future we have newer XML documents that
contain an additional XML field.  The old documents that are
already indexed don't have this field and (so) I don't need to search
them on it.  However, the new ones need to be searchable on
this new field.  Can I just add the new field to the Solr schema,
restart the servers and post the new documents, or do I need to
reindex everything?

3. Can I back up the index directory, so that in case of a disk crash
I can restore this directory and bring Solr up? I realize that any
documents indexed after this backup would be lost - I can however keep
track of these outside and simply re-index documents newer than the
backup date.  This question is really important to me in the context
of using a master server with a replicated index.  I would like to run
this backup for the master.

4.  In general, what happens when the Solr application is bounced?  Is
the index affected (is anything maintained in memory)?

Regards
Guna

Re: Newbie Design Questions

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com>.
It is planned to be in another month or so, but it is never too sure.


On Fri, Jan 23, 2009 at 3:57 AM, Gunaranjan Chandraraju
<ch...@apple.com> wrote:
> Thanks
>
> A last question - do you have any approximate date for the release of 1.4?
> If it's going to be soon enough (within a month or so) then I can plan for
> our development around it.
>
> Thanks
> Guna



-- 
--Noble Paul

Re: Newbie Design Questions

Posted by Gunaranjan Chandraraju <ch...@apple.com>.
Thanks

A last question - do you have any approximate date for the release of
1.4? If it's going to be soon enough (within a month or so) then I can
plan for our development around it.

Thanks
Guna

On Jan 22, 2009, at 11:04 AM, Noble Paul നോബിള്‍ नोब्ळ् wrote:

> You are out of luck if you are not using a recent version of DIH.
>
> The sub entity will work only if you use the FieldReaderDataSource.
> Then you do not need a ClobTransformer either.
>
> The trunk version of DIH can be used with the Solr 1.3 release.


Re: Newbie Design Questions

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com>.
You are out of luck if you are not using a recent version of DIH.

The sub entity will work only if you use the FieldReaderDataSource.
Then you do not need a ClobTransformer either.

The trunk version of DIH can be used with the Solr 1.3 release.
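
For illustration, the data-config could then look roughly like this (an
untested sketch based on the DIH wiki; note the dataField attribute on
the sub entity, which points it at the parent row's column):

<dataConfig>
  <dataSource type="JdbcDataSource" name="data-source-1"
              driver="oracle.jdbc.driver.OracleDriver"
              url="jdbc:oracle:thin:@XXXXX"
              user="abc" password="***" batchSize="100"/>
  <!-- reads the XML straight out of a field of the parent entity -->
  <dataSource type="FieldReaderDataSource" name="fld"/>
  <document>
    <entity dataSource="data-source-1" name="item"
            processor="SqlEntityProcessor" pk="ID" rootEntity="false"
            query="select xml_col from xml_table where xml_col IS NOT NULL">
      <entity dataSource="fld" name="record"
              processor="XPathEntityProcessor"
              dataField="item.xml_col"
              forEach="/record">
        <field column="ID" xpath="/record/coreinfo/@a" />
        <field column="type" xpath="/record/coreinfo/@b" />
        <field column="streetname" xpath="/record/address/@c" />
      </entity>
    </entity>
  </document>
</dataConfig>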




-- 
--Noble Paul

Re: Newbie Design Questions

Posted by Gunaranjan Chandraraju <ch...@apple.com>.
Hi

Yes, the XML is inside the DB in a clob.  I would love to use XPath
inside the SqlEntityProcessor, as it will save me tons of trouble with
file-dumping (given that I am not able to post it).  This is how I set
up my DIH for DB import.

<dataConfig>
  <dataSource type="JdbcDataSource" name="data-source-1"
              driver="oracle.jdbc.driver.OracleDriver"
              url="jdbc:oracle:thin:@XXXXX"
              user="abc" password="***" batchSize="100"/>
  <document>
    <!-- transformer: a custom clob transformer I saw, not the one from 1.4 -->
    <!-- query: horrible query, I need to work on making it better -->
    <entity dataSource="data-source-1"
            name="item"
            processor="SqlEntityProcessor"
            pk="ID"
            stream="false"
            rootEntity="false"
            transformer="ClobTransformer"
            query="select xml_col from xml_table where xml_col IS NOT NULL">

      <!-- dataSource is my problem - if I don't give a name here it
           complains; if I put in null then the code seems to fail with
           a null pointer -->
      <entity dataSource="null"
              name="record"
              processor="XPathEntityProcessor"
              stream="false"
              url="${item.xml_col}"
              forEach="/record">

        <field column="ID" xpath="/record/coreinfo/@a" />
        <field column="type" xpath="/record/coreinfo/@b" />
        <field column="streetname" xpath="/record/address/@c" />
        <!-- .. and so on -->
      </entity>
    </entity>
  </document>
</dataConfig>


The problem with this is that it always fails with the error below.  I can
see that the earlier SQL entity extraction and clob transformation are
working, as the values show in the debug JSP (verbose mode with
dataimport.jsp).  However, no records are extracted for the entity.  When I
check the catalina.out file, it shows me the following error for the
entity name="record" (the XPath entity shown above).

java.lang.NullPointerException at
org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85)

I don't have the whole stack trace right now.  If you need it I would  
be happy to recreate and post it.

Regards,
Guna



Re: Newbie Design Questions

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com>.
On Thu, Jan 22, 2009 at 7:02 AM, Gunaranjan Chandraraju
<ch...@apple.com> wrote:
> Thanks
>
> Yes, the source of data is a DB.  However, the XML is also posted on updates
> via a publish framework, so I can just plug in an adapter here to listen for
> changes and post to Solr.  I was trying to use the XPathEntityProcessor inside
> the SqlEntityProcessor and this did not work (using 1.3 - I did see support
> in 1.4).  That is not a show stopper for me; I can just post them via the
> framework and use files for the first-time load.
XPathEntityProcessor works inside SqlEntityProcessor only if a DB
field contains XML.

However, you can have a separate entity (at the root) to read from the
DB for deltas - a rough sketch follows below.
Anyway, if your current solution works, stick with it.
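
For illustration, such a root-level delta entity could look something
like this (untested; the last_modified column is made up, and the delta
run is triggered with command=delta-import):

<entity name="item" pk="ID" processor="SqlEntityProcessor"
        query="select xml_col from xml_table where xml_col IS NOT NULL"
        deltaQuery="select ID from xml_table
                    where last_modified > '${dataimporter.last_index_time}'">
  <!-- same XPath sub entity as in the full-import config -->
</entity>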
>
> I have seen a couple of answers on the backup for crash scenarios.  I just
> wanted to confirm - if I replace the index with the backed-up files then I
> can simply start up Solr again and reindex the documents changed since the
> last backup? Am I right?  The slaves will also automatically adjust to this.
Yes, you can replace an archived index and Solr should work just fine,
but the docs added since the last snapshot was taken will be missing
(of course :) )



-- 
--Noble Paul

Re: Newbie Design Questions

Posted by Gunaranjan Chandraraju <ch...@apple.com>.
Thanks

Yes, the source of data is a DB.  However, the XML is also posted on
updates via a publish framework, so I can just plug in an adapter here
to listen for changes and post to Solr.  I was trying to use the
XPathEntityProcessor inside the SqlEntityProcessor and this did not
work (using 1.3 - I did see support in 1.4).  That is not a show
stopper for me; I can just post them via the framework and use files
for the first-time load.

I have seen a couple of answers on the backup for crash scenarios.  I
just wanted to confirm - if I replace the index with the backed-up
files then I can simply start up Solr again and reindex the documents
changed since the last backup? Am I right?  The slaves will also
automatically adjust to this.

Thanks
Guna




Re: Newbie Design Questions

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com>.
On Wed, Jan 21, 2009 at 5:15 AM, Gunaranjan Chandraraju
<ch...@apple.com> wrote:
> Hi All
> We are considering Solr for a large database of XMLs.  I have some newbie
> questions - if there is a place I can go read about them, do let me know
> and I will go read up :)
>
> 1. Currently we are able to pull the XMLs from a file system using
> FileDataSource.  The DIH is convenient since I can map my XML fields using
> the XPathEntityProcessor. This works for an initial load.  However, after
> the initial load we would like to post changed XMLs to Solr whenever the
> XML is updated in a separate system.  I know we can post XMLs with 'add',
> but I was not sure how to do that and still use the DIH mapping from my
> data-config.xml.  I don't want to save the file to disk and then call the
> DIH - I would prefer to post it directly.  Do I need to use SolrJ for
> this?

What is the source of your new data? Is it a DB?

>
> 2.  If my Solr schema.xml changes, do I HAVE to reindex all the old
> documents?  Suppose in future we have newer XML documents that contain an
> additional XML field.  The old documents that are already indexed don't
> have this field and (so) I don't need to search them on it.  However, the
> new ones need to be searchable on this new field.  Can I just add the new
> field to the Solr schema, restart the servers and post the new documents,
> or do I need to reindex everything?
>
> 3. Can I back up the index directory, so that in case of a disk crash I
> can restore this directory and bring Solr up? I realize that any documents
> indexed after this backup would be lost - I can however keep track of
> these outside and simply re-index documents newer than the backup date.
> This question is really important to me in the context of using a master
> server with a replicated index.  I would like to run this backup for the
> master.
the snapshot script can be used to take backups on commit.
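
For instance, the example solrconfig.xml ships with a (commented out)
postCommit hook roughly like the following, which runs the snapshooter
script after every commit:

<listener event="postCommit" class="solr.RunExecutableListener">
  <str name="exe">solr/bin/snapshooter</str>
  <str name="dir">.</str>
  <bool name="wait">true</bool>
</listener>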
>
> 4.  In general, what happens when the Solr application is bounced?  Is the
> index affected (is anything maintained in memory)?
>
> Regards
> Guna
>



-- 
--Noble Paul

Re: Newbie Design Questions

Posted by Gunaranjan Chandraraju <ch...@apple.com>.
Hi Grant
Thanks for the reply.  My response below.


The data is stored as XMLs.  Each record/entity corresponds to an  
XML.  The XML is of the form

<record>
   <CoreInfo a="xyz" b="456" c="123" ... />
   <AdditionalInfo t="xyz" y="333" ....>
   <addressinfo a="1" b="CA" ac="94087" ..>
   ...

</record>

I have currently put it in a schema.xml and DIH handler as follows

schema.xml
   <field name="id" type="string" ...>
  <field name="rectype" type="long" ..>


data-import.xml

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <document>
    <entity name="f" processor="FileListEntityProcessor"
            baseDir="/Users/guna/Applications/solr-apache/xml"
            fileName=".*xml"
            rootEntity="false"
            dataSource="null">
      <entity name="rec"
              processor="XPathEntityProcessor"
              stream="false"
              forEach="/record"
              url="${f.fileAbsolutePath}">
        <field column="ID" xpath="/record/coreinfo/@a" />
        <field column="type" xpath="/record/coreinfo/@b" />
        <field column="streetname" xpath="/record/address/@c" />

        .. and so on


I don't need all the fields in the XML indexed or stored. I just  
include the ones I need in the schema.xml and data-import.xml

Architecturally these XMLs are created, updated and stored in a  
separate system.  Currently I am dumping the files in a directory and  
invoking the DIH.

Actually we have a publishing channel that publishes each XML whenever
it's updated or created.  I'd really like to tap into this channel and
post the XML directly to Solr instead of saving it to a file and then
invoking DIH.  I'd also like to do it leveraging definitions like those
in the data-config XML, so that each time I can just post the original
XML and the XPath configuration takes care of extracting the relevant
fields.

I did take a look at Solr Cell at the link you sent.  It seems to be
only for 1.4, and currently 1.3 is the stable release.


Guna


Re: Newbie Design Questions

Posted by Grant Ingersoll <gs...@apache.org>.
On Jan 20, 2009, at 6:45 PM, Gunaranjan Chandraraju wrote:

> Hi All
> We are considering Solr for a large database of XMLs.  I have some newbie
> questions - if there is a place I can go read about them, do let me know
> and I will go read up :)
>
> 1. Currently we are able to pull the XMLs from a file system using
> FileDataSource.  The DIH is convenient since I can map my XML fields using
> the XPathEntityProcessor. This works for an initial load.  However, after
> the initial load we would like to post changed XMLs to Solr whenever the
> XML is updated in a separate system.  I know we can post XMLs with 'add',
> but I was not sure how to do that and still use the DIH mapping from my
> data-config.xml.  I don't want to save the file to disk and then call the
> DIH - I would prefer to post it directly.  Do I need to use SolrJ for
> this?

You can likely use SolrJ, but then you probably need to parse the XML
an extra time.  You may also be able to use Solr Cell, which is the Tika
integration, such that you can send the XML straight to Solr and have
it deal with it.  See http://wiki.apache.org/solr/ExtractingRequestHandler
for details.  Solr Cell is a push technology, whereas DIH is a pull
technology.

I don't know how compatible this would be w/ DIH.  Ideally, in the  
future, they will cooperate as much as possible, but we are not there  
yet.

As for your initial load, what if you ran a one-time XSLT pass over
all the files and transformed them to SolrXML, and then just posted
them the normal way?  Then, going forward, any new files could just
be written out as SolrXML as well.
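
For reference, SolrXML posted to the /update handler is just a flat list
of fields per document, along these lines (field names borrowed from your
data-config, values made up):

<add>
  <doc>
    <field name="ID">12345</field>
    <field name="type">xyz</field>
    <field name="streetname">Main St</field>
  </doc>
</add>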

If you can give some more info about your content, I think it would be  
helpful.

>
>
> 2.  If my Solr schema.xml changes, do I HAVE to reindex all the old
> documents?  Suppose in future we have newer XML documents that contain an
> additional XML field.  The old documents that are already indexed don't
> have this field and (so) I don't need to search them on it.  However, the
> new ones need to be searchable on this new field.  Can I just add the new
> field to the Solr schema, restart the servers and post the new documents,
> or do I need to reindex everything?

Yes, you should be able to add new fields w/o problems.  Where you can  
run into problems is renaming, removing, etc.
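
For example, adding a line like the following to schema.xml (the name
and type here are illustrative) and restarting should be enough;
documents indexed before the change simply have no value for the new
field:

<field name="newfield" type="string" indexed="true" stored="true"/>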

>
>
> 3. Can I back up the index directory, so that in case of a disk crash I
> can restore this directory and bring Solr up? I realize that any documents
> indexed after this backup would be lost - I can however keep track of
> these outside and simply re-index documents newer than the backup date.
> This question is really important to me in the context of using a master
> server with a replicated index.  I would like to run this backup for the
> master.

Yes, just use the master/slave replication approach for doing backups.
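
As a sketch, if you end up on 1.4, the Java-based replication can be
configured on the master roughly like this (on 1.3 the equivalent is
the rsync-based snapshooter/snappuller scripts):

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>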

>
>
> 4.  In general, what happens when the Solr application is bounced?  Is the
> index affected (is anything maintained in memory)?

I would recommend doing a commit before bouncing and letting all  
indexing operations complete.  Worst case, assuming you are using Solr  
1.3 or later, is that you may lose what is in memory.

-Grant

--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ