Posted to solr-user@lucene.apache.org by Gunaranjan Chandraraju <ch...@apple.com> on 2009/01/21 00:45:35 UTC
Newbie Design Questions
Hi All
We are considering SOLR for a large database of XMLs. I have some
newbie questions - if there is a place I can go read about them, do let
me know and I will go read up :)
1. Currently we are able to pull the XMLs from the file system using
FileDataSource. The DIH is convenient since I can map my XML fields
using the XPathProcessor. This works for an initial load. After the
initial load, however, we would like to 'post' changed XMLs to SOLR
whenever the XML is updated in a separate system. I know we can post
XMLs with 'add', but I am not sure how to do this while keeping the
DIH mapping I use in data-config.xml. I don't want to save the file
to disk and then call the DIH - I would prefer to post it directly.
Do I need to use solrj for this?
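(Not an answer from the list, but one way to sketch the "post directly" idea: apply the same XPath mapping yourself in plain Java and send the resulting <add> XML to Solr's /update handler with any HTTP client or solrj. The record shape, field names, and XPaths below are illustrative assumptions, not the real schema.)

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class RecordToSolrXml {
    // Maps one source <record> XML to a Solr <add><doc> body using the same
    // kind of XPaths a data-config.xml mapping would use. Posting the result
    // to /update (followed by a <commit/>) is left to any HTTP client.
    // NOTE: field values are not XML-escaped here; real code should escape them.
    static String toAddXml(String recordXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(recordXml.getBytes(StandardCharsets.UTF_8)));
        XPath xp = XPathFactory.newInstance().newXPath();
        // Illustrative mapping, mirroring the style of a DIH entity definition.
        String[][] mapping = {
            {"id", "/record/coreinfo/@a"},
            {"type", "/record/coreinfo/@b"},
            {"streetname", "/record/address/@c"},
        };
        StringBuilder sb = new StringBuilder("<add><doc>");
        for (String[] m : mapping) {
            sb.append("<field name=\"").append(m[0]).append("\">")
              .append(xp.evaluate(m[1], doc)).append("</field>");
        }
        return sb.append("</doc></add>").toString();
    }

    public static void main(String[] args) throws Exception {
        String rec = "<record><coreinfo a=\"42\" b=\"home\"/><address c=\"Elm St\"/></record>";
        System.out.println(toAddXml(rec));
    }
}
```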
2. If my solr schema.xml changes, do I HAVE to reindex all the
old documents? Suppose in the future we have newer XML documents that
contain an additional XML field. The old documents that are
already indexed don't have this field and (so) I don't need to search
on them with this field. However, the new ones need to be searchable on
this new field. Can I just add this new field to the SOLR schema,
restart the servers, and post only the new documents, or do I need to
reindex everything?
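(For what it's worth, a sketch of what such an addition could look like in schema.xml - field name and type are assumptions. A field declared non-required simply has no value on older documents, which stay searchable on their existing fields:)

```xml
<!-- New optional field: already-indexed documents simply have no value for it. -->
<field name="newfield" type="string" indexed="true" stored="true" required="false"/>
```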
3. Can I back up the index directory, so that in case of a disk crash
I can restore this directory and bring solr up? I realize that any
documents indexed after this backup would be lost - I can, however, keep
track of these outside and simply re-index documents 'newer' than that
backup date. This question is really important to me in the context
of using a Master Server with a replicated index. I would like to run
this backup for the 'Master'.
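(A minimal sketch of the backup idea as a plain file-system copy - paths and timing are assumptions; it is only safe while nothing is writing to the index, e.g. right after a commit. Solr 1.x also ships snapshooter/snapinstaller scripts that serve this purpose on a master:)

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

public class IndexBackup {
    // Recursively copies a Lucene index directory to a backup location.
    // Files.walk visits parents before children, so directories are
    // created before the files inside them are copied.
    static void backup(Path indexDir, Path backupDir) throws IOException {
        try (Stream<Path> paths = Files.walk(indexDir)) {
            for (Path src : (Iterable<Path>) paths::iterator) {
                Path dst = backupDir.resolve(indexDir.relativize(src).toString());
                if (Files.isDirectory(src)) {
                    Files.createDirectories(dst);
                } else {
                    Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }
}
```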
4. In general, what happens when the solr application is bounced? Is
the index affected (is anything maintained in memory)?
Regards
Guna
Re: Newbie Design Questions
Posted by Noble Paul നോബിള് नोब्ळ् <no...@gmail.com>.
It is planned to be out in another month or so, but one can never be too sure.
On Fri, Jan 23, 2009 at 3:57 AM, Gunaranjan Chandraraju
<ch...@apple.com> wrote:
> Thanks
>
> A last question - do you have any approximate date for the release of 1.4.
> If its going to be soon enough (within a month or so) then I can plan for
> our development around it.
>
> Thanks
> Guna
--
--Noble Paul
Re: Newbie Design Questions
Posted by Gunaranjan Chandraraju <ch...@apple.com>.
Thanks
A last question - do you have any approximate date for the release of
1.4? If it's going to be soon enough (within a month or so), then I can
plan our development around it.
Thanks
Guna
On Jan 22, 2009, at 11:04 AM, Noble Paul നോബിള്
नोब्ळ् wrote:
> You are out of luck if you are not using a recent version of DIH
>
> The sub entity will work only if you use the FieldReaderDataSource.
> Then you do not need a ClobTransformer also.
>
> The trunk version of DIH can be used w/ Solr 1.3 release
Re: Newbie Design Questions
Posted by Noble Paul നോബിള് नोब्ळ् <no...@gmail.com>.
You are out of luck if you are not using a recent version of DIH.
The sub-entity will work only if you use the FieldReaderDataSource;
then you do not need a ClobTransformer either.
The trunk version of DIH can be used with the Solr 1.3 release.
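(For later reference, a sketch - unverified, attribute names as in the 1.4/trunk DIH - of what the CLOB setup might look like with FieldReaderDataSource in place of the ClobTransformer. The sub-entity reads its XML straight out of the parent row's field via dataField, so no file dump is needed:)

```xml
<dataConfig>
  <dataSource type="JdbcDataSource" name="data-source-1"
              driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@XXXXX"
              user="abc" password="***"/>
  <!-- Feeds the sub-entity from a field of the parent row. -->
  <dataSource type="FieldReaderDataSource" name="fld"/>
  <document>
    <entity name="item" dataSource="data-source-1" processor="SqlEntityProcessor"
            pk="ID" rootEntity="false"
            query="select xml_col from xml_table where xml_col IS NOT NULL">
      <entity name="record" dataSource="fld" processor="XPathEntityProcessor"
              dataField="item.xml_col" forEach="/record">
        <field column="ID" xpath="/record/coreinfo/@a"/>
        <field column="type" xpath="/record/coreinfo/@b"/>
        <field column="streetname" xpath="/record/address/@c"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```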
--
--Noble Paul
Re: Newbie Design Questions
Posted by Gunaranjan Chandraraju <ch...@apple.com>.
Hi
Yes, the XML is inside the DB in a CLOB. I would love to use XPath
inside SqlEntityProcessor, as it would save me tons of trouble with
file dumping (given that I am not able to post it). This is how I set
up my DIH for DB import.
<dataConfig>
  <dataSource type="JdbcDataSource" name="data-source-1"
              driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@XXXXX"
              user="abc" password="***" batchSize="100"/>
  <document>
    <!-- ClobTransformer is a custom clob transformer I saw, not the one from 1.4. -->
    <!-- Horrible query - I need to work on making it better. -->
    <entity dataSource="data-source-1"
            name="item"
            processor="SqlEntityProcessor"
            pk="ID"
            stream="false"
            rootEntity="false"
            transformer="ClobTransformer"
            query="select xml_col from xml_table where xml_col IS NOT NULL">
      <!-- dataSource here is my problem - if I don't give a name it complains,
           and if I put in "null" the code seems to fail with a null pointer. -->
      <entity dataSource="null"
              name="record"
              processor="XPathEntityProcessor"
              stream="false"
              url="${item.xml_col}"
              forEach="/record">
        <field column="ID" xpath="/record/coreinfo/@a" />
        <field column="type" xpath="/record/coreinfo/@b" />
        <field column="streetname" xpath="/record/address/@c" />
        <!-- .. and so on -->
      </entity>
    </entity>
  </document>
</dataConfig>
The problem is that it always fails with the error below. I can
see that the earlier SQL entity extraction and CLOB transformation are
working, as the values show in the debug JSP (verbose mode with
dataimport.jsp). However, no records are extracted for the entity. When
I check the catalina.out file, it shows me the following errors for the
entity name="record" (the XPath entity above).
java.lang.NullPointerException at
org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85)
I don't have the whole stack trace right now. If you need it I would
be happy to recreate and post it.
Regards,
Guna
On Jan 21, 2009, at 8:22 PM, Noble Paul നോബിള്
नोब्ळ् wrote:
> XPathEntityprocessor works inside SqlEntityprocessor only if a db
> field contains xml.
>
> However ,you can have a separate entity (at the root) to read from db
> for delta.
> Anyway if your current solution works stick to it.
Re: Newbie Design Questions
Posted by Noble Paul നോബിള് नोब्ळ् <no...@gmail.com>.
On Thu, Jan 22, 2009 at 7:02 AM, Gunaranjan Chandraraju
<ch...@apple.com> wrote:
> Thanks
>
> Yes the source of data is a DB. However the xml is also posted on updates
> via publish framework. So I can just plug in an adapter hear to listen for
> changes and post to SOLR. I was trying to use the XPathProcessor inside the
> SQLEntityProcessor and this did not work (using 1.3 - I did see support in
> 1.4). That is not a show stopper for me and I can just post them via the
> framework and use files for the first time load.
XPathEntityProcessor works inside SqlEntityProcessor only if a DB
field contains XML.
However, you can have a separate entity (at the root) to read from the DB
for deltas.
Anyway, if your current solution works, stick to it.
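(The separate root entity for deltas might be sketched like this - the last_modified column and the ${dataimporter...} variables are assumptions based on the DIH delta-import convention:)

```xml
<!-- Root-level entity used for delta imports: deltaQuery finds the changed
     primary keys, deltaImportQuery fetches each changed row. -->
<entity name="item" pk="ID" processor="SqlEntityProcessor"
        query="select ID, xml_col from xml_table where xml_col IS NOT NULL"
        deltaQuery="select ID from xml_table
                    where last_modified &gt; '${dataimporter.last_index_time}'"
        deltaImportQuery="select ID, xml_col from xml_table
                          where ID = '${dataimporter.delta.ID}'"/>
```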
>
> Have a seen a couple of answers on the backup for crash scenarios. just
> wanted to confirm - if I replace the index with the backup'ed files then I
> can simple start the up solr again and reindex the documents changed since
> last backup? Am I right? The slaves will also automatically adjust to this.
Yes, you can replace an archived index and Solr should work just fine,
but the docs added since the last snapshot was taken will be missing
(of course :))
>
> THanks
> Guna
>
>
> On Jan 20, 2009, at 9:37 PM, Noble Paul നോബിള് नोब्ळ् wrote:
>
>> On Wed, Jan 21, 2009 at 5:15 AM, Gunaranjan Chandraraju
>> <ch...@apple.com> wrote:
>>>
>>> Hi All
>>> We are considering SOLR for a large database of XMLs. I have some newbie
>>> questions - if there is a place I can go read about them do let me know
>>> and
>>> I will go read up :)
>>>
>>> 1. Currently we are able to pull the XMLs from a file systems using
>>> FileDataSource. The DIH is convenient since I can map my XML fields
>>> using
>>> the XPathProcessor. This works for an initial load. However after the
>>> initial load, we would like to 'post' changed xmls to SOLR whenever the
>>> XML
>>> is updated in a separate system. I know we can post xmls with 'add'
>>> however
>>> I was not sure how to do this and maintain the DIH mapping I use in
>>> data-config.xml? I don't want to save the file to the disk and then call
>>> the DIH - would prefer to directly post it. Do I need to use solrj for
>>> this?
>>
>> What is the source of your new data? Is it a DB?
>>
>>>
>>> 2. If my solr schema.xml changes then do I HAVE to reindex all the old
>>> documents? Suppose in future we have newer XML documents that contain a
>>> new
>>> additional xml field. The old documents that are already indexed don't
>>> have this field and (so) I don't need search on them with this field.
>>> However the new ones need to be search-able on this new field. Can I
>>> just add this new field to the SOLR schema, restart the servers just post
>>> the new new documents or do I need to reindex everything?
>>>
>>> 3. Can I backup the index directory. So that in case of a disk crash - I
>>> can restore this directory and bring solr up. I realize that any
>>> documents
>>> indexed after this backup would be lost - I can however keep track of
>>> these
>>> outside and simply re-index documents 'newer' than that backup date.
>>> This
>>> question is really important to me in the context of using a Master
>>> Server
>>> with replicated index. I would like to run this backup for the 'Master'.
>>
>> the snapshot script can be used to take backups on commit.
>>>
>>> 4. In general what happens when the solr application is bounced? Is the
>>> index affected (anything maintained in memory)?
>>>
>>> Regards
>>> Guna
>>>
>>
>>
>>
>> --
>> --Noble Paul
>
>
--
--Noble Paul
Re: Newbie Design Questions
Posted by Gunaranjan Chandraraju <ch...@apple.com>.
Thanks
Yes, the source of data is a DB. However, the XML is also posted on
updates via a publish framework, so I can just plug in an adapter here
to listen for changes and post to SOLR. I was trying to use the
XPathProcessor inside the SQLEntityProcessor and this did not work
(using 1.3 - I did see support in 1.4). That is not a show stopper
for me; I can just post them via the framework and use files for
the first-time load.
I have seen a couple of answers on the backup for crash scenarios.
I just wanted to confirm: if I replace the index with the backed-up
files, can I simply start up Solr again and reindex the
documents changed since the last backup? Am I right? The slaves will also
automatically adjust to this.
Thanks
Guna
On Jan 20, 2009, at 9:37 PM, Noble Paul നോബിള്
नोब्ळ् wrote:
> On Wed, Jan 21, 2009 at 5:15 AM, Gunaranjan Chandraraju
> <ch...@apple.com> wrote:
>> Hi All
>> We are considering SOLR for a large database of XMLs. I have some
>> newbie
>> questions - if there is a place I can go read about them do let me
>> know and
>> I will go read up :)
>>
>> 1. Currently we are able to pull the XMLs from a file system using
>> FileDataSource. The DIH is convenient since I can map my XML
>> fields using
>> the XPathProcessor. This works for an initial load. However
>> after the
>> initial load, we would like to 'post' changed xmls to SOLR whenever
>> the XML
>> is updated in a separate system. I know we can post xmls with
>> 'add' however
>> I was not sure how to do this and maintain the DIH mapping I use in
>> data-config.xml? I don't want to save the file to the disk and
>> then call
>> the DIH - would prefer to directly post it. Do I need to use solrj
>> for
>> this?
>
> What is the source of your new data? Is it a DB?
>
>>
>> 2. If my solr schema.xml changes then do I HAVE to reindex all the
>> old
>> documents? Suppose in future we have newer XML documents that
>> contain a new
>> additional xml field. The old documents that are already indexed
>> don't
>> have this field and (so) I don't need search on them with this field.
>> However the new ones need to be search-able on this new field.
>> Can I
>> just add this new field to the SOLR schema, restart the servers, and
>> just post the new documents or do I need to reindex everything?
>>
>> 3. Can I backup the index directory. So that in case of a disk
>> crash - I
>> can restore this directory and bring solr up. I realize that any
>> documents
>> indexed after this backup would be lost - I can however keep track
>> of these
>> outside and simply re-index documents 'newer' than that backup
>> date. This
>> question is really important to me in the context of using a Master
>> Server
>> with replicated index. I would like to run this backup for the
>> 'Master'.
> The snapshot script can be used to take backups on commit.
>>
>> 4. In general what happens when the solr application is bounced?
>> Is the
>> index affected (anything maintained in memory)?
>>
>> Regards
>> Guna
>>
>
>
>
> --
> --Noble Paul
Re: Newbie Design Questions
Posted by Noble Paul നോബിള് नोब्ळ् <no...@gmail.com>.
On Wed, Jan 21, 2009 at 5:15 AM, Gunaranjan Chandraraju
<ch...@apple.com> wrote:
> Hi All
> We are considering SOLR for a large database of XMLs. I have some newbie
> questions - if there is a place I can go read about them do let me know and
> I will go read up :)
>
> 1. Currently we are able to pull the XMLs from a file system using
> FileDataSource. The DIH is convenient since I can map my XML fields using
> the XPathProcessor. This works for an initial load. However after the
> initial load, we would like to 'post' changed xmls to SOLR whenever the XML
> is updated in a separate system. I know we can post xmls with 'add' however
> I was not sure how to do this and maintain the DIH mapping I use in
> data-config.xml? I don't want to save the file to the disk and then call
> the DIH - would prefer to directly post it. Do I need to use solrj for
> this?
What is the source of your new data? Is it a DB?
>
> 2. If my solr schema.xml changes then do I HAVE to reindex all the old
> documents? Suppose in future we have newer XML documents that contain a new
> additional xml field. The old documents that are already indexed don't
> have this field and (so) I don't need search on them with this field.
> However the new ones need to be search-able on this new field. Can I
> just add this new field to the SOLR schema, restart the servers, and
> just post the new documents or do I need to reindex everything?
>
> 3. Can I backup the index directory. So that in case of a disk crash - I
> can restore this directory and bring solr up. I realize that any documents
> indexed after this backup would be lost - I can however keep track of these
> outside and simply re-index documents 'newer' than that backup date. This
> question is really important to me in the context of using a Master Server
> with replicated index. I would like to run this backup for the 'Master'.
The snapshot script can be used to take backups on commit.
>
> 4. In general what happens when the solr application is bounced? Is the
> index affected (anything maintained in memory)?
>
> Regards
> Guna
>
--
--Noble Paul
Re: Newbie Design Questions
Posted by Gunaranjan Chandraraju <ch...@apple.com>.
Hi Grant
Thanks for the reply. My response below.
The data is stored as XMLs. Each record/entity corresponds to an
XML. The XML is of the form
<record>
  <coreinfo a="xyz" b="456" c="123" ... />
  <additionalinfo t="xyz" y="333" ...>
    <addressinfo a="1" b="CA" ac="94087" ... />
  </additionalinfo>
</record>
I have currently put it in schema.xml and the DIH config as follows
schema.xml
<field name="id" type="string" ... />
<field name="rectype" type="long" ... />
data-import.xml
<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <document>
    <entity name="f" processor="FileListEntityProcessor"
            baseDir="/Users/guna/Applications/solr-apache/xml"
            fileName=".*xml"
            rootEntity="false"
            dataSource="null">
      <entity name="rec"
              processor="XPathEntityProcessor"
              stream="false"
              forEach="/record"
              url="${f.fileAbsolutePath}">
        <field column="ID" xpath="/record/coreinfo/@a" />
        <field column="type" xpath="/record/coreinfo/@b" />
        <field column="streetname" xpath="/record/address/@c" />
        <!-- ...and so on -->
      </entity>
    </entity>
  </document>
</dataConfig>
I don't need all the fields in the XML indexed or stored. I just
include the ones I need in the schema.xml and data-import.xml
Architecturally these XMLs are created, updated and stored in a
separate system. Currently I am dumping the files in a directory and
invoking the DIH.
Actually we have a publishing channel that publishes each XML whenever
it's updated or created. I'd really like to tap into this channel and
directly post the XML to SOLR instead of saving it to a file and then
invoking DIH. I'd also like to do it leveraging definitions like those in
the data-config XML, so that every time I can just post the original
XML and the XPath configuration takes care of extracting the relevant
fields.
I did take a look at Solr Cell in the link below. It seems to be only for
1.4, and currently 1.3 is the stable release.
Guna
On Jan 20, 2009, at 7:50 PM, Grant Ingersoll wrote:
>
> On Jan 20, 2009, at 6:45 PM, Gunaranjan Chandraraju wrote:
>
>> Hi All
>> We are considering SOLR for a large database of XMLs. I have some
>> newbie questions - if there is a place I can go read about them do
>> let me know and I will go read up :)
>>
>> 1. Currently we are able to pull the XMLs from a file system using
>> FileDataSource. The DIH is convenient since I can map my XML
>> fields using the XPathProcessor. This works for an initial load.
>> However after the initial load, we would like to 'post' changed
>> xmls to SOLR whenever the XML is updated in a separate system. I
>> know we can post xmls with 'add' however I was not sure how to do
>> this and maintain the DIH mapping I use in data-config.xml? I
>> don't want to save the file to the disk and then call the DIH -
>> would prefer to directly post it. Do I need to use solrj for this?
>
> You can likely use SolrJ, but then you probably need to parse the
> XML an extra time. You may also be able to use Solr Cell, which is
> the Tika integration such that you can send the XML straight to Solr
> and have it deal with it. See http://wiki.apache.org/solr/ExtractingRequestHandler
> Solr Cell is a push technology, whereas DIH is a pull technology.
>
> I don't know how compatible this would be w/ DIH. Ideally, in the
> future, they will cooperate as much as possible, but we are not
> there yet.
>
> As for your initial load, what if you ran a one time XSLT processor
> over all the files and transformed them to SolrXML and then just
> posted them the normal way? Then, going forward, any new files
> could just be written out as SolrXML as well.
>
> If you can give some more info about your content, I think it would
> be helpful.
>
>>
>>
>> 2. If my solr schema.xml changes then do I HAVE to reindex all the
>> old documents? Suppose in future we have newer XML documents that
>> contain a new additional xml field. The old documents that are
>> already indexed don't have this field and (so) I don't need search
>> on them with this field. However the new ones need to be search-
>> able on this new field. Can I just add this new field to the
>> SOLR schema, restart the servers, and just post the new documents or
>> do I need to reindex everything?
>
> Yes, you should be able to add new fields w/o problems. Where you
> can run into problems is renaming, removing, etc.
>
>>
>>
>> 3. Can I backup the index directory. So that in case of a disk
>> crash - I can restore this directory and bring solr up. I realize
>> that any documents indexed after this backup would be lost - I can
>> however keep track of these outside and simply re-index documents
>> 'newer' than that backup date. This question is really important
>> to me in the context of using a Master Server with replicated
>> index. I would like to run this backup for the 'Master'.
>
> Yes, just use the master/slave replication approach for doing backups.
>
>>
>>
>> 4. In general what happens when the solr application is bounced?
>> Is the index affected (anything maintained in memory)?
>
> I would recommend doing a commit before bouncing and letting all
> indexing operations complete. Worst case, assuming you are using
> Solr 1.3 or later, is that you may lose what is in memory.
>
> -Grant
>
> --------------------------
> Grant Ingersoll
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
Re: Newbie Design Questions
Posted by Grant Ingersoll <gs...@apache.org>.
On Jan 20, 2009, at 6:45 PM, Gunaranjan Chandraraju wrote:
> Hi All
> We are considering SOLR for a large database of XMLs. I have some
> newbie questions - if there is a place I can go read about them do
> let me know and I will go read up :)
>
> 1. Currently we are able to pull the XMLs from a file system using
> FileDataSource. The DIH is convenient since I can map my XML fields
> using the XPathProcessor. This works for an initial load. However
> after the initial load, we would like to 'post' changed xmls to SOLR
> whenever the XML is updated in a separate system. I know we can
> post xmls with 'add' however I was not sure how to do this and
> maintain the DIH mapping I use in data-config.xml? I don't want to
> save the file to the disk and then call the DIH - would prefer to
> directly post it. Do I need to use solrj for this?
You can likely use SolrJ, but then you probably need to parse the XML
an extra time. You may also be able to use Solr Cell, which is the
Tika integration such that you can send the XML straight to Solr and
have it deal with it. See http://wiki.apache.org/solr/ExtractingRequestHandler
Solr Cell is a push technology, whereas DIH is a pull technology.
I don't know how compatible this would be w/ DIH. Ideally, in the
future, they will cooperate as much as possible, but we are not there
yet.
As for your initial load, what if you ran a one time XSLT processor
over all the files and transformed them to SolrXML and then just
posted them the normal way? Then, going forward, any new files could
just be written out as SolrXML as well.
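For reference, a SolrXML add document of the sort described above might look like the following sketch (the field names are taken from the schema fragment earlier in the thread; the values are placeholders, not real data):

```xml
<add>
  <doc>
    <field name="id">xyz</field>
    <field name="rectype">456</field>
    <field name="streetname">Main St</field>
  </doc>
</add>
```

This would be POSTed to the /update handler (e.g. with the example post.jar or any HTTP client), followed by a <commit/> to make the documents visible to searchers.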
If you can give some more info about your content, I think it would be
helpful.
>
>
> 2. If my solr schema.xml changes then do I HAVE to reindex all the
> old documents? Suppose in future we have newer XML documents that
> contain a new additional xml field. The old documents that are
> already indexed don't have this field and (so) I don't need search
> on them with this field. However the new ones need to be search-
> able on this new field. Can I just add this new field to the SOLR
> schema, restart the servers, and just post the new documents or do I
> need to reindex everything?
Yes, you should be able to add new fields w/o problems. Where you can
run into problems is renaming, removing, etc.
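Concretely, adding an optional field can be as small as one extra line in the <fields> section of schema.xml (the field name and type here are illustrative, not from the original poster's schema):

```xml
<!-- new optional field; previously indexed documents simply have no value for it -->
<field name="newfield" type="string" indexed="true" stored="true" required="false" />
```

Existing documents are untouched after a restart; queries against the new field just won't match them, which is exactly the behavior asked about.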
>
>
> 3. Can I backup the index directory. So that in case of a disk
> crash - I can restore this directory and bring solr up. I realize
> that any documents indexed after this backup would be lost - I can
> however keep track of these outside and simply re-index documents
> 'newer' than that backup date. This question is really important to
> me in the context of using a Master Server with replicated index. I
> would like to run this backup for the 'Master'.
Yes, just use the master/slave replication approach for doing backups.
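In the Solr 1.3 collection distribution setup, snapshots are typically triggered from solrconfig.xml on the master; a sketch along the lines of the stock example config (paths and the exact surrounding updateHandler settings will differ per installation):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- take a snapshot of the index after every commit -->
  <listener event="postCommit" class="solr.RunExecutableListener">
    <str name="exe">solr/bin/snapshooter</str>
    <str name="dir">.</str>
    <bool name="wait">true</bool>
  </listener>
</updateHandler>
```

Each snapshot is a hard-linked copy of the index files, so it is cheap to take and suitable for copying off-box as the backup discussed here.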
>
>
> 4. In general what happens when the solr application is bounced?
> Is the index affected (anything maintained in memory)?
I would recommend doing a commit before bouncing and letting all
indexing operations complete. Worst case, assuming you are using Solr
1.3 or later, is that you may lose what is in memory.
-Grant
--------------------------
Grant Ingersoll
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ