Posted to solr-user@lucene.apache.org by Carl Roberts <ca...@gmail.com> on 2015/01/23 17:15:55 UTC

Is it possible to read multiple RSS feeds and XML Zip file feeds with DIH into one core?

Hi,

I have the RSS DIH example working with my own RSS feed - here is the 
configuration for it.

<dataConfig>
     <dataSource type="URLDataSource" />
     <document>
         <entity name="nvd-rss"
                 pk="link"
                 url="https://nvd.nist.gov/download/nvd-rss.xml"
                 processor="XPathEntityProcessor"
                 forEach="/RDF/item"
                 transformer="DateFormatTransformer">

             <field column="id" xpath="/RDF/item/title" commonField="true" />
             <field column="link" xpath="/RDF/item/link" commonField="true" />
             <field column="summary" xpath="/RDF/item/description" commonField="true" />
             <field column="date" xpath="/RDF/item/date" commonField="true" />

         </entity>
     </document>
</dataConfig>

However, my problem is that I also have to load multiple XML feeds into 
the same core.  Here is one example (there are about 10 of them):

http://static.nvd.nist.gov/feeds/xml/cve/nvdcve-2.0-2014.xml.zip


Is there any built-in functionality that would allow me to do this? 
Basically, the use-case is to load and index all the XML ZIP files 
first, and then check the RSS feed every two hours and update the 
indexes with any new ones.

Regards,

Joe



Re: Is it possible to read multiple RSS feeds and XML Zip file feeds with DIH into one core?

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Unzipping things might be an issue; you may need to do that as part of
a batch job outside of Solr. For the rest, go through the
documentation first - it answers a bunch of these questions. There is
also a page on the wiki, not just in the reference guide.

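For illustration, a minimal sketch of that batch approach - all names
here are hypothetical (core "nvd", entity "nvd-2014", path /data/nvd),
and the entity would point its url at the unzipped file, e.g.
file:///data/nvd/nvdcve-2.0-2014.xml:

# Fetch and unzip one NVD feed outside Solr, then trigger DIH on it.
curl -sO http://static.nvd.nist.gov/feeds/xml/cve/nvdcve-2.0-2014.xml.zip
unzip -o nvdcve-2.0-2014.xml.zip -d /data/nvd
curl "http://localhost:8983/solr/nvd/dataimport?command=full-import&entity=nvd-2014&clean=false"
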
Regards,
   Alex.
----
Sign up for my Solr resources newsletter at http://www.solr-start.com/



Re: Is it possible to read multiple RSS feeds and XML Zip file feeds with DIH into one core?

Posted by Carl Roberts <ca...@gmail.com>.
Excellent - thanks Shalin.  But how does delta-import work?  Does it do 
a clean also?  Does it require a unique Id?  Does it update existing 
records and only add when necessary?

And how would I go about unzipping the content from a URL so that I 
can then import the unzipped XML?  Is the recommended way to extend 
the URLDataSource class, or is there built-in support for plugging in 
pre-processing handlers?




Re: Is it possible to read multiple RSS feeds and XML Zip file feeds with DIH into one core?

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
If you add clean=false as a parameter to the full-import command, then
deletion is disabled. Since you are ingesting RSS, there is no need for
deletion at all, I guess.

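For example, a full-import trigger without cleaning might look like this
(the core name "nvd" is hypothetical):

curl "http://localhost:8983/solr/nvd/dataimport?command=full-import&clean=false"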


-- 
Regards,
Shalin Shekhar Mangar.

Re: Is it possible to read multiple RSS feeds and XML Zip file feeds with DIH into one core?

Posted by Carl Roberts <ca...@gmail.com>.
OK - Thanks for the doc.

Is it possible to just provide an empty value to preImportDeleteQuery to 
disable the delete prior to import?

Will the data still be deleted for each entity during a delta-import 
instead of full-import?

Is there any capability in the handler to unzip an XML file from a URL 
prior to reading it or can I perhaps hook a custom pre-processing handler?

Regards,

Joe




Re: Is it possible to read multiple RSS feeds and XML Zip file feeds with DIH into one core?

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
https://cwiki.apache.org/confluence/display/solr/Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler

The Admin UI has an interface for the DataImportHandler, so you can
play there once you define it.

You do have to use curl; there is no built-in scheduler.

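For the check-every-two-hours part, a crontab entry along these lines
is the usual workaround (core name "nvd" is hypothetical):

0 */2 * * * curl -s "http://localhost:8983/solr/nvd/dataimport?command=full-import&entity=nvd-rss&clean=false"
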
Regards,
   Alex.
----
Sign up for my Solr resources newsletter at http://www.solr-start.com/



Re: Is it possible to read multiple RSS feeds and XML Zip file feeds with DIH into one core?

Posted by Carl Roberts <ca...@gmail.com>.
Hi Alex,

If I am understanding this correctly, I can define multiple entities 
like this?

<document>
     <entity/>
     <entity/>
     <entity/>
     ...
</document>

How would I trigger loading certain entities during start?

How would I trigger loading other entities during update?

Is there a way to set an auto-update for certain entities so that I 
don't have to invoke an update via curl?

Where / how do I specify the preImportDeleteQuery to avoid deleting 
everything upon each update?

Is there an example or doc that shows how to do all this?

Regards,

Joe



Re: Is it possible to read multiple RSS feeds and XML Zip file feeds with DIH into one core?

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
You can define both multiple entities in the same file and nested
entities if your list comes from an external source (e.g. a text file
of URLs).
You can also trigger DIH with the name of a specific entity to load
just that one.
You can even pass a DIH configuration file when you trigger the
processing, so you can have completely different files for the initial
load and for updates - though you can just do the same with entities.

The only thing to be aware of is that before an entity definition is
processed, a delete command is run. By default, it's "delete all", so
executing one entity will delete everything but then just populate
that one entity's results. You can avoid that by defining
preImportDeleteQuery and having a clear identifier on content
generated by each entity (e.g. source, either extracted or manually
added with TemplateTransformer).

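As a minimal sketch of that idea - reusing the nvd-rss entity from this
thread plus a hypothetical nvd-2014 entity that reads an already-unzipped
file; the source field, its values, the file path, and the forEach
expression for the CVE feed are all illustrative:

<document>
    <entity name="nvd-rss"
            pk="link"
            url="https://nvd.nist.gov/download/nvd-rss.xml"
            processor="XPathEntityProcessor"
            forEach="/RDF/item"
            transformer="DateFormatTransformer,TemplateTransformer"
            preImportDeleteQuery="source:nvd-rss">
        <!-- tag rows from this entity so deletes can be scoped to it -->
        <field column="source" template="nvd-rss" />
        <!-- id/link/summary/date mappings as earlier in the thread -->
    </entity>
    <entity name="nvd-2014"
            url="file:///data/nvd/nvdcve-2.0-2014.xml"
            processor="XPathEntityProcessor"
            forEach="/nvd/entry"
            transformer="TemplateTransformer"
            preImportDeleteQuery="source:nvd-2014">
        <field column="source" template="nvd-2014" />
        <!-- field mappings for the CVE entries go here -->
    </entity>
</document>
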
Regards,
   Alex.

----
Sign up for my Solr resources newsletter at http://www.solr-start.com/

