Posted to solr-user@lucene.apache.org by Carl Roberts <ca...@gmail.com> on 2015/01/27 17:32:46 UTC

What is the recommended way to import and update index records?

Hi,

What is the recommended way to import and update index records?

I've read the documentation and I've experimented with full-import and 
delta-import and I am not seeing the desired results.

Basically, I have 15 RSS feeds that I am importing through 
rss-data-config.xml.

The first RSS feed should be a full import and the ones that follow may 
contain the same id, in which case the existing id in the index should 
be updated from the record in the new RSS feed.  Also there may be new 
records in the RSS feeds that follow the first one, in which case I want 
them added to the index.

When I try full-import for each entity, the index is cleared and I just 
end up with the records for the last import.

When I try full-import for each entity, with the clean=false parameter, 
all the records from each entity are added to the index and I end up 
with duplicate records.

When I try delta-import for the entities that follow the first one, I 
don't get any new index records.

How should I do this?

Regards,

Joe

Re: What is the recommended way to import and update index records?

Posted by Carl Roberts <ca...@gmail.com>.
Yep - it works with string.  Thanks a lot!



On 1/27/15, 7:08 PM, Alexandre Rafalovitch wrote:
>> <field name="id" type="text_general" indexed="true" stored="true"/>
> Make that id field a string and reindex. text_general is not the right
> type for a unique key.
>
> Regards,
>    Alex.


Re: What is the recommended way to import and update index records?

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
> <field name="id" type="text_general" indexed="true" stored="true"/>

Make that id field a string and reindex. text_general is not the right
type for a unique key.

Regards,
  Alex.
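
[Editor's note: the change Alex suggests amounts to a one-line edit in
schema.xml; a sketch, with the other attributes carried over unchanged
from the definition quoted in the thread:]

```xml
<!-- uniqueKey fields should use an untokenized type such as "string",
     so that delete/replace-by-key matches on the exact id value -->
<field name="id" type="string" indexed="true" stored="true"/>
```

[After changing the type, the collection must be reindexed for existing
documents to pick up the new definition, as Alex says.]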

Re: What is the recommended way to import and update index records?

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
On 27 January 2015 at 18:44, Carl Roberts <ca...@gmail.com> wrote:
> OK - I did a little testing and with full-import and clean=false, I get more
> and more records when I import the same XML file. I have also checked and I
> see that my uniqueKey is defined correctly.


1) Is this a SolrCloud or a single core setup?
2) What happens if you search for everything and facet on 'id'?

Regards,
   Alex.
----
Sign up for my Solr resources newsletter at http://www.solr-start.com/
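
[Editor's note: check (2) can be run as a plain facet query; a sketch,
using the core name and port that appear elsewhere in this thread. Any
id bucket with a count above 1 is a duplicate:]

```
curl "http://localhost:8983/solr/nvd-rss/select?q=*:*&rows=0&wt=json&indent=true&facet=true&facet.field=id&facet.mincount=2"
```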

Re: What is the recommended way to import and update index records?

Posted by Carl Roberts <ca...@gmail.com>.
OK - I did a little testing and with full-import and clean=false, I get 
more and more records when I import the same XML file. I have also 
checked and I see that my uniqueKey is defined correctly.

Here are my fields in schema.xml:

<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
<field name="_version_" type="long" indexed="true" stored="true"/>
<field name="id" type="text_general" indexed="true" stored="true"/>
<field name="cve" type="text_general" indexed="true" stored="true"/>
<field name="cwe" type="text_general" indexed="true" stored="true"/>
<field name="vulnerable-configuration" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="vulnerable-software" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="product" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="published" type="text_general" indexed="true" stored="true"/>
<field name="modified" type="text_general" indexed="true" stored="true"/>
<field name="summary" type="text_general" indexed="true" stored="true"/>
<field name="cvss-score" type="text_general" indexed="true" stored="true"/>
<field name="cvss-access-vector" type="text_general" indexed="true" stored="true"/>
<field name="cvss-access-complexity" type="text_general" indexed="true" stored="true"/>
<field name="cvss-authentication" type="text_general" indexed="true" stored="true"/>
<field name="cvss-confidentiality-impact" type="text_general" indexed="true" stored="true"/>
<field name="cvss-integrity-impact" type="text_general" indexed="true" stored="true"/>
<field name="cvss-availability-impact" type="text_general" indexed="true" stored="true"/>
<field name="reference" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="security-protection" type="text_general" indexed="true" stored="true"/>

And here is uniqueKey in schema.xml:

<uniqueKey>id</uniqueKey>


Here is my rss-data-config.xml:

<dataConfig>
    <dataSource type="ZIPURLDataSource" connectionTimeout="15000" readTimeout="30000"/>
    <document>
        <entity name="cve-2002"
                pk="id"
                url="https://nvd.nist.gov/feeds/xml/cve/nvdcve-2.0-2002.xml.zip"
                processor="XPathEntityProcessor"
                forEach="/nvd/entry"
                transformer="RegexTransformer">
            <field column="id" xpath="/nvd/entry/@id" commonField="false"/>
            <field column="cve" xpath="/nvd/entry/cve-id" commonField="false"/>
            <field column="cwe" xpath="/nvd/entry/cwe/@id" commonField="false"/>
            <field column="vulnerable-configuration" xpath="/nvd/entry/vulnerable-configuration/logical-test/fact-ref/@name" commonField="false"/>
            <field column="vulnerable-software" xpath="/nvd/entry/vulnerable-software-list/product" commonField="false"/>
            <field column="product" sourceColName="vulnerable-software" commonField="false" regex="cpe:/.:" replaceWith=""/>
            <field column="product" commonField="false" regex=":" replaceWith=" "/>
            <field column="published" xpath="/nvd/entry/published-datetime" commonField="false"/>
            <field column="modified" xpath="/nvd/entry/last-modified-datetime" commonField="false"/>
            <field column="summary" xpath="/nvd/entry/summary" commonField="false"/>
            <field column="cvss-score" xpath="/nvd/entry/cvss/base_metrics/score" commonField="false"/>
            <field column="cvss-access-vector" xpath="/nvd/entry/cvss/base_metrics/access-vector" commonField="false"/>
            <field column="cvss-access-complexity" xpath="/nvd/entry/cvss/base_metrics/access-complexity" commonField="false"/>
            <field column="cvss-authentication" xpath="/nvd/entry/cvss/base_metrics/authentication" commonField="false"/>
            <field column="cvss-confidentiality-impact" xpath="/nvd/entry/cvss/base_metrics/confidentiality-impact" commonField="false"/>
            <field column="cvss-integrity-impact" xpath="/nvd/entry/cvss/base_metrics/integrity-impact" commonField="false"/>
            <field column="cvss-availability-impact" xpath="/nvd/entry/cvss/base_metrics/availability-impact" commonField="false"/>
            <field column="reference" xpath="/nvd/entry/references/reference/@href" commonField="false"/>
            <field column="security-protection" xpath="/nvd/entry/security-protection" commonField="false"/>
        </entity>
    </document>
</dataConfig>

Here is the import command the first time:

curl "http://127.0.0.1:8983/solr/nvd-rss/dataimport?command=full-import&entity=cve-2002&clean=true"

Here is the command that outputs the count of records:

curl "http://localhost:8983/solr/nvd-rss/select?wt=json&indent=true&q=*:*&start=0&&rows=0&fl=*"

And here is the output:

{
   "responseHeader":{
     "status":0,
     "QTime":0,
     "params":{
       "fl":"*",
       "indent":"true",
       "start":"0",
       "q":"*:*",
       "wt":"json",
       "rows":"0"}},
   "response":{"numFound":6717,"start":0,"docs":[]
   }}

Now here is the next full-import command with clean=false:

curl "http://127.0.0.1:8983/solr/nvd-rss/dataimport?command=full-import&entity=cve-2002&clean=false"

And here is the new count:

curl "http://localhost:8983/solr/nvd-rss/select?wt=json&indent=true&q=*:*&start=0&&rows=0&fl=*"

{
   "responseHeader":{
     "status":0,
     "QTime":0,
     "params":{
       "fl":"*",
       "indent":"true",
       "start":"0",
       "q":"*:*",
       "wt":"json",
       "rows":"0"}},
   "response":{"numFound":13434,"start":0,"docs":[]
   }}

Clearly, this is just importing the same records twice.


What is even more puzzling is that if I search for an id value that is 
unique in the imported XML, I get all records back:

curl 
"http://localhost:8983/solr/nvd-rss/select?wt=json&indent=true&q=id:CVE-1999-0001&start=0&&rows=0&fl=*"
{
   "responseHeader":{
     "status":0,
     "QTime":0,
     "params":{
       "fl":"*",
       "indent":"true",
       "start":"0",
       "q":"id:CVE-1999-0001",
       "wt":"json",
       "rows":"0"}},
   "response":{"numFound":13434,"start":0,"docs":[]
   }}
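
[Editor's note: this result is consistent with the id field being
tokenized. text_general splits "CVE-1999-0001" on the hyphens, so
q=id:CVE-1999-0001 matches any document sharing a token such as "cve",
and the exact-term lookup used to replace documents by uniqueKey never
finds a match. A rough Python simulation of the difference; the
tokenizer is a crude stand-in for Solr's analysis chain, illustrative
only:]

```python
import re

def tokenize(value):
    # Crude stand-in for text_general: lowercase, split on non-alphanumerics.
    return [t for t in re.split(r"[^a-z0-9]+", value.lower()) if t]

docs = ["CVE-1999-0001", "CVE-1999-0002", "CVE-2002-1234"]

# Tokenized "unique" key: every NVD id shares the token "cve",
# so a query for one id matches all documents.
query_tokens = set(tokenize("CVE-1999-0001"))
matches = [d for d in docs if query_tokens & set(tokenize(d))]
print(len(matches))        # all three docs match, not just one

# Untokenized string key: exact-term match finds exactly one document.
exact_matches = [d for d in docs if d == "CVE-1999-0001"]
print(len(exact_matches))  # exactly one
```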

On 1/27/15, 2:03 PM, Carl Roberts wrote:
> HI Alex, thanks for clarifying this for me.  I'll take a look at my 
> setup of the uniqueKey.  Perhaps I did not set it right.
>
>
> On 1/27/15, 12:09 PM, Alexandre Rafalovitch wrote:
>> What do you mean by "update"? If you mean partial update, DIH does not
>> do it AFAIK. If you mean replace, it should.
>>
>> If you are getting duplicate records, maybe your uniqueKey is not set 
>> correctly?
>>
>> clean=false looks to me like the right approach for incremental updates.
>>
>> Regards,
>>     Alex.
>> ----
>> Sign up for my Solr resources newsletter at http://www.solr-start.com/
>>
>>
>> On 27 January 2015 at 11:43, Carl Roberts 
>> <ca...@gmail.com> wrote:
>>> Also, if I try full-import and clean=false with the same XML file, I 
>>> end up
>>> with more records each time the import runs.  How can I make SOLR 
>>> just add
>>> the records that are new by id, and update the ones that have an id 
>>> that
>>> matches the one in the existing index?
>>>
>>>
>>>
>>> On 1/27/15, 11:32 AM, Carl Roberts wrote:
>>>> Hi,
>>>>
>>>> What is the recommended way to import and update index records?
>>>>
>>>> I've read the documentation and I've experimented with full-import and
>>>> delta-import and I am not seeing the desired results.
>>>>
>>>> Basically, I have 15 RSS feeds that I am importing through
>>>> rss-data-config.xml.
>>>>
>>>> The first RSS feed should be a full import and the ones that follow 
>>>> may
>>>> contain the same id, in which case the existing id in the index 
>>>> should be
>>>> updated from the record in the new RSS feed. Also there may be new 
>>>> records
>>>> in the RSS feeds that follow the first one, in which case I want 
>>>> them added
>>>> to the index.
>>>>
>>>> When I try full-import for each entity, the index is cleared and I 
>>>> just
>>>> end up with the records for the last import.
>>>>
>>>> When I try full-import for each entity, with the clean=false 
>>>> parameter,
>>>> all the records from each entity are added to the index and I end 
>>>> up with
>>>> duplicate records.
>>>>
>>>> When I try delta-import for the entities that follow the first one, 
>>>> I don't
>>>> get any new index records.
>>>>
>>>> How should I do this?
>>>>
>>>> Regards,
>>>>
>>>> Joe
>>>
>


Re: What is the recommended way to import and update index records?

Posted by Carl Roberts <ca...@gmail.com>.
HI Alex, thanks for clarifying this for me.  I'll take a look at my 
setup of the uniqueKey.  Perhaps I did not set it right.


On 1/27/15, 12:09 PM, Alexandre Rafalovitch wrote:
> What do you mean by "update"? If you mean partial update, DIH does not
> do it AFAIK. If you mean replace, it should.
>
> If you are getting duplicate records, maybe your uniqueKey is not set correctly?
>
> clean=false looks to me like the right approach for incremental updates.
>
> Regards,
>     Alex.
> ----
> Sign up for my Solr resources newsletter at http://www.solr-start.com/
>
>
> On 27 January 2015 at 11:43, Carl Roberts <ca...@gmail.com> wrote:
>> Also, if I try full-import and clean=false with the same XML file, I end up
>> with more records each time the import runs.  How can I make SOLR just add
>> the records that are new by id, and update the ones that have an id that
>> matches the one in the existing index?
>>
>>
>>
>> On 1/27/15, 11:32 AM, Carl Roberts wrote:
>>> Hi,
>>>
>>> What is the recommended way to import and update index records?
>>>
>>> I've read the documentation and I've experimented with full-import and
>>> delta-import and I am not seeing the desired results.
>>>
>>> Basically, I have 15 RSS feeds that I am importing through
>>> rss-data-config.xml.
>>>
>>> The first RSS feed should be a full import and the ones that follow may
>>> contain the same id, in which case the existing id in the index should be
>>> updated from the record in the new RSS feed. Also there may be new records
>>> in the RSS feeds that follow the first one, in which case I want them added
>>> to the index.
>>>
>>> When I try full-import for each entity, the index is cleared and I just
>>> end up with the records for the last import.
>>>
>>> When I try full-import for each entity, with the clean=false parameter,
>>> all the records from each entity are added to the index and I end up with
>>> duplicate records.
>>>
>>> When I try delta-import for the entities that follow the first one, I don't
>>> get any new index records.
>>>
>>> How should I do this?
>>>
>>> Regards,
>>>
>>> Joe
>>


Re: What is the recommended way to import and update index records?

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
What do you mean by "update"? If you mean partial update, DIH does not
do it AFAIK. If you mean replace, it should.

If you are getting duplicate records, maybe your uniqueKey is not set correctly?

clean=false looks to me like the right approach for incremental updates.

Regards,
   Alex.
----
Sign up for my Solr resources newsletter at http://www.solr-start.com/
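
[Editor's note: the replace-on-matching-uniqueKey behaviour described
above can be sketched as a toy upsert; illustrative only, since Solr
implements this internally as a delete-by-exact-term plus add:]

```python
# Toy model of Solr's add semantics with a proper (untokenized) uniqueKey:
# adding a document whose key already exists replaces it, so re-importing
# the same feed with clean=false leaves the document count unchanged.
def import_feed(index, feed):
    for doc in feed:
        index[doc["id"]] = doc  # upsert: replace on matching key, else add

index = {}
feed_2002 = [{"id": "CVE-2002-0001", "summary": "first"},
             {"id": "CVE-2002-0002", "summary": "second"}]

import_feed(index, feed_2002)   # initial full import
import_feed(index, feed_2002)   # re-import with clean=false
print(len(index))               # count stays at 2, no duplicates

# A later feed can update an existing record and add a new one.
feed_2003 = [{"id": "CVE-2002-0001", "summary": "updated"},
             {"id": "CVE-2003-0001", "summary": "new"}]
import_feed(index, feed_2003)
print(len(index), index["CVE-2002-0001"]["summary"])
```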


On 27 January 2015 at 11:43, Carl Roberts <ca...@gmail.com> wrote:
> Also, if I try full-import and clean=false with the same XML file, I end up
> with more records each time the import runs.  How can I make SOLR just add
> the records that are new by id, and update the ones that have an id that
> matches the one in the existing index?
>
>
>
> On 1/27/15, 11:32 AM, Carl Roberts wrote:
>>
>> Hi,
>>
>> What is the recommended way to import and update index records?
>>
>> I've read the documentation and I've experimented with full-import and
>> delta-import and I am not seeing the desired results.
>>
>> Basically, I have 15 RSS feeds that I am importing through
>> rss-data-config.xml.
>>
>> The first RSS feed should be a full import and the ones that follow may
>> contain the same id, in which case the existing id in the index should be
>> updated from the record in the new RSS feed. Also there may be new records
>> in the RSS feeds that follow the first one, in which case I want them added
>> to the index.
>>
>> When I try full-import for each entity, the index is cleared and I just
>> end up with the records for the last import.
>>
>> When I try full-import for each entity, with the clean=false parameter,
>> all the records from each entity are added to the index and I end up with
>> duplicate records.
>>
>> When I try delta-import for the entities that follow the first one, I don't
>> get any new index records.
>>
>> How should I do this?
>>
>> Regards,
>>
>> Joe
>
>

Re: What is the recommended way to import and update index records?

Posted by Carl Roberts <ca...@gmail.com>.
Also, if I try full-import and clean=false with the same XML file, I end 
up with more records each time the import runs.  How can I make SOLR 
just add the records that are new by id, and update the ones that have 
an id that matches the one in the existing index?


On 1/27/15, 11:32 AM, Carl Roberts wrote:
> Hi,
>
> What is the recommended way to import and update index records?
>
> I've read the documentation and I've experimented with full-import and 
> delta-import and I am not seeing the desired results.
>
> Basically, I have 15 RSS feeds that I am importing through 
> rss-data-config.xml.
>
> The first RSS feed should be a full import and the ones that follow 
> may contain the same id, in which case the existing id in the index 
> should be updated from the record in the new RSS feed. Also there may 
> be new records in the RSS feeds that follow the first one, in which 
> case I want them added to the index.
>
> When I try full-import for each entity, the index is cleared and I 
> just end up with the records for the last import.
>
> When I try full-import for each entity, with the clean=false 
> parameter, all the records from each entity are added to the index and 
> I end up with duplicate records.
>
> When I try delta-import for the entities that follow the first one, I 
> don't get any new index records.
>
> How should I do this?
>
> Regards,
>
> Joe