Posted to solr-user@lucene.apache.org by Luís Portela Afonso <me...@gmail.com> on 2013/09/09 01:42:42 UTC

Data import

Hi,

Is it possible to disable document updates when running the data import full-import command?

Thanks

Re: Data import

Posted by Chris Hostetter <ho...@fucit.org>.
: Any form of indexing would always "replace" a document and never update it.

At a very low level this is true, but Solr does support "Atomic Updates"
(aka "Partial Updates"), which let a client specify only the values of an
existing document that it wants to change; Solr handles everything else
on the server side.

: But I still don't get one thing: if I have two indexes that I try to merge
: and both indexes have some documents with the same unique ids, they don't
: overwrite each other. Instead what I have is two documents with the same unique
: id. Why does this happen? Anyone any clues?

This seems like a completely unrelated question -- please start a new
thread and provide full details of your situation and question in order
for people to try to assist you...


https://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to
an existing message; instead, start a fresh email. Even if you change the
subject line of your email, other mail headers still track which thread
you replied to, so your question is "hidden" in that thread and gets less
attention. It also makes following discussions in the mailing list archives
particularly difficult.



-Hoss

Re: Data import

Posted by "tamanjit.bindra@yahoo.co.in" <ta...@yahoo.co.in>.
Any form of indexing would always "replace" a document and never update it.
If you don't want replacements, don't use a unique key in your schema, and sort
on time/date etc.
 
But I still don't get one thing: if I have two indexes that I try to merge
and both indexes have some documents with the same unique ids, they don't
overwrite each other. Instead what I have is two documents with the same unique
id. Why does this happen? Anyone any clues?




Re: Data import

Posted by Luís Portela Afonso <me...@gmail.com>.
OK, that makes sense, but when Solr runs the data import, it identifies whether a new document has the same uniqueKey as an existing document, right?
Because when the same document exists in the source, Solr deletes the indexed one and creates a new one. Instead of deleting and re-creating it, is it not possible to discard the new document?

On Sep 10, 2013, at 2:16 AM, Alexandre Rafalovitch <ar...@gmail.com> wrote:

> Sounds like you want a custom UpdateRequestProcessor chain that checks if
> the document already exists with given primary key and does not even bother
> passing it on to the next processor in the chain.
> 
> This would make sense as an optimization or as a first step in a complex
> update chain that perhaps uses a lot of external resources to pre-process
> the content (e.g. named entities extraction).
> 
> I don't think such a URP exists at the moment. But it should be simple to
> write one, assuming URPs can do lookups by primary ID and make go/no-go
> decisions on individual documents. Does anybody know the details of this?
> 
> Regards,
>   Alex.
> 
> Personal website: http://www.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all at
> once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
> 


Re: Data import

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Sounds like you want a custom UpdateRequestProcessor chain that checks if
the document already exists with given primary key and does not even bother
passing it on to the next processor in the chain.

This would make sense as an optimization or as a first step in a complex
update chain that perhaps uses a lot of external resources to pre-process
the content (e.g. named entities extraction).

I don't think such a URP exists at the moment. But it should be simple to
write one, assuming URPs can do lookups by primary ID and make go/no-go
decisions on individual documents. Does anybody know the details of this?
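
The go/no-go idea above can be sketched as follows. This is only a minimal
Python model of the decision logic, not Solr's UpdateRequestProcessor API
(real URPs are Java classes). The class name SkipExistingProcessor, the
in-memory set of known ids, the "id" key field, and the next_processor
callback are all hypothetical stand-ins.

```python
class SkipExistingProcessor:
    """Toy model of a "skip if already indexed" step in an update chain.

    Assumption: this is NOT Solr's API. A real processor would look the
    key up in the index; here a plain set stands in for that lookup.
    """

    def __init__(self, existing_ids, next_processor, key_field="id"):
        self.existing_ids = set(existing_ids)   # stand-in for an index lookup
        self.next_processor = next_processor    # next step in the chain
        self.key_field = key_field              # hypothetical unique key field

    def process_add(self, doc):
        """Return True if the document was passed on, False if discarded."""
        key = doc[self.key_field]
        if key in self.existing_ids:
            return False                        # no-go: already indexed
        self.existing_ids.add(key)
        self.next_processor(doc)                # go: hand to the next step
        return True


indexed = []
chain = SkipExistingProcessor({"feed-1"}, indexed.append)
chain.process_add({"id": "feed-1", "title": "old item"})   # discarded
chain.process_add({"id": "feed-2", "title": "new item"})   # passed on
```

In this model only the document with the unseen key reaches the next step;
the one whose key is already known is dropped without being re-indexed.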

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Tue, Sep 10, 2013 at 7:53 AM, Luis Portela Afonso <meligaletiko@gmail.com> wrote:

> But with atomic updates I need to send the information myself, right?
>
> I want Solr to index it automatically, and it is doing that. Can you look
> at the Solr example in the source?
> There is an example in the example-DIH folder.
>
> Imagine that you run the URL to import the data every 15 minutes. If the
> same information is already indexed, Solr will update it, and by update I
> mean delete and index again.
>
> I just want Solr to simply discard the information if it is already
> indexed.

Re: Data import

Posted by Luis Portela Afonso <me...@gmail.com>.
But with atomic updates I need to send the information myself, right?

I want Solr to index it automatically, and it is doing that. Can you look
at the Solr example in the source?
There is an example in the example-DIH folder.

Imagine that you run the URL to import the data every 15 minutes. If the
same information is already indexed, Solr will update it, and by update I
mean delete and index again.

I just want Solr to simply discard the information if it is already
indexed.

On Tuesday, September 10, 2013, Chris Hostetter wrote:

>
> : With cron job, I do a http request using curl, to the address
> : http://localhost:port/solr/core/dataimport/?command=full-import&clean=false
> :
> : When it runs, if the rss source has a feed that is already indexed on solr,
> : it updates the existing source.
> : So if the source has the same information of the destiny, it updates the
> : information on the destiny.
> :
> : I want to prevent that. Is that explicit? I may try to provide some
> : examples.
>
> Yes, specific examples would be helpful -- it's not really clear what it
> is that you want to prevent.
>
> Please note the URL i mentioned before and use it as a guideline for
> how much detail we need to understand what it is you are asking...
>
> : > Can you please be more specific about what you would like to see happen,
> : > we can better understand what your actual goal is?  It's really not clear
>
> : > https://wiki.apache.org/solr/UsingMailingLists
>
>
>
> -Hoss
>



Re: Data import

Posted by Chris Hostetter <ho...@fucit.org>.
: With cron job, I do a http request using curl, to the address
: http://localhost:port/solr/core/dataimport/?command=full-import&clean=false
: 
: When it runs, if the rss source has a feed that is already indexed on solr,
: it updates the existing source.
: So if the source has the same information of the destiny, it updates the
: information on the destiny.
: 
: I want to prevent that. Is that explicit? I may try to provide some
: examples.

Yes, specific examples would be helpful -- it's not really clear what it 
is that you want to prevent.

Please note the URL i mentioned before and use it as a guideline for 
how much detail we need to understand what it is you are asking...

: > Can you please be more specific about what you would like to see happen,
: > we can better understand what your actual goal is?  It's really not clear

: > https://wiki.apache.org/solr/UsingMailingLists



-Hoss

Re: Data import

Posted by Luis Portela Afonso <me...@gmail.com>.
So I'm indexing RSS feeds.
I'm running the data import full-import command with a cron job. It runs
every 15 minutes and indexes a lot of RSS feeds from many sources.

With the cron job, I make an HTTP request using curl to the address
http://localhost:port/solr/core/dataimport/?command=full-import&clean=false
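
For illustration, the same request URL can be assembled programmatically.
This is a minimal sketch using Python's standard library. The function name
dataimport_url is made up, and the host, port and core values are
placeholders; command=full-import and clean=false are taken from the URL
above. The sketch only builds the URL and does not send the request.

```python
from urllib.parse import urlencode


def dataimport_url(host, port, core, command="full-import", clean=False):
    """Build the DataImportHandler request URL used by the cron job."""
    # Solr expects lowercase "true"/"false", so convert the Python bool.
    query = urlencode({"command": command, "clean": str(clean).lower()})
    return f"http://{host}:{port}/solr/{core}/dataimport/?{query}"


# Placeholder values; substitute the real host, port and core name.
url = dataimport_url("localhost", 8983, "core")
```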

When it runs, if the RSS source has a feed item that is already indexed in Solr,
it updates the existing entry.
So if the source has the same information as the destination, it updates the
information in the destination.

I want to prevent that. Is that clear? I may try to provide some
examples.

Thanks

On Tuesday, September 10, 2013, Chris Hostetter wrote:

>
> : When i run "dataimport/?command=full-import&clean=false", solr add new
> : documents with the information. But if the same information already
> : exists with the same uniquekey, it replaces the existing document with a
> : new one.
> : It does not update the document, it creates a new one. It's that possible?
>
> I'm not certain that i'm understanding your question.
>
> It is possible using Atomic Updates, but you have to be explicit
> about what/how you want Solr to use the new information (i.e. when to
> replace, when to add to a multivalued field, when to increment a numeric
> field, etc.)
>
> https://wiki.apache.org/solr/Atomic_Updates
>
> I don't think DIH has any straightforward syntax for letting you
> configure this easily, but as long as you put a "map" in each
> field (e.g. via ScriptTransformer perhaps) containing the single "modifier =>
> value" pair you want applied to that field, it should work.
>
> : I'm indexing rss feeds. I run the rss example that exists in the solr
> : examples, and i does that.
>
> Can you please be more specific about what you would like to see happen,
> so we can better understand what your actual goal is?  It's really not clear
> if using Atomic Updates is the easiest way to achieve what you're after,
> or if I'm just completely misunderstanding your question...
>
> https://wiki.apache.org/solr/UsingMailingLists
>
> -Hoss
>



Re: Data import

Posted by Chris Hostetter <ho...@fucit.org>.
: When i run "dataimport/?command=full-import&clean=false", solr add new 
: documents with the information. But if the same information already 
: exists with the same uniquekey, it replaces the existing document with a 
: new one.
: It does not update the document, it creates a new one. It's that possible?

I'm not certain that I'm understanding your question.

It is possible using Atomic Updates, but you have to be explicit
about what/how you want Solr to use the new information (i.e. when to
replace, when to add to a multivalued field, when to increment a numeric
field, etc.)

https://wiki.apache.org/solr/Atomic_Updates

I don't think DIH has any straightforward syntax for letting you
configure this easily, but as long as you put a "map" in each
field (e.g. via ScriptTransformer perhaps) containing the single "modifier =>
value" pair you want applied to that field, it should work.
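
As a rough illustration of the "modifier => value" shape described above,
here is a minimal Python sketch that builds such a document as JSON. The
field names (title, tags, views), the document id, and the helper name
atomic_update_doc are made up for the example; set, add and inc are the
modifiers described on the Atomic Updates wiki page. It only builds the
payload; how it is sent to Solr or wired into DIH is left out.

```python
import json

# Modifiers assumed here: set (replace), add (append to multivalued), inc.
MODIFIERS = {"set", "add", "inc"}


def atomic_update_doc(doc_id, changes, key_field="id"):
    """Build one atomic-update document.

    changes maps field name -> (modifier, value); each field ends up as a
    single-entry map, e.g. {"title": {"set": "New title"}}.
    """
    doc = {key_field: doc_id}          # the key itself is sent as a plain value
    for field, (modifier, value) in changes.items():
        if modifier not in MODIFIERS:
            raise ValueError(f"unknown modifier: {modifier}")
        doc[field] = {modifier: value}
    return doc


doc = atomic_update_doc(
    "feed-item-1",
    {"title": ("set", "New title"), "tags": ("add", "rss"), "views": ("inc", 1)},
)
payload = json.dumps([doc])            # JSON array of documents
```

Each field other than the key carries exactly one modifier, which is the
part that has to be spelled out explicitly per field.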

: I'm indexing rss feeds. I run the rss example that exists in the solr 
: examples, and i does that.

Can you please be more specific about what you would like to see happen,
so we can better understand what your actual goal is?  It's really not clear
if using Atomic Updates is the easiest way to achieve what you're after,
or if I'm just completely misunderstanding your question...

https://wiki.apache.org/solr/UsingMailingLists

-Hoss

Re: Data import

Posted by Luís Portela Afonso <me...@gmail.com>.
When I run "dataimport/?command=full-import&clean=false", Solr adds new documents with the information. But if the same information already exists with the same uniqueKey, it replaces the existing document with a new one.
It does not update the document; it creates a new one. Is it possible to change that?

I'm indexing RSS feeds. I ran the RSS example that exists in the Solr examples, and it does that.

On Sep 9, 2013, at 4:10 AM, Alexandre Rafalovitch <ar...@gmail.com> wrote:

> What do you specifically mean by the "disable document update"? Do you mean
> in-place update? Or do you mean you want to run the import but not actually
> populate Solr collection with processed documents?
> 
> It might help to explain the business level goal you are trying to achieve.
> Or, specific error that you are perhaps seeing and trying to avoid.
> 
> Regards,
>   Alex.
> 
> 
> Personal website: http://www.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all at
> once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
> 
> 
> On Mon, Sep 9, 2013 at 6:42 AM, Luís Portela Afonso <me...@gmail.com> wrote:
> 
>> Hi,
>> 
>> It's possible to disable document update when running data import,
>> full-import command?
>> 
>> Thanks


Re: Data import

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
What do you specifically mean by the "disable document update"? Do you mean
in-place update? Or do you mean you want to run the import but not actually
populate Solr collection with processed documents?

It might help to explain the business-level goal you are trying to achieve,
or the specific error that you are perhaps seeing and trying to avoid.

Regards,
   Alex.


Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Mon, Sep 9, 2013 at 6:42 AM, Luís Portela Afonso <me...@gmail.com> wrote:

> Hi,
>
> It's possible to disable document update when running data import,
> full-import command?
>
> Thanks