Posted to solr-user@lucene.apache.org by Chris Hostetter <ho...@fucit.org> on 2010/03/02 01:22:07 UTC

Re: Solr Cell and Deduplication - Get ID of doc

: You could create your own unique ID and pass it in with the
: literal.field=value feature.

By which Lance means you could specify a unique value in a different 
field from your uniqueKey field, and then query on that field:value pair 
to get the doc after it's been added -- but that query will only work 
until some other version of the doc (with some other value) overwrites it.  
so you'd essentially have to query for the field:value to look up the 
uniqueKey.
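
A minimal sketch of that round trip, assuming a Solr instance at the same
localhost URL used in the wiki example and a hypothetical secondary field
named upload_token_s (neither the field name nor the helper names come from
this thread). It only builds the two URLs: one for the extract request that
carries the token as a literal, and one for the follow-up query that asks
for the uniqueKey of whichever doc holds that token.

```python
import uuid
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # assumption: host/port from the wiki example
TOKEN_FIELD = "upload_token_s"            # hypothetical secondary field, NOT the uniqueKey


def build_extract_url(token: str) -> str:
    """URL for posting a file to the extract handler with the token as a literal field."""
    params = {f"literal.{TOKEN_FIELD}": token, "commit": "true"}
    return f"{SOLR_BASE}/update/extract?{urlencode(params)}"


def build_lookup_url(token: str, unique_key_field: str = "id") -> str:
    """URL for a query that finds the doc by its token and returns only the uniqueKey."""
    params = {"q": f'{TOKEN_FIELD}:"{token}"', "fl": unique_key_field, "wt": "json"}
    return f"{SOLR_BASE}/select?{urlencode(params)}"


# One fresh token per upload, so the lookup matches exactly one doc.
token = uuid.uuid4().hex
print(build_extract_url(token))
print(build_lookup_url(token))
```

The lookup has to happen before another version of the doc overwrites the
token, which is the limitation described above.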

it seems like it should definitely be feasible for the 
Update RequestHandlers to return the uniqueKeyField values for all the 
added docs (regardless of whether the key was included in the request or 
added by an UpdateProcessor) -- but i'm not sure how that would fit in with 
the SolrJ API.

would you mind opening a feature request in Jira?



-Hoss


Re: Solr Cell and Deduplication - Get ID of doc

Posted by Bill Engle <bi...@gmail.com>.
Thanks for the responses.  This is exactly what I had to resort to.  I will
definitely put in a feature request to get the generated ID back from the
extract request.

I am doing this with PHP cURL for extraction and the PECL Solr extension for
querying.  I then save the unique id and dupe hash in a MySQL table, which I
check after the doc is indexed in Solr.  If it is a dupe, I delete the Solr
record and discard the file.  My problem now is that the dupe hash sometimes
comes back NULL from Solr, although it is there when I check through Solr
Admin.  I am working to isolate this now.

I had to set Solr to ALLOW duplicates because I have to somehow know that
the file is a dupe and then remove the duplicate files on my filesystem.
Based on the extract response I have no way of knowing this if duplicates
are disallowed.
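
For illustration, the check-then-delete bookkeeping described above could
look roughly like the sketch below. It is written in Python rather than PHP,
uses an in-memory SQLite table as a stand-in for the MySQL table, and the
table, column, and function names are all hypothetical. The only Solr-specific
piece is the delete-by-id XML message, which would still have to be posted to
the update handler separately.

```python
import sqlite3
from xml.sax.saxutils import escape


def open_seen_table() -> sqlite3.Connection:
    """In-memory SQLite stands in for the MySQL table (assumption for the sketch)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE seen (dupe_hash TEXT PRIMARY KEY, solr_id TEXT NOT NULL)"
    )
    return conn


def register_doc(conn, solr_id, dupe_hash):
    """Return None if the doc is new, or a Solr delete message if it is a dupe.

    A missing hash (the NULL case) raises instead of being treated as a new doc.
    """
    if not dupe_hash:
        raise ValueError(f"no dupe hash returned for {solr_id}")
    row = conn.execute(
        "SELECT solr_id FROM seen WHERE dupe_hash = ?", (dupe_hash,)
    ).fetchone()
    if row is None:
        # First time this content hash is seen: remember which doc owns it.
        conn.execute(
            "INSERT INTO seen (dupe_hash, solr_id) VALUES (?, ?)",
            (dupe_hash, solr_id),
        )
        return None
    # Same hash already recorded for another doc: delete the newcomer by id.
    return f"<delete><id>{escape(solr_id)}</id></delete>"


conn = open_seen_table()
print(register_doc(conn, "doc1", "9f2c"))  # None
print(register_doc(conn, "doc2", "9f2c"))  # <delete><id>doc2</id></delete>
```

Raising on an empty hash keeps the intermittent NULL case from silently
letting a duplicate through.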

-Bill


On Tue, Mar 2, 2010 at 2:11 AM, Chris Hostetter <ho...@fucit.org> wrote:

>
>
> : To quote from the wiki,
>        ...
> That's all true ... but Bill explicitly said he wanted to use
> SignatureUpdateProcessorFactory to generate a uniqueKey from the content
> field post-extraction so he could dedup documents with the same content
> ... his question was how to get that key after adding a doc.
>
> Using a unique literal.field value will work -- but only as the value of
> a secondary field that he can then query on to get the uniqueKeyField
> value.
>
> -Hoss
>
>

Re: Solr Cell and Deduplication - Get ID of doc

Posted by Chris Hostetter <ho...@fucit.org>.

: To quote from the wiki,
	...
That's all true ... but Bill explicitly said he wanted to use 
SignatureUpdateProcessorFactory to generate a uniqueKey from the content 
field post-extraction so he could dedup documents with the same content 
... his question was how to get that key after adding a doc.

Using a unique literal.field value will work -- but only as the value of 
a secondary field that he can then query on to get the uniqueKeyField 
value.
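
To illustrate why a content-derived key dedups documents, here is a small
sketch that uses an MD5 digest of the extracted text as the key and a plain
dict as a stand-in for the index. It is only an analogy: the function names
are hypothetical, and nothing here claims to reproduce the exact signature
that SignatureUpdateProcessorFactory computes.

```python
import hashlib


def content_signature(text: str) -> str:
    """Hex MD5 digest of the text, used here as a content-derived key."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def index_all(docs):
    """Simulate overwrite-on-same-key: a later doc with equal content replaces the earlier one."""
    index = {}
    for name, text in docs:
        index[content_signature(text)] = name
    return index


docs = [
    ("a.html", "same body"),
    ("b.html", "same body"),       # identical content, so identical key
    ("c.html", "different body"),
]
index = index_all(docs)
print(len(index))  # 2
```

Because the key is computed server-side from the content, the client does
not know it in advance, which is exactly why the question of getting it back
after the add comes up.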



-Hoss


Re: Solr Cell and Deduplication - Get ID of doc

Posted by Lance Norskog <go...@gmail.com>.
To quote from the wiki,
http://wiki.apache.org/solr/ExtractingRequestHandler

curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true' \
  -F "myfile=@tutorial.html"

This runs the extractor on your input file (in this case an HTML
file). It then stores the generated document with the id field (the
uniqueKey declared in schema.xml) set to 'doc1'. This way, you do not
rely on the ExtractingRequestHandler to create a unique key for you.
This command throws away that generated key.




-- 
Lance Norskog
goksron@gmail.com