Posted to solr-user@lucene.apache.org by ji...@ece.ubc.ca on 2015/01/02 21:43:49 UTC

UUIDUpdateProcessorFactory causes repeated documents when uploading csv files?

Happy New Year Everyone :)

I am trying to automatically generate document IDs when indexing a CSV
file that contains multiple documents, one per line. The desired case: if
the CSV file contains 2 lines (each line is a document), then the index
should contain 2 documents.

What I observed: if the CSV file contains 2 lines, then the index
contains 3 documents, because the 1st document is repeated once. An
example output:
<doc>
<str name="col1">doc1</str>
<str name="col2">rank1</str>
<str name="id">randomlyGeneratedId1</str>
</doc>
<doc>
<str name="col1">doc1</str>
<str name="col2">rank1</str>
<str name="id">randomlyGeneratedId2</str>
</doc>
<doc>
<str name="col1">doc2</str>
<str name="col2">rank2</str>
<str name="id">randomlyGeneratedId3</str>
</doc>

And if the CSV file contains 3 lines, then the index contains 6 documents,
because document 1 is repeated 3 times and document 2 is repeated twice,
as follows:
<doc>
<str name="col1">doc1</str>
<str name="col2">rank1</str>
<str name="id">randomlyGeneratedId1</str>
</doc>
<doc>
<str name="col1">doc1</str>
<str name="col2">rank1</str>
<str name="id">randomlyGeneratedId2</str>
</doc>
<doc>
<str name="col1">doc2</str>
<str name="col2">rank2</str>
<str name="id">randomlyGeneratedId3</str>
</doc>
<doc>
<str name="col1">doc1</str>
<str name="col2">rank1</str>
<str name="id">randomlyGeneratedId4</str>
</doc>
<doc>
<str name="col1">doc2</str>
<str name="col2">rank2</str>
<str name="id">randomlyGeneratedId5</str>
</doc>
<doc>
<str name="col1">doc3</str>
<str name="col2">rank3</str>
<str name="id">randomlyGeneratedId6</str>
</doc>

Here's what I have done:
1. In my solrConfig:
<updateRequestProcessorChain name="autoGenId">
  <processor class="solr.UUIDUpdateProcessorFactory">
    <str name="fieldName">doc_key</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">autoGenId</str>
  </lst>
</requestHandler>
2. in schema.xml:
<field name="doc_key" type="string" indexed="true" stored="true"
required="true" multiValued="false"/>
<field name="col1" type="string" indexed="true" stored="true"
required="true" multiValued="false"/>
<field name="col2" type="string" indexed="true" stored="true"
required="true" multiValued="false"/>
<uniqueKey>id</uniqueKey>

This problem doesn't exist when I assign an ID field myself instead of
using the UUIDUpdateProcessorFactory, so I assume the problem is there.
It looks like the CSV file is processed one line at a time, and the index
shows the entire process: each previous line is repeated again in the
output. Is there a way to show not the 'appending of previous lines' but
only the 'final result', so that the total number of indexed documents
matches the number of documents in the CSV file?

Many thanks,
Jia

Re: UUIDUpdateProcessorFactory causes repeated documents when uploading csv files?

Posted by jia gu <ji...@gmail.com>.
Problem solved - it's caused by a system outside of Solr. Thank you all for
the prompt replies! :)
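For the record, the duplicate counts reported above are exactly what you would see if a loader re-sent the entire (growing) file once per new line: since every send generates fresh UUIDs, nothing overwrites anything, and an n-line file ends up as 1 + 2 + ... + n documents. A small sketch of that arithmetic (my guess at the failure mode, not a description of the actual dataload class):

```python
def docs_after_cumulative_posts(n_lines: int) -> int:
    """Total documents indexed if the whole file is re-posted after each
    appended line and every post gets fresh UUIDs (so nothing overwrites)."""
    return n_lines * (n_lines + 1) // 2

# Matches the counts reported in the original post:
print(docs_after_cumulative_posts(2))  # 3 docs from a 2-line file
print(docs_after_cumulative_posts(3))  # 6 docs from a 3-line file
```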


Re: UUIDUpdateProcessorFactory causes repeated documents when uploading csv files?

Posted by Chris Hostetter <ho...@fucit.org>.
: Thank you for your reply Chris :)  Solr is producing the correct result on
: its own. The problem is that I am calling a dataload class to call Solr,
: which worked for assigned ID and composite ID, but not for UUID. Is there a

Sorry -- still confused: are you confirming that you've tracked down the 
problem you are having to a system outside of Solr?  that the problem (of 
duplicate documents) is introduced by your "dataload class" prior to 
sending the docs to Solr?

: place to delete my question on the mailing list?

nope - once the emails have gone out, they've gone out -- just replying 
back and confirming the resolution to the problem you saw is good enough.



-Hoss
http://www.lucidworks.com/

Re: UUIDUpdateProcessorFactory causes repeated documents when uploading csv files?

Posted by jia gu <ji...@gmail.com>.
Thank you for your reply Chris :)  Solr is producing the correct result on
its own. The problem is that I am calling a dataload class to call Solr,
which worked for assigned ID and composite ID, but not for UUID. Is there a
place to delete my question on the mailing list?
Thank you,
Jia


Re: UUIDUpdateProcessorFactory causes repeated documents when uploading csv files?

Posted by Chris Hostetter <ho...@fucit.org>.
: It's a single Solr Instance, and in my files, I used 'doc_key' everywhere,
: but I changed it to "id" in the email I sent out wanting to make it easier
: to read, sorry don't mean to confuse you :)

https://wiki.apache.org/solr/UsingMailingLists

- what version of solr?
- how exactly are you doing the update? curl? post.jar?
- what exactly is the HTTP response from your update?
- what does your log file show during the update?
- what exactly do all of your configs look like? (you said you made a 
change in your email trying to make the data "easier to read"; that 
could easily be masking some other mistake in your actual configs)

I did my best to try and reproduce what you describe, but i had no 
luck -- here's exactly what i did...


hossman@frisbee:~/lucene/lucene-4.10.3_tag$ svn diff
Index: solr/example/solr/collection1/conf/solrconfig.xml
===================================================================
--- solr/example/solr/collection1/conf/solrconfig.xml	(revision 1650199)
+++ solr/example/solr/collection1/conf/solrconfig.xml	(working copy)
@@ -1076,7 +1076,17 @@
          <str name="update.chain">dedupe</str>
        </lst>
        -->
+    <lst name="defaults">
+      <str name="update.chain">autoGenId</str>
+    </lst>
   </requestHandler>
+  <updateRequestProcessorChain name="autoGenId">
+    <processor class="solr.UUIDUpdateProcessorFactory">
+      <str name="fieldName">id</str>
+    </processor>
+    <processor class="solr.LogUpdateProcessorFactory" />
+    <processor class="solr.RunUpdateProcessorFactory" />
+  </updateRequestProcessorChain>
 
   <!-- The following are implicitly added
   <requestHandler name="/update/json" class="solr.UpdateRequestHandler">
hossman@frisbee:~/lucene/lucene-4.10.3_tag$ curl -X POST 'http://localhost:8983/solr/collection1/update?commit=true' -H "Content-Type: application/csv" --data-binary 'foo_s,bar_s
aaa,cat
bbb,dog
ccc,yak
'
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int 
name="QTime">350</int></lst>
</response>
hossman@frisbee:~/lucene/lucene-4.10.3_tag$ curl 'http://localhost:8983/solr/collection1/select?q=*:*&wt=json&indent=true'
{
  "responseHeader":{
    "status":0,
    "QTime":7,
    "params":{
      "indent":"true",
      "q":"*:*",
      "wt":"json"}},
  "response":{"numFound":3,"start":0,"docs":[
      {
        "foo_s":"aaa",
        "bar_s":"cat",
        "id":"025c69cd-6407-4c70-903b-dfde170d373b",
        "_version_":1489692576651935744},
      {
        "foo_s":"bbb",
        "bar_s":"dog",
        "id":"5c7b3d65-1274-4bad-a671-4d643531e2ae",
        "_version_":1489692576673955840},
      {
        "foo_s":"ccc",
        "bar_s":"yak",
        "id":"25a3893f-c538-4b47-aa79-1f4268d66c39",
        "_version_":1489692576673955841}]
  }}







-Hoss
http://www.lucidworks.com/

Re: UUIDUpdateProcessorFactory causes repeated documents when uploading csv files?

Posted by jia gu <ji...@gmail.com>.
It's a single Solr Instance, and in my files, I used 'doc_key' everywhere,
but I changed it to "id" in the email I sent out wanting to make it easier
to read, sorry don't mean to confuse you :)



Re: UUIDUpdateProcessorFactory causes repeated documents when uploading csv files?

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
On 2 January 2015 at 15:43,  <ji...@ece.ubc.ca> wrote:
> <uniqueKey>id</uniqueKey>

Your uniqueKey does not seem to be the 'doc_key' that the URP is asked
to generate. I wonder if that is causing the issue. Are you
deliberately generating a field different from one defined as unique
id?
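If not, one way to align the two (a sketch only, assuming the uniqueKey stays 'id'; keep whichever field name you actually use) is to point the processor at the uniqueKey field itself:

```xml
<updateRequestProcessorChain name="autoGenId">
  <processor class="solr.UUIDUpdateProcessorFactory">
    <!-- generate the value for the uniqueKey field directly -->
    <str name="fieldName">id</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```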

Regards,
   Alex.

----
Sign up for my Solr resources newsletter at http://www.solr-start.com/

Re: UUIDUpdateProcessorFactory causes repeated documents when uploading csv files?

Posted by "Meraj A. Khan" <me...@gmail.com>.
Is this SolrCloud or single Solr Instance?