Posted to solr-user@lucene.apache.org by "O'Shaughnessy, Devon" <de...@ulfoods.com> on 2017/08/07 16:57:23 UTC

Indexing a CSV that contains double quotes

Hello all,

I'm pretty new to Solr, having only worked with it for a couple of weeks, and I'm guessing I'm having a newbie problem of some sort. I'm a little confused about how Solr handles double quotes within strings. Once a day I upload a CSV of item data to Solr; some of the values contain double quotes, and I'm getting errors. I'll do my best to explain my problem.

Here is my schema:

  <field name="Cat1_Description" type="text_en"/>
  <field name="Cat2_Description" type="text_en"/>
  <field name="Cat3_Description" type="text_en"/>
  <field name="Cat1_Facet" type="string"/>
  <field name="Cat2_Facet" type="string"/>
  <field name="Cat3_Facet" type="string"/>
  <field name="Item_Cat1" type="string"/>
  <field name="Item_Cat2" type="string"/>
  <field name="Item_Cat3" type="string"/>
  <field name="Item_Combined" type="string" indexed="false"/>
  <field name="Item_Description" type="text_en"/>
  <field name="Item_Number" type="string" indexed="true" required="true" stored="true"/>
  <field name="Item_Status" type="string"/>
  <field name="Keywords" type="text_en"/>
  <field name="_root_" type="string" docValues="false" indexed="true" stored="false"/>
  <field name="_text_" type="text_general" multiValued="true" indexed="true" stored="false"/>
  <field name="_version_" type="long" indexed="false" stored="false"/>
  <copyField source="*" dest="_text_"/>
  <copyField source="Cat1_Description" dest="Cat1_Facet"/>
  <copyField source="Cat2_Description" dest="Cat2_Facet"/>
  <copyField source="Cat3_Description" dest="Cat3_Facet"/>

The command I am using to update the data:

curl 'http://10.0.1.24:8983/solr/products/update?commit=true' --data-binary @solrItmList.csv -H 'Content-type:application/csv'

This is the error I receive in response:

[cid:745aeeaa-f63d-4eed-8b9b-5ca9d2b258cb]

If for some reason the image doesn't show, it's an XML response with a code of 400, indicating an IOException with the message "CSVLoader: input=null, line=2014, can't read line: 2013 values={NO LINES AVAILABLE}".

In the solr.log file, the java.io.IOException is explained further:

"(line 2013) invalid char between encapsulated token end delimiter"

Here is an example of my data that is coming from the CSV that is giving me trouble.

(Headings at the top of the CSV)
Item Number,Item Description,Item Combined,Item Status,Item Cat1,Cat1 Description,Item Cat2,Cat2 Description,Item Cat3,Cat3 Description,Keywords

(Specific entry that Solr stops at.)
152600,YOGURT "PARFAIT PRO" LF,152600 YOGURT "PARFAIT PRO" LF,A,1002,Dairy,2231,Yogurt,11408,Yogurt Bulk,"PARFAIT INC FAT FOODS FREE GF GLUTEN INC LOW MILL MILLS PARFAIT PRO PRO" SMART SNACK VANILLA VAQNILLA YOGURT

Notice the double quotes in Item Description, Item Combined, and Keywords.

So the strange thing is, if I remove the Keywords field from the schema and generate a CSV that does not include the Keywords data, but otherwise make no other changes, the data loads just fine, even though there are still double quotes in the Item Description and Item Combined fields.
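
A possible explanation, sketched here with Python's csv module rather than Solr's own CSV parser (so treat it as an analogy, not a confirmed description of Solr internals): many CSV parsers treat a double quote as an encapsulator only when it is the first character of a field. In Item Description and Item Combined the quotes sit mid-field and are read as literal characters. The Keywords value starts with a quote, so the parser expects a comma or line end right after the closing quote and instead finds more text. The shortened rows below are made up for illustration.

```python
import csv


def parse_strict(line):
    """Parse one CSV line, rejecting any text that follows a closing quote."""
    return next(csv.reader([line], strict=True))


# Quotes appear mid-field: the field does not start with a quote,
# so the quotes are kept as ordinary characters.
ok_line = '152600,YOGURT "PARFAIT PRO" LF,A'

# The last field starts with a quote, so it is treated as encapsulated;
# the text after the closing quote is then illegal in strict mode.
bad_line = '152600,Yogurt Bulk,"PARFAIT PRO" SMART SNACK'

fields = parse_strict(ok_line)
print(fields)

try:
    parse_strict(bad_line)
    bad_line_rejected = False
except csv.Error as exc:
    bad_line_rejected = True
    print("rejected:", exc)
```

If Solr's parser follows the same convention, that would be consistent with the "invalid char between encapsulated token end delimiter" message quoted above.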

I know there shouldn't be any double quotes in the data, which I am working on getting rectified, but I'm just wondering: why is this an issue with one of my fields but not others, seeing as they have the same data type?
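
Until the source data is cleaned up, one workaround might be to rewrite the file so that embedded quotes are escaped the standard CSV way: wrap the field in double quotes and double each quote inside it. Below is a minimal Python sketch, assuming no field contains a comma or a line break (so every quote can safely be read as a literal character). The name requote is a hypothetical helper, and the sample row is shortened for illustration; this has not been run against the real export.

```python
import csv
import io


def requote(raw_text):
    """Re-emit CSV text with embedded double quotes escaped by doubling.

    Assumption: no field contains a comma or a line break, so reading
    with QUOTE_NONE (every quote is a literal character) is safe.
    """
    reader = csv.reader(io.StringIO(raw_text), quoting=csv.QUOTE_NONE)
    out = io.StringIO()
    # The default writer quotes only fields that need it and doubles
    # any quote characters inside them.
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


sample = '152600,YOGURT "PARFAIT PRO" LF,A\n'
cleaned = requote(sample)
print(cleaned)
```

The cleaned text should then parse under a strict reader, with the original quotes preserved inside the field values.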



Wow, this email ended up really long for such a simple question! Any enlightenment would be much appreciated.

Thanks,



Devon O'Shaughnessy

Developer/Analyst

Upper Lakes Foods

p: 800.879.1265 | ext: 4135

w: upperlakesfoods.com<http://upperlakesfoods.com/>





Re: Indexing a CSV that contains double quotes

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.
Hi Devon,
I mean this:
curl 'http://10.0.1.24:8983/solr/products/update?commit=true&encapsulator="' --data-binary @solrItmList.csv -H 'Content-type:application/csv'
Ahmet
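
Because a literal double quote is not a safe character inside a URL, it may be worth percent-encoding it as %22 when passing the encapsulator parameter. Here is a small Python sketch that builds the same update URL; the host and core name are taken from the command above, and whether the unencoded form also works with curl and Solr has not been verified here.

```python
from urllib.parse import urlencode

# Base update endpoint, copied from the curl command in this thread.
base = "http://10.0.1.24:8983/solr/products/update"

# urlencode percent-encodes the double quote so the URL stays valid.
params = {"commit": "true", "encapsulator": '"'}
url = base + "?" + urlencode(params)
print(url)
```

The resulting URL can be passed to curl in place of the hand-written one.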

Re: Indexing a CSV that contains double quotes

Posted by "O'Shaughnessy, Devon" <de...@ulfoods.com>.
Hi Ahmet,


I'm afraid I don't understand. Do you think you could clarify a little bit?


Thanks,


Devon O'Shaughnessy

Re: Indexing a CSV that contains double quotes

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.
Hi Devon,
I think you need to supply the encapsulator=" parameter-value pair.
Ahmet

