Posted to solr-user@lucene.apache.org by Michael Lackhoff <mi...@lackhoff.de> on 2008/01/04 16:25:34 UTC

Another text I cannot get into SOLR with csv

If the field's value is:
's-Gravenhage
I cannot get it into SOLR with CSV.
I tried to double the single quote/apostrophe or escape it in several
ways but I either get an error or another character (the "escape") in
front of the single quote. Is it not possible to have a field that
begins with an apostrophe/a single quote?
There is no error if the apostrophe is at the end of the field.
Is there anything I could try or do I have to use XML?

-Michael


Re: Another text I cannot get into SOLR with csv

Posted by Michael Lackhoff <mi...@lackhoff.de>.
On 08.01.2008 19:09 Yonik Seeley wrote:

> There is no shorter way, but if you update to the latest solr-dev
> (changes I checked in today), the default will be no encapsulation for
> split fields.

Many thanks, also for your patience!
Do you think the dev-version is ready for production?

-Michael

Re: Another text I cannot get into SOLR with csv

Posted by Yonik Seeley <yo...@apache.org>.
On Jan 8, 2008 12:59 PM, Michael Lackhoff <mi...@lackhoff.de> wrote:
> On 08.01.2008 16:55 Yonik Seeley wrote:
>
> >> - A literal encapsulator should be possible to add by doubling
> >>    it ' => '' but this gives the same error
> >
> > I think you would have to triple it (the first is the encapsulator).
> > Regardless, don't use encapsulation on the split fields unless you
> > have to.
>
> I don't want to use encapsulation; it is just that the character is
> _interpreted_ as encapsulation character and I need a way to tell SOLR
> that it is not.

I understand... I meant that the easiest way to avoid having that
character interpreted as encapsulation is to turn off encapsulation
(by setting it to a char that won't ever match).

The harder way is to escape the encapsulator... per standard CSV, that
requires first encapsulating the field value and then doubling the
encapsulation char where it appears in the value.  So 'a would be
'''a' (three single quotes to start the value).
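The doubling rule described above can be sketched as a small shell helper. This is only an illustration under stated assumptions: `encapsulate` is a hypothetical function name (it is not part of Solr), and the single quote is assumed to be the encapsulator, as in this thread.

```shell
# Hypothetical helper, not part of Solr: wrap a value in single quotes and
# double every single quote inside it (the standard CSV escaping rule).
encapsulate() {
  printf "'%s'\n" "$(printf '%s' "$1" | sed "s/'/''/g")"
}

encapsulate "'a"             # prints '''a'
encapsulate "'s-Gravenhage"  # prints '''s-Gravenhage'
```

The leading quote of the result opens the encapsulated value, and the next two quotes together stand for one literal apostrophe.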

> >> - is it possible to change the split field separator for all fields? The
> >>    URL is getting rather long already.
> >
> > if "f.myfield.separator" is missing, it uses "separator"  (standard
> > per-field parameters).
> > So if everything uses "," you don't have to specify a separator anywhere.
>
> Oh, sorry, I meant encapsulator of course, not separator. The
> encapsulator is the problem and I would like a way shorter than
> &f.myfield1.encapsulator=%00&f.myfield2.encapsulator=%00... for about 20
> fields in addition to the parameters that are necessary to tell SOLR
> that all these are split fields.

There is no shorter way, but if you update to the latest solr-dev
(changes I checked in today), the default will be no encapsulation for
split fields.
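For versions where the per-field parameters are still needed, the long query string can at least be generated instead of typed. The following is a rough sketch, not an official recipe: the field list is hypothetical (`myfield1` and `myfield2` are the placeholders used in this thread, `myfield3` is made up), and the base URL is the local example URL used elsewhere in the thread.

```shell
# Hypothetical field names; substitute the real split fields.
fields="myfield1 myfield2 myfield3"

# Append the split and encapsulator parameters for every field.
params=""
for f in $fields; do
  params="${params}&f.${f}.split=true&f.${f}.encapsulator=%00"
done

url="http://localhost:8983/solr/update/csv?commit=true${params}"
printf '%s\n' "$url"
```

The resulting URL is just as long, but the list of fields then lives in one place and is easy to change.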

-Yonik

Re: Another text I cannot get into SOLR with csv

Posted by Michael Lackhoff <mi...@lackhoff.de>.
On 08.01.2008 16:55 Yonik Seeley wrote:

>> - A literal encapsulator should be possible to add by doubling
>>    it ' => '' but this gives the same error
> 
> I think you would have to triple it (the first is the encapsulator).
> Regardless, don't use encapsulation on the split fields unless you
> have to.

I don't want to use encapsulation; it is just that the character is
_interpreted_ as encapsulation character and I need a way to tell SOLR 
that it is not.

>> - is it possible to change the split field separator for all fields? The
>>    URL is getting rather long already.
> 
> if "f.myfield.separator" is missing, it uses "separator"  (standard
> per-field parameters).
> So if everything uses "," you don't have to specify a separator anywhere.

Oh, sorry, I meant encapsulator of course, not separator. The 
encapsulator is the problem and I would like a way shorter than
&f.myfield1.encapsulator=%00&f.myfield2.encapsulator=%00... for about 20
fields in addition to the parameters that are necessary to tell SOLR 
that all these are split fields.

-Michael

Re: Another text I cannot get into SOLR with csv

Posted by Yonik Seeley <yo...@apache.org>.
On Jan 8, 2008 10:32 AM, Michael Lackhoff <mi...@lackhoff.de> wrote:
> On 08.01.2008 16:11 Yonik Seeley wrote:
>
> > Ahh, wait, it looks like a single quote is the encapsulator for split
> > field values by default.
> > Try adding f.PUBLPLACE.encapsulator=%00
> > to disable the encapsulation.
>
> Hmm. Yes, this works but:
> - I didn't find anything about it in the docs (wiki). On the contrary
>    it suggests that the single quote has to be explicitly set:
>    f.tags.encapsulator='

Right... I just changed the code to match the docs.

> (http://wiki.apache.org/solr/UpdateCSV?#head-c238cb494f800d345766acda16e08d82663127ce)
> - A literal encapsulator should be possible to add by doubling
>    it ' => '' but this gives the same error

I think you would have to triple it (the first is the encapsulator).
Regardless, don't use encapsulation on the split fields unless you
have to.

> - is it possible to change the split field separator for all fields? The
>    URL is getting rather long already.

if "f.myfield.separator" is missing, it uses "separator"  (standard
per-field parameters).
So if everything uses "," you don't have to specify a separator anywhere.

-Yonik

Re: Another text I cannot get into SOLR with csv

Posted by Michael Lackhoff <mi...@lackhoff.de>.
On 08.01.2008 16:11 Yonik Seeley wrote:

> Ahh, wait, it looks like a single quote is the encapsulator for split
> field values by default.
> Try adding f.PUBLPLACE.encapsulator=%00
> to disable the encapsulation.

Hmm. Yes, this works but:
- I didn't find anything about it in the docs (wiki). On the contrary
   it suggests that the single quote has to be explicitly set:
   f.tags.encapsulator='
   (http://wiki.apache.org/solr/UpdateCSV?#head-c238cb494f800d345766acda16e08d82663127ce)
- A literal encapsulator should be possible to add by doubling
   it ' => '' but this gives the same error
- is it possible to change the split field separator for all fields? The
   URL is getting rather long already.


Re: Another text I cannot get into SOLR with csv

Posted by Yonik Seeley <yo...@apache.org>.
On Jan 8, 2008 9:58 AM, Yonik Seeley <yo...@apache.org> wrote:
> On Jan 8, 2008 3:07 AM, Michael Lackhoff <mi...@lackhoff.de> wrote:
> > After a long weekend I could do a deeper look into this one and it looks
> > as if the problem has to do with splitting.
> >
> > > This one works for me fine.
> > >
> > > $ cat t2.csv
> > > id,name
> > > 12345,"'s-Gravenhage"
> > > 12345,'s-Gravenhage
> > > 12345,"""s-Gravenhage"
> > >
> > > $ curl http://localhost:8983/solr/update/csv?commit=true --data-binary
> > > @t2.csv -H 'Content-type:text/csv; charset=utf-8'
> >
> > My csv-file:
> > DBRECORDID,PUBLPLACE
> > 43298,"'s-Gravenhage"
> >
> > The URL (giving a 400 error):
> > http://localhost:8983/solr/update/csv?f.PUBLPLACE.split=true&commit=true
> > (PUBLPLACE is defined as multivalued field)
> >
> > If I remove the "f.PUBLPLACE.split=true" parameter OR make sure that the
> > apostrophe is not the first character, everything is fine.
>
> Indeed... looks like you hit another bug.
> Could you file another bug (this time with Solr)?
> If it turns out to be a commons-csv bug, I'll file another bug there.

Ahh, wait, it looks like a single quote is the encapsulator for split
field values by default.
Try adding f.PUBLPLACE.encapsulator=%00
to disable the encapsulation.
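Put together with the CSV file and URL quoted above, the suggestion might look like the sketch below. It assumes a Solr instance at the localhost URL used in this thread; the temporary file path is made up for the example, and the curl call is left commented out so the snippet runs without a server.

```shell
# Recreate the two-line CSV file from the thread in a temporary directory.
tmpdir=$(mktemp -d)
csv="$tmpdir/publplace.csv"
cat > "$csv" <<'EOF'
DBRECORDID,PUBLPLACE
43298,"'s-Gravenhage"
EOF

# Same URL as before, plus the parameter that disables encapsulation
# for the split values of PUBLPLACE.
url='http://localhost:8983/solr/update/csv?f.PUBLPLACE.split=true&f.PUBLPLACE.encapsulator=%00&commit=true'
printf '%s\n' "$url"

# To send the file to a running Solr instance, uncomment:
# curl "$url" --data-binary @"$csv" -H 'Content-type:text/csv; charset=utf-8'
```

Quoting the URL keeps the shell from interpreting the `?` and `&` characters.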

-Yonik

Re: Another text I cannot get into SOLR with csv

Posted by Yonik Seeley <yo...@apache.org>.
On Jan 8, 2008 3:07 AM, Michael Lackhoff <mi...@lackhoff.de> wrote:
> After a long weekend I could do a deeper look into this one and it looks
> as if the problem has to do with splitting.
>
> > This one works for me fine.
> >
> > $ cat t2.csv
> > id,name
> > 12345,"'s-Gravenhage"
> > 12345,'s-Gravenhage
> > 12345,"""s-Gravenhage"
> >
> > $ curl http://localhost:8983/solr/update/csv?commit=true --data-binary
> > @t2.csv -H 'Content-type:text/csv; charset=utf-8'
>
> My csv-file:
> DBRECORDID,PUBLPLACE
> 43298,"'s-Gravenhage"
>
> The URL (giving a 400 error):
> http://localhost:8983/solr/update/csv?f.PUBLPLACE.split=true&commit=true
> (PUBLPLACE is defined as multivalued field)
>
> If I remove the "f.PUBLPLACE.split=true" parameter OR make sure that the
> apostrophe is not the first character, everything is fine.

Indeed... looks like you hit another bug.
Could you file another bug (this time with Solr)?
If it turns out to be a commons-csv bug, I'll file another bug there.

-Yonik

Re: Another text I cannot get into SOLR with csv

Posted by Michael Lackhoff <mi...@lackhoff.de>.
After a long weekend I could do a deeper look into this one and it looks 
as if the problem has to do with splitting.

> This one works for me fine.
>
> $ cat t2.csv
> id,name
> 12345,"'s-Gravenhage"
> 12345,'s-Gravenhage
> 12345,"""s-Gravenhage"
>
> $ curl http://localhost:8983/solr/update/csv?commit=true --data-binary
> @t2.csv -H 'Content-type:text/csv; charset=utf-8'

My csv-file:
DBRECORDID,PUBLPLACE
43298,"'s-Gravenhage"

The URL (giving a 400 error):
http://localhost:8983/solr/update/csv?f.PUBLPLACE.split=true&commit=true
(PUBLPLACE is defined as multivalued field)

If I remove the "f.PUBLPLACE.split=true" parameter OR make sure that the 
apostrophe is not the first character, everything is fine.
But I need the field to be multivalued and thus need the split parameter 
(not for this record but for others) and as the example shows, some have 
an apostrophe as the first character. Any ideas how to deal with this?

-Michael

Re: Another text I cannot get into SOLR with csv

Posted by Yonik Seeley <yo...@apache.org>.
On Jan 4, 2008 11:18 AM, Michael Lackhoff <mi...@lackhoff.de> wrote:
> On 04.01.2008 16:55 Yonik Seeley wrote:
>
> > On Jan 4, 2008 10:25 AM, Michael Lackhoff <mi...@lackhoff.de> wrote:
> >> If the field's value is:
> >> 's-Gravenhage
> >> I cannot get it into SOLR with CSV.
> >
> > This one works for me fine.
> >
> > $ cat t2.csv
> > id,name
> > 12345,"'s-Gravenhage"
> > 12345,'s-Gravenhage
> > 12345,"""s-Gravenhage"
> >
> > $ curl http://localhost:8983/solr/update/csv?commit=true --data-binary
> > @t2.csv -H 'Content-type:text/csv; charset=utf-8'
>
> But you are cheating ;-) This works for me too but I am using a local
> csv file for the update:
> http://localhost:8983/solr/update/csv?stream.file=t2.csv&separator=%09&f.SIGNATURE.split=true&commit=true

That works for me too if I remove the separator=%09 (since the file
uses comma as a separator and not tab)
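A quick way to avoid this kind of mismatch is to look at the header line before choosing the separator parameter. The sketch below is only an illustration: it recreates the sample t2.csv from this thread in a temporary directory and adds separator=%09 only when the header actually contains a tab. The variable names are made up for the example.

```shell
# Recreate the sample file from the thread (comma-separated).
tmpdir=$(mktemp -d)
cat > "$tmpdir/t2.csv" <<'EOF'
id,name
12345,"'s-Gravenhage"
12345,'s-Gravenhage
12345,"""s-Gravenhage"
EOF

header=$(head -n 1 "$tmpdir/t2.csv")
tab=$(printf '\t')

# Only pass separator=%09 when the header really is tab-separated;
# otherwise rely on the comma default.
case "$header" in
  *"$tab"*) sep_param='&separator=%09' ;;
  *)        sep_param='' ;;
esac

url="http://localhost:8983/solr/update/csv?stream.file=t2.csv${sep_param}&f.SIGNATURE.split=true&commit=true"
printf '%s\n' "$url"
```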

-Yonik

Re: Another text I cannot get into SOLR with csv

Posted by Michael Lackhoff <mi...@lackhoff.de>.
On 04.01.2008 16:55 Yonik Seeley wrote:

> On Jan 4, 2008 10:25 AM, Michael Lackhoff <mi...@lackhoff.de> wrote:
>> If the field's value is:
>> 's-Gravenhage
>> I cannot get it into SOLR with CSV.
> 
> This one works for me fine.
> 
> $ cat t2.csv
> id,name
> 12345,"'s-Gravenhage"
> 12345,'s-Gravenhage
> 12345,"""s-Gravenhage"
> 
> $ curl http://localhost:8983/solr/update/csv?commit=true --data-binary
> @t2.csv -H 'Content-type:text/csv; charset=utf-8'

But you are cheating ;-) This works for me too but I am using a local
csv file for the update:
http://localhost:8983/solr/update/csv?stream.file=t2.csv&separator=%09&f.SIGNATURE.split=true&commit=true

Perhaps the problem is that I cannot define a charset for the stream.file?

-Michael


Re: Another text I cannot get into SOLR with csv

Posted by Yonik Seeley <yo...@apache.org>.
On Jan 4, 2008 10:25 AM, Michael Lackhoff <mi...@lackhoff.de> wrote:
> If the field's value is:
> 's-Gravenhage
> I cannot get it into SOLR with CSV.

This one works for me fine.

$ cat t2.csv
id,name
12345,"'s-Gravenhage"
12345,'s-Gravenhage
12345,"""s-Gravenhage"

$ curl 'http://localhost:8983/solr/update/csv?commit=true' --data-binary @t2.csv -H 'Content-type:text/csv; charset=utf-8'
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int
name="QTime">78</int></lst>
</response>

-Yonik

Re: Another text I cannot get into SOLR with csv

Posted by Ryan McKinley <ry...@gmail.com>.
Michael Lackhoff wrote:
> If the field's value is:
> 's-Gravenhage
> I cannot get it into SOLR with CSV.
> I tried to double the single quote/apostrophe or escape it in several
> ways but I either get an error or another character (the "escape") in
> front of the single quote. Is it not possible to have a field that
> begins with an apostrophe/a single quote?
> There is no error if the apostrophe is at the end of the field.
> Is there anything I could try or do I have to use XML?
> 

Can you open your .csv file in Excel or an equivalent?  If so, that should
handle all escaping issues for you...

ryan