Posted to solr-dev@lucene.apache.org by Yonik Seeley <yo...@apache.org> on 2007/03/30 23:41:26 UTC

Re: [Solr Wiki] Update of "UpdateCSV" by YonikSeeley

Any comments on the CSV parameters, while the paint is still fresh?
Specifically, what about the default of commit=true?  Seems to make
sense for large CSV uploads, but not for small ones.  Should it be
"false" for consistency with the XML update handler???

The docs also reference a currently non-existent page about different
ways to upload data (POST binary, stream.url, stream.file, etc...)

-Yonik

On 3/30/07, Apache Wiki <wi...@apache.org> wrote:
> Dear Wiki user,
>
> You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.
>
> The following page has been changed by YonikSeeley:
> http://wiki.apache.org/solr/UpdateCSV

Re: [Solr Wiki] Update of "UpdateCSV" by YonikSeeley

Posted by Chris Hostetter <ho...@fucit.org>.
: Any comments on the CSV parameters, while the paint is still fresh?

the params look sane to me, but two things should have their documentation
clarified...

1) relationship between "header=true" and "skip=N" ... are rows 0 to N
skipped first, and then row N+1 is ignored as the header; or is row 0
ignored as a header and then rows 1 to N+1 skipped; or do they overlap, so
that regardless of the value of "header", skip=N means rows 0 to N are
skipped (essentially making skip=1 unnecessary if header=true)?

2) what escaping (if any) is understood by the CSV parser?  how does the
escaping change if an alternate encapsulator is specified?



-Hoss


Re: [Solr Wiki] Update of "UpdateCSV" by YonikSeeley

Posted by Yonik Seeley <yo...@apache.org>.
On 3/30/07, Ryan McKinley <ry...@gmail.com> wrote:
> I've been away for a couple days...  I haven't had a chance to try the
> CSV loader, but it looks good.
>
> I just updated SOLR-185 so that the XML, CSV, etc. loaders can share the way
> they commit at the end of a request.

Cool, probably a lot of stuff should be shared and factored out going forward...
"updateable" documents should be consistent across CSV, SQL, XML, etc.

> Damn, just realized that breaks your tests testCommitFalse()/testCommitTrue()...

I'm not sure if I should default commit to "true" anyway...

-Yonik

Re: [Solr Wiki] Update of "UpdateCSV" by YonikSeeley

Posted by Ryan McKinley <ry...@gmail.com>.
I've been away for a couple days...  I haven't had a chance to try the
CSV loader, but it looks good.

I just updated SOLR-185 so that the XML, CSV, etc. loaders can share the way
they commit at the end of a request.

Damn, just realized that breaks your tests testCommitFalse()/testCommitTrue()...

ryan



On 3/30/07, Yonik Seeley <yo...@apache.org> wrote:
> Any comments on the CSV parameters, while the paint is still fresh?
> Specifically, what about the default of commit=true?  Seems to make
> sense for large CSV uploads, but not for small ones.  Should it be
> "false" for consistency with the XML update handler???
>
> The docs also reference a currently non-existent page about different
> ways to upload data (POST binary, stream.url, stream.file, etc...)
>
> -Yonik
>
> On 3/30/07, Apache Wiki <wi...@apache.org> wrote:
> > Dear Wiki user,
> >
> > You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.
> >
> > The following page has been changed by YonikSeeley:
> > http://wiki.apache.org/solr/UpdateCSV
>

Re: [Solr Wiki] Update of "UpdateCSV" by YonikSeeley

Posted by Chris Hostetter <ho...@fucit.org>.
: Think of skipLines as a way to ignore junk before the start of *any* CSV data.
: Lines are always skipped before any header/data is read.  While the
: implementations are related (skipping a header just increments
: skipLines), the concepts are independent.

yeah ... see that's not what i would have assumed at all ... i would have
thought that your CSV file is expected to be "clean" with the first line
being an optional header, and skipLines is intended as a way for you to
say "ignore the first 105678 records of CSV data, i know they are old and
already indexed, start at record 105679"

: I'll re-read the wiki entries and see if I can clarify...

moving skipLines to the top of the list and clarifying that it strips off raw
lines before any header/data processing is done should do the trick.


-Hoss


Re: [Solr Wiki] Update of "UpdateCSV" by YonikSeeley

Posted by Yonik Seeley <yo...@apache.org>.
On 3/31/07, Chris Hostetter <ho...@fucit.org> wrote:
>
> : - skipLines specifies number of lines to skip before CSV data starts
>
> based purely on reading the wiki, it's still not clear to me how
> skipLines=N and header=true interact ... i assume the header is popped off
> first, but it would be good to be explicit (and mention each option in the
> description of the other so people who see skipLines first don't assume
> that's the only way to ignore a header line)

Think of skipLines as a way to ignore junk before the start of *any* CSV data.
Lines are always skipped before any header/data is read.  While the
implementations are related (skipping a header just increments
skipLines), the concepts are independent.
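
For instance (made-up file, purely to illustrate): say a dump starts with
two junk lines before the header,

  # exported 2007-03-31
  # source: inventory dump
  id,name,price
  10,Solr,0.00

then skipLines=2 strips the two junk lines first, header=true then consumes
the id,name,price line, and indexing starts at the first data row:

  # /tmp/dump.csv is a placeholder path
  curl 'http://localhost:8983/solr/update/csv?skipLines=2&header=true&stream.file=/tmp/dump.csv'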

I'll re-read the wiki entries and see if I can clarify...

-Yonik

Re: [Solr Wiki] Update of "UpdateCSV" by YonikSeeley

Posted by Chris Hostetter <ho...@fucit.org>.
: - skipLines specifies number of lines to skip before CSV data starts

based purely on reading the wiki, it's still not clear to me how
skipLines=N and header=true interact ... i assume the header is popped off
first, but it would be good to be explicit (and mention each option in the
description of the other so people who see skipLines first don't assume
that's the only way to ignore a header line)


-Hoss


Re: [Solr Wiki] Update of "UpdateCSV" by YonikSeeley

Posted by Yonik Seeley <yo...@apache.org>.
Changes committed, docs updated.

http://wiki.apache.org/solr/UpdateCSV

- commit defaults to false,
- skip specifies fields to skip,
- skipLines specifies number of lines to skip before CSV data starts
- tested that zero length field names skip the field
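
For example, a one-shot load with the new names might look something like
this (the path and skipped field name are placeholders):

  # /tmp/data.csv and junk_field are placeholders
  curl 'http://localhost:8983/solr/update/csv?commit=true&skipLines=2&skip=junk_field&stream.file=/tmp/data.csv'

Since commit now defaults to false, it has to be requested explicitly.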

-Yonik

Re: [Solr Wiki] Update of "UpdateCSV" by YonikSeeley

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Mar 31, 2007, at 11:48 AM, Yonik Seeley wrote:

> On 3/31/07, Erik Hatcher <er...@ehatchersolutions.com> wrote:
>> On a tab-delimited file I just got from a client, I got this error:
>>
>> SEVERE: java.io.IOException: (line 119986) invalid char between
>> encapsualted token end delimiter
>>          at org.apache.commons.csv.CSVParser.encapsulatedTokenLexer
>> (CSVParser.java:499)
>>
>> This may just be a problem with the file,
>
> It sounds like there is a field that looks like it's encapsulated, but
> then has some other non-whitespace characters after that.
>
> I was able to reproduce your exception via:
> curl 'http://localhost:8983/solr/update/csv?stream.body=id,name%0A"10"oops,wow'
>
> Notice the oops after the quoted 10.
>
> Is your file a "real" CSV file?

It's tab-delimited, no encapsulation stuff going on either, just  
simply tabs separating fields.

> If there is no escaping at all (no tabs in field values, no newlines,
> etc), perhaps try setting the encapsulator to something that won't
> occur in the file.

voila!

I used &encapsulator=%1f and a few minutes later ~1.8M records were  
indexed!
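
For reference, the full command was roughly the same as before, just with
the extra parameter added:

  curl "http://localhost:8983/solr/update/csv?stream.file=/Users/erik/Desktop/data.txt&separator=%09&encapsulator=%1f&fieldnames=id,name_text,title_text,qty_display,price_display,config_display,category_facet"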

>> I have another tab-delimited file to bring in, but only some of the
>> columns should be imported.  Is it possible with this loader to skip
>> over columns in the data file not desired in Solr?  Certainly I can
>> transform the file before loading, so it's not a problem, just
>> curious.
>
> LOL... I did implement that originally, and then forgot about it.
> The "skip" param already implemented skipping particular fields, and
> then I went and added code to read "skip" as skipLines.  I'll fix
> that.
>
> The other way to skip fields is to give them a zero length name.
> So if you wanted to skip the second column, use
> fieldnames=id,,title_text,qty_display,etc

Very cool!

This CSV importer will prove very handy.  Already has.

	Erik

Re: [Solr Wiki] Update of "UpdateCSV" by YonikSeeley

Posted by Yonik Seeley <yo...@apache.org>.
On 3/31/07, Erik Hatcher <er...@ehatchersolutions.com> wrote:
> On a tab-delimited file I just got from a client, I got this error:
>
> SEVERE: java.io.IOException: (line 119986) invalid char between
> encapsualted token end delimiter
>          at org.apache.commons.csv.CSVParser.encapsulatedTokenLexer
> (CSVParser.java:499)
>
> This may just be a problem with the file,

It sounds like there is a field that looks like it's encapsulated, but
then has some other non-whitespace characters after that.

I was able to reproduce your exception via:
curl 'http://localhost:8983/solr/update/csv?stream.body=id,name%0A"10"oops,wow'

Notice the oops after the quoted 10.

Is your file a "real" CSV file?  How is escaping handled?
If there is no escaping at all (no tabs in field values, no newlines,
etc), perhaps try setting the encapsulator to something that won't
occur in the file.

> or perhaps I need to
> specify an encoding (not quite sure what it is on that file, but it
> doesn't appear to be UTF8 as TextEdit complained about it).  The file
> is brand new to me, and fairly large (~150MB).  The command I'm using
> to import is:
>
>         curl "http://localhost:8983/solr/update/csv?stream.file=/Users/erik/Desktop/data.txt&separator=%09&fieldnames=id,name_text,title_text,qty_display,price_display,config_display,category_facet"
>
> I have another tab-delimited file to bring in, but only some of the
> columns should be imported.  Is it possible with this loader to skip
> over columns in the data file not desired in Solr?  Certainly I can
> transform the file before loading, so it's not a problem, just curious.

LOL... I did implement that originally, and then forgot about it.
The "skip" param already implemented skipping particular fields, and
then I went and added code to read "skip" as skipLines.  I'll fix
that.

The other way to skip fields is to give them a zero length name.
So if you wanted to skip the second column, use
fieldnames=id,,title_text,qty_display,etc
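
e.g. for a tab-separated file, something along these lines (the path and
field names are placeholders):

  # /tmp/data.txt and the field names are placeholders
  curl 'http://localhost:8983/solr/update/csv?stream.file=/tmp/data.txt&separator=%09&fieldnames=id,,title_text,qty_display'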

I'll document that.

Thanks for refreshing my memory :-)

-Yonik

Re: [Solr Wiki] Update of "UpdateCSV" by YonikSeeley

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
This is really great!

I agree with Hoss, keep commit=false as the default and let the client
control how commits work.

On a tab-delimited file I just got from a client, I got this error:

SEVERE: java.io.IOException: (line 119986) invalid char between  
encapsualted token end delimiter
         at org.apache.commons.csv.CSVParser.encapsulatedTokenLexer 
(CSVParser.java:499)

This may just be a problem with the file, or perhaps I need to  
specify an encoding (not quite sure what it is on that file, but it  
doesn't appear to be UTF8 as TextEdit complained about it).  The file  
is brand new to me, and fairly large (~150MB).  The command I'm using  
to import is:

	curl "http://localhost:8983/solr/update/csv?stream.file=/Users/erik/ 
Desktop/data.txt&separator=% 
09&fieldnames=id,name_text,title_text,qty_display,price_display,config_d 
isplay,category_facet"

I have another tab-delimited file to bring in, but only some of the  
columns should be imported.  Is it possible with this loader to skip  
over columns in the data file not desired in Solr?  Certainly I can  
transform the file before loading, so it's not a problem, just curious.

Thanks again for another great piece of capability in Solr.  You all  
are amazing.

	Erik



On Mar 30, 2007, at 5:41 PM, Yonik Seeley wrote:

> Any comments on the CSV parameters, while the paint is still fresh?
> Specifically, what about the default of commit=true?  Seems to make
> sense for large CSV uploads, but not for small ones.  Should it be
> "false" for consistency with the XML update handler???
>
> The docs also reference a currently non-existent page about different
> ways to upload data (POST binary, stream.url, stream.file, etc...)
>
> -Yonik
>
> On 3/30/07, Apache Wiki <wi...@apache.org> wrote:
>> Dear Wiki user,
>>
>> You have subscribed to a wiki page or wiki category on "Solr Wiki"  
>> for change notification.
>>
>> The following page has been changed by YonikSeeley:
>> http://wiki.apache.org/solr/UpdateCSV


Re: [Solr Wiki] Update of "UpdateCSV" by YonikSeeley

Posted by Chris Hostetter <ho...@fucit.org>.
: Specifically, what about the default of commit=true?  Seems to make
: sense for large CSV uploads, but not for small ones.  Should it be
: "false" for consistency with the XML update handler???

in general i think hard-coded defaults should be "do less magic" ... if
people want the "default" for their instance to be commit=true, they can
add it as a default in their solrconfig.
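
e.g. something along these lines on the CSV handler (assuming it's
registered as /update/csv with solr.CSVRequestHandler, as in the example
solrconfig):

  <!-- handler name and class assumed from the example solrconfig -->
  <requestHandler name="/update/csv" class="solr.CSVRequestHandler">
    <lst name="defaults">
      <str name="commit">true</str>
    </lst>
  </requestHandler>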


-Hoss