You are viewing a plain text version of this content. The canonical link for it is here.
Posted to solr-dev@lucene.apache.org by Jason Cater <ja...@ncsmags.com> on 2007/05/29 21:41:37 UTC

Re: Commented: (SOLR-216) Improvements to solr.py

Mike, 

I've had my solr.py in production use internally for about a month now. 
So, as you can imagine, I've worked through a few oddball bugs that 
occasionally pop up.  It seems pretty stable for me.

I'm planning to upload a new file attachment to this issue containing my 
changes, plus fixing the bug reports that were filed against my open 
ticket.  But first, a few quick questions....

I would prefer to have a complete directory structure (i.e., setup.py, 
unit tests, samples, etc) instead of just the solr.py file.  Would 
anyone see a problem with this?

Also, on some of your comments:

>  - list comprehensions solely to perform looped execution are harder to parse and slower than explicitly writing a for loop
>   

List comprehensions seem to be a matter of contention for some.  
However, it's a battle I'm not interested in fighting, so changed it to 
a for loop.

>  - shadowing builtins is generally a bad idea
>   

Any shadowing of builtins was unintentional.  Did you see specific 
examples?  I run the code through pychecker and pylint to try to avoid 
such cases.

>  - SolrConnection is an old-style class, but Response is new-style 
>   

This was a holdover from the old SolrConnection class I copied from. I'm 
fixing this.

> functionality:
>
>  - why are 'status'/'QTime' returned as floats?
>   

This was just a misunderstanding on my part of what QTime was actually 
returning.  Fixing.

>  - all NamedList's appearing in the output are converted to dicts--this loses information (in particular, it will be unnecessarily hard for the user to use highlighting/debug data).  Using the python/json response format would prevent this.  Not returning highlight/debug data in the standard response format (and yet providing said parameters in the query() method) seems odd.  Am I missing something?  Oh, they are set as dynamic attributes of Response, I see.  Definitely needs documentation.
>   

Yes, this needs to be documented.  (Please c.f. to my question about 
allowing a complete directory structure.)

>  - passing fields='' to query() will return all fields, when the desired return is likely no fields
>   

I've changed the default for fields= to be '*', instead of None or "".  
This way, passing in 'fields=""' will result in 'fl=' being passed to 
the backend.  However, I still don't see the point, as passing both 
'fl=' and 'fl=*' return the exact same set of fields (i.e., "all") on my 
test setup.

>  - it might be better to settle on an api that permits doc/field boosts.  How about using a tuple as the field name in the field dict?
>
> conn.add_many([{'id': 1, ('field2', 2.33): u"some text"}])
>
> doc boosts could be handled by optionally providing the fielddict as a (<fielddict>, boost) tuple.
>   

I agree. I was not aware of field boosts at the time. I'll code this change.

> - for 2.5+, a cool addition might be:
>
> if sys.version > 2.5
>    import contextlib   
>    def batched(solrconn):
>           solrconn.begin_batch()
> 	yield solrconn
> 	solrconn.end_batch()
>   batched = contextlib.contextmanager(batched)
>
> Use as:
>
> with batched(solrconn):
>        solrconn.add(...)
>        solrconn.add(...)
>        solrconn.add(...)
>   

Adding...

Re: Commented: (SOLR-216) Improvements to solr.py

Posted by Mike Klaas <mi...@gmail.com>.
On 30-May-07, at 12:43 PM, Erik Hatcher wrote:

>
> On May 29, 2007, at 4:38 PM, Mike Klaas wrote:
>>> I agree. I was not aware of field boosts at the time. I'll code  
>>> this change.
>>
>> Unfortunately, it is still somewhat awkward.  In my python client  
>> I end up passing (<name>, <value>, <field boost or None>)  
>> everywhere, but that clutters up the api considerably.
>>
>> It might be worth taking a look at the ruby client to see what  
>> Eric's done for the api.
>
> In the ruby client, we have the ability to use a Ruby Hash as a  
> document when no boosts are needed:
>
>    doc = {:field => "value"}
>
> But also use the Field object when boosts are desired:
>
>    doc = Solr::Document.new
>    doc << Solr::Field.new(:field => "value", :boost => 3.0)
>
> The boost stuff was contributed by Coda Hale (credit where it's  
> due :) - I had punted on that initially.

I think I prefer the approach that I suggested for the python  
client.  The ruby approach makes sense if you want to define a proper  
Document class with Fields and such.  99% of the time a map is  
sufficient, and tuple fieldname keys allow maps to be slightly  
extended to support boost functionality.

-Mike

Re: Commented: (SOLR-216) Improvements to solr.py

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On May 29, 2007, at 4:38 PM, Mike Klaas wrote:
>> I agree. I was not aware of field boosts at the time. I'll code  
>> this change.
>
> Unfortunately, it is still somewhat awkward.  In my python client I  
> end up passing (<name>, <value>, <field boost or None>) everywhere,  
> but that clutters up the api considerably.
>
> It might be worth taking a look at the ruby client to see what  
> Eric's done for the api.

In the ruby client, we have the ability to use a Ruby Hash as a  
document when no boosts are needed:

    doc = {:field => "value"}

But also use the Field object when boosts are desired:

    doc = Solr::Document.new
    doc << Solr::Field.new(:field => "value", :boost => 3.0)

The boost stuff was contributed by Coda Hale (credit where it's  
due :) - I had punted on that initially.

	Erik



Re: Commented: (SOLR-216) Improvements to solr.py

Posted by Mike Klaas <mi...@gmail.com>.
[reposting to solr-dev as JIRA destroyed my quoting...]

On 29-May-07, at 12:41 PM, Jason Cater wrote:
> I've had my solr.py in production use internally for about a month  
> now. So, as you can imagine, I've worked through a few oddball bugs  
> that occasionally pop up.  It seems pretty stable for me.

Yes, I agree that it is looking good.  Since we would be replacing  
the existing implementation completely, I think that it is worth  
taking extra care and examining the api choices carefully so we won't  
have to replace it or deprecate things in the near future.

> I would prefer to have a complete directory structure (i.e.,  
> setup.py, unit tests, samples, etc) instead of just the solr.py  
> file.  Would anyone see a problem with this?

+1.  This would be great--a unittest that could be run against the  
solr example would be spectacular!

> Also, on some of your comments:
>
>>  - list comprehensions solely to perform looped execution are  
>> harder to parse and slower than explicitly writing a for loop
>>
>
> List comprehensions seem to be a matter of contention for some.   
> However, it's a battle I'm not interested in fighting, so changed  
> it to a for loop.

It is not a matter of contention for me for use in creating a list,  
but ISTM less clear and less efficient if the purpose is _solely_ to  
perform a loop:

$ python -m timeit '[i+i for i in xrange(10000)]'
100 loops, best of 3: 1.95 msec per loop

$ python -m timeit 'for i in xrange(10000): i+i'
100 loops, best of 3: 1.38 msec per loop

>>  - shadowing builtins is generally a bad idea
> Any shadowing of builtins was unintentional.  Did you see specific  
> examples?  I run the code through pychecker and pylint to try to  
> avoid such cases.

`id` is shadowed in a few places.
>>

>>  - all NamedList's appearing in the output are converted to dicts-- 
>> this loses information (in particular, it will be unnecessarily  
>> hard for the user to use highlighting/debug data).  Using the  
>> python/json response format would prevent this.  Not returning  
>> highlight/debug data in the standard response format (and yet  
>> providing said parameters in the query() method) seems odd.  Am I  
>> missing something?  Oh, they are set as dynamic attributes of  
>> Response, I see.  Definitely needs documentation.
>>
>
> Yes, this needs to be documented.  (Please c.f. to my question  
> about allowing a complete directory structure.)
>
>>  - passing fields='' to query() will return all fields, when the  
>> desired return is likely no fields
>>
>
> I've changed the default for fields= to be '*', instead of None or  
> "".  This way, passing in 'fields=""' will result in 'fl=' being  
> passed to the backend.  However, I still don't see the point, as  
> passing both 'fl=' and 'fl=*' return the exact same set of fields  
> (i.e., "all") on my test setup.

Hmm, what if you pass fields='', score=True?  Ideally tha would pass  
fl=score to the backend, bypassing all stored fields.

>>  - it might be better to settle on an api that permits doc/field  
>> boosts.  How about using a tuple as the field name in the field dict?
>>
>> conn.add_many([{'id': 1, ('field2', 2.33): u"some text"}])
>>
>> doc boosts could be handled by optionally providing the fielddict  
>> as a (<fielddict>, boost) tuple.
>>
>
> I agree. I was not aware of field boosts at the time. I'll code  
> this change.

Unfortunately, it is still somewhat awkward.  In my python client I  
end up passing (<name>, <value>, <field boost or None>) everywhere,  
but that clutters up the api considerably.

It might be worth taking a look at the ruby client to see what Eric's  
done for the api.

>> - for 2.5+, a cool addition might be:
>>
>> if sys.version > 2.5
>>    import contextlib      def batched(solrconn):
>>           solrconn.begin_batch()
>> 	yield solrconn
>> 	solrconn.end_batch()
>>   batched = contextlib.contextmanager(batched)
>>
>> Use as:
>>
>> with batched(solrconn):
>>        solrconn.add(...)
>>        solrconn.add(...)
>>        solrconn.add(...)
>>
>
> Adding...

Unfortunately, it does push the required python version to 2.4.   
Personally, I think that requiring 2.4 is not unreasonable, but I'm  
somewhat of a bleeding edge guy...

-Mike