You are viewing a plain text version of this content. The canonical link for it is here.
Posted to solr-dev@lucene.apache.org by Jason Cater <ja...@ncsmags.com> on 2007/05/29 21:41:37 UTC
Re: Commented: (SOLR-216) Improvements to solr.py
Mike,
I've had my solr.py in production use internally for about a month now.
So, as you can imagine, I've worked through a few oddball bugs that
occasionally pop up. It seems pretty stable for me.
I'm planning to upload a new file attachment to this issue containing my
changes, plus fixing the bug reports that were filed against my open
ticket. But first, a few quick questions....
I would prefer to have a complete directory structure (i.e., setup.py,
unit tests, samples, etc) instead of just the solr.py file. Would
anyone see a problem with this?
Also, on some of your comments:
> - list comprehensions solely to perform looped execution are harder to parse and slower than explicitly writing a for loop
>
List comprehensions seem to be a matter of contention for some.
However, it's a battle I'm not interested in fighting, so changed it to
a for loop.
> - shadowing builtins is generally a bad idea
>
Any shadowing of builtins was unintentional. Did you see specific
examples? I run the code through pychecker and pylint to try to avoid
such cases.
> - SolrConnection is an old-style class, but Response is new-style
>
This was a holdover from the old SolrConnection class I copied from. I'm
fixing this.
> functionality:
>
> - why are 'status'/'QTime' returned as floats?
>
This was just a misunderstanding on my part of what QTime was actually
returning. Fixing.
> - all NamedList's appearing in the output are converted to dicts--this loses information (in particular, it will be unnecessarily hard for the user to use highlighting/debug data). Using the python/json response format would prevent this. Not returning highlight/debug data in the standard response format (and yet providing said parameters in the query() method) seems odd. Am I missing something? Oh, they are set as dynamic attributes of Response, I see. Definitely needs documentation.
>
Yes, this needs to be documented. (Please c.f. to my question about
allowing a complete directory structure.)
> - passing fields='' to query() will return all fields, when the desired return is likely no fields
>
I've changed the default for fields= to be '*', instead of None or "".
This way, passing in 'fields=""' will result in 'fl=' being passed to
the backend. However, I still don't see the point, as passing both
'fl=' and 'fl=*' return the exact same set of fields (i.e., "all") on my
test setup.
> - it might be better to settle on an api that permits doc/field boosts. How about using a tuple as the field name in the field dict?
>
> conn.add_many([{'id': 1, ('field2', 2.33): u"some text"}])
>
> doc boosts could be handled by optionally providing the fielddict as a (<fielddict>, boost) tuple.
>
I agree. I was not aware of field boosts at the time. I'll code this change.
> - for 2.5+, a cool addition might be:
>
> if sys.version > 2.5
> import contextlib
> def batched(solrconn):
> solrconn.begin_batch()
> yield solrconn
> solrconn.end_batch()
> batched = contextlib.contextmanager(batched)
>
> Use as:
>
> with batched(solrconn):
> solrconn.add(...)
> solrconn.add(...)
> solrconn.add(...)
>
Adding...
Re: Commented: (SOLR-216) Improvements to solr.py
Posted by Mike Klaas <mi...@gmail.com>.
On 30-May-07, at 12:43 PM, Erik Hatcher wrote:
>
> On May 29, 2007, at 4:38 PM, Mike Klaas wrote:
>>> I agree. I was not aware of field boosts at the time. I'll code
>>> this change.
>>
>> Unfortunately, it is still somewhat awkward. In my python client
>> I end up passing (<name>, <value>, <field boost or None>)
>> everywhere, but that clutters up the api considerably.
>>
>> It might be worth taking a look at the ruby client to see what
>> Eric's done for the api.
>
> In the ruby client, we have the ability to use a Ruby Hash as a
> document when no boosts are needed:
>
> doc = {:field => "value"}
>
> But also use the Field object when boosts are desired:
>
> doc = Solr::Document.new
> doc << Solr::Field.new(:field => "value", :boost => 3.0)
>
> The boost stuff was contributed by Coda Hale (credit where it's
> due :) - I had punted on that initially.
I think I prefer the approach that I suggested for the python
client. The ruby approach makes sense if you want to define a proper
Document class with Fields and such. 99% of the time a map is
sufficient, and tuple fieldname keys allow maps to be slightly
extended to support boost functionality.
-Mike
Re: Commented: (SOLR-216) Improvements to solr.py
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On May 29, 2007, at 4:38 PM, Mike Klaas wrote:
>> I agree. I was not aware of field boosts at the time. I'll code
>> this change.
>
> Unfortunately, it is still somewhat awkward. In my python client I
> end up passing (<name>, <value>, <field boost or None>) everywhere,
> but that clutters up the api considerably.
>
> It might be worth taking a look at the ruby client to see what
> Eric's done for the api.
In the ruby client, we have the ability to use a Ruby Hash as a
document when no boosts are needed:
doc = {:field => "value"}
But also use the Field object when boosts are desired:
doc = Solr::Document.new
doc << Solr::Field.new(:field => "value", :boost => 3.0)
The boost stuff was contributed by Coda Hale (credit where it's
due :) - I had punted on that initially.
Erik
Re: Commented: (SOLR-216) Improvements to solr.py
Posted by Mike Klaas <mi...@gmail.com>.
[reposting to solr-dev as JIRA destroyed my quoting...]
On 29-May-07, at 12:41 PM, Jason Cater wrote:
> I've had my solr.py in production use internally for about a month
> now. So, as you can imagine, I've worked through a few oddball bugs
> that occasionally pop up. It seems pretty stable for me.
Yes, I agree that it is looking good. Since we would be replacing
the existing implementation completely, I think that it is worth
taking extra care and examining the api choices carefully so we won't
have to replace it or deprecate things in the near future.
> I would prefer to have a complete directory structure (i.e.,
> setup.py, unit tests, samples, etc) instead of just the solr.py
> file. Would anyone see a problem with this?
+1. This would be great--a unittest that could be run against the
solr example would be spectacular!
> Also, on some of your comments:
>
>> - list comprehensions solely to perform looped execution are
>> harder to parse and slower than explicitly writing a for loop
>>
>
> List comprehensions seem to be a matter of contention for some.
> However, it's a battle I'm not interested in fighting, so changed
> it to a for loop.
It is not a matter of contention for me for use in creating a list,
but ISTM less clear and less efficient if the purpose is _solely_ to
perform a loop:
$ python -m timeit '[i+i for i in xrange(10000)]'
100 loops, best of 3: 1.95 msec per loop
$ python -m timeit 'for i in xrange(10000): i+i'
100 loops, best of 3: 1.38 msec per loop
>> - shadowing builtins is generally a bad idea
> Any shadowing of builtins was unintentional. Did you see specific
> examples? I run the code through pychecker and pylint to try to
> avoid such cases.
`id` is shadowed in a few places.
>>
>> - all NamedList's appearing in the output are converted to dicts--
>> this loses information (in particular, it will be unnecessarily
>> hard for the user to use highlighting/debug data). Using the
>> python/json response format would prevent this. Not returning
>> highlight/debug data in the standard response format (and yet
>> providing said parameters in the query() method) seems odd. Am I
>> missing something? Oh, they are set as dynamic attributes of
>> Response, I see. Definitely needs documentation.
>>
>
> Yes, this needs to be documented. (Please c.f. to my question
> about allowing a complete directory structure.)
>
>> - passing fields='' to query() will return all fields, when the
>> desired return is likely no fields
>>
>
> I've changed the default for fields= to be '*', instead of None or
> "". This way, passing in 'fields=""' will result in 'fl=' being
> passed to the backend. However, I still don't see the point, as
> passing both 'fl=' and 'fl=*' return the exact same set of fields
> (i.e., "all") on my test setup.
Hmm, what if you pass fields='', score=True? Ideally tha would pass
fl=score to the backend, bypassing all stored fields.
>> - it might be better to settle on an api that permits doc/field
>> boosts. How about using a tuple as the field name in the field dict?
>>
>> conn.add_many([{'id': 1, ('field2', 2.33): u"some text"}])
>>
>> doc boosts could be handled by optionally providing the fielddict
>> as a (<fielddict>, boost) tuple.
>>
>
> I agree. I was not aware of field boosts at the time. I'll code
> this change.
Unfortunately, it is still somewhat awkward. In my python client I
end up passing (<name>, <value>, <field boost or None>) everywhere,
but that clutters up the api considerably.
It might be worth taking a look at the ruby client to see what Eric's
done for the api.
>> - for 2.5+, a cool addition might be:
>>
>> if sys.version > 2.5
>> import contextlib def batched(solrconn):
>> solrconn.begin_batch()
>> yield solrconn
>> solrconn.end_batch()
>> batched = contextlib.contextmanager(batched)
>>
>> Use as:
>>
>> with batched(solrconn):
>> solrconn.add(...)
>> solrconn.add(...)
>> solrconn.add(...)
>>
>
> Adding...
Unfortunately, it does push the required python version to 2.4.
Personally, I think that requiring 2.4 is not unreasonable, but I'm
somewhat of a bleeding edge guy...
-Mike