Posted to solr-user@lucene.apache.org by Alan Burlison <Al...@sun.com> on 2006/12/23 01:48:47 UTC

Handling disparate data sources in Solr

Hi,

I'm considering using Solr to replace an existing bare-metal Lucene 
deployment - the current Lucene setup is embedded inside an existing 
monolithic webapp, and I want to factor out the search functionality 
into a separate webapp so it can be reused more easily.

At present the content of the Lucene index comes from many different 
sources (web pages, documents, blog posts etc) and can be different 
formats (plaintext, HTML, PDF etc).  All the various content types are 
rendered to plaintext before being inserted into the Lucene index.

The net result is that the data in one field in the index (say 
"content") may have come from one of a number of source document types. 
  I'm having difficulty understanding how I might map this functionality 
onto Solr.  I understand how (for example) I could use 
HTMLStripStandardTokenizer to insert the contents of an HTML document 
into a field called "content", but (assuming I'd written a PDF analyser) 
how would I insert the content of a PDF document into the same "content" 
field?

I know I could do this by preprocessing the various document types to 
plaintext in the various Solr clients before inserting the data into the 
index, but that means that each client would need to know how to do the 
document transformation.  As well as centralising the index, I also want 
to centralise the handling of the different document types.

Another question:

What do "omitNorms" and "positionIncrementGap" mean in the schema.xml 
file?  The documentation is vague, to say the least, and Google wasn't 
much more helpful.
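
For reference, these show up as attributes on the <fieldtype>
declarations in the example schema.xml, roughly like this:

  <fieldtype name="sint" class="solr.SortableIntField"
    sortMissingLast="true" omitNorms="true"/>
  <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
    ...
  </fieldtype>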

Thanks,

-- 
Alan Burlison
--

Re: detecting duplicates using the field type 'text'

Posted by Chris Hostetter <ho...@fucit.org>.
:  <uniqueKey>id</uniqueKey>
:  <defaultSearchField>document_title</defaultSearchField>
:  <copyField source="document_title" dest="id"/>

whoa... that's a pretty out-there use case ... i don't think i've ever seen
someone use their uniqueKey field as the target of a copyField.

off the top of my head, i suspect maybe the copy field is taking place
after the duplicate detection? ... but i'm not sure...

: When I add a document with a duplicate title (numeric only), it does not
: get duplicated

...and now i'm *really* not sure, that doesn't make much sense to me at
all.

: I can ensure duplicates DO NOT get added when using the field type
: 'string'.

hmm... could you perhaps add the value directly to your "id" field
(string) and then copyField it into document_title?  based on what
you've said, that should work -- although i would agree, what you describe
when using your current schema definitely sounds like a bug.
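
something like this, i mean (an untested sketch, schema trimmed down):

 <fields>
  <field name="id" type="string" indexed="true" stored="true" />
  <field name="document_title" type="text" indexed="true" stored="true" />
 </fields>
 <uniqueKey>id</uniqueKey>
 <copyField source="id" dest="document_title"/>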

it would be great if you could open a Jira issue describing this problem
... it would be even better if after posting the issue you could
make fixing it easier by attaching a test case. :)



-Hoss


detecting duplicates using the field type 'text'

Posted by Ben Incani <be...@datacomit.com.au>.
Hi Solr users,

I have the following fields set in my 'schema.xml'.

*** schema.xml ***
 <fields>
  <field name="id" type="text" indexed="true" stored="true" />
  <field name="document_title" type="text" indexed="true" stored="true"/>
  ...
 </fields>
 <uniqueKey>id</uniqueKey>
 <defaultSearchField>document_title</defaultSearchField>
 <copyField source="document_title" dest="id"/>
*** schema.xml ***

When I add a document with a duplicate title, it gets duplicated (not
sure why)

<add>
<doc>
 <field name="document_title">duplicate</field>
</doc>
<doc>
 <field name="document_title">duplicate</field>
</doc>
</add>

When I add a document with a duplicate title (numeric only), it does not
get duplicated

<add>
<doc>
 <field name="document_title">123</field>
</doc>
<doc>
 <field name="document_title">123</field>
</doc>
</add>

I can ensure duplicates DO NOT get added when using the field type
'string'.
And I can also ensure that they DO get added when using <add
allowDups="true">.

Why is there a disparity in detecting duplicates when using the field type
'text'?

Is this merely a documentation issue or have I missed something here...

Regards,

Ben

Re: Handling disparate data sources in Solr

Posted by Alan Burlison <Al...@sun.com>.
Bertrand Delacretaz wrote:

> My "Subversion and Solr" presentation from the last Cocoon GetTogether
> might give you ideas for how to handle this, see the link at
> http://wiki.apache.org/solr/SolrResources.

Hmm, I'm beginning to think the only way to do this is to write a 
complete custom front-end to Solr - even a custom analyser won't do as 
analyzers only deal with fields, not a full document (e.g. a PDF file).

> Although it does not handle all binary formats out of the box (might
> need to write some java glue code to implement new formats), Cocoon is
> a good tool for transforming various document formats to XML and
> filtering the results to generate the appropriate XML for Solr. I
> wouldn't add functionality to Solr for doing this, it's best to keep
> things loosely-coupled IMHO.

Cocoon?  Thanks for the suggestion, but the last thing I want is yet 
another "Web Framework".  I'm trying to simplify things, not add 90% 
clutter for 10% functionality.

-- 
Alan Burlison
--

Re: Handling disparate data sources in Solr

Posted by Bertrand Delacretaz <bd...@apache.org>.
On 12/23/06, Alan Burlison <Al...@sun.com> wrote:
> ...As well as centralising the index, I also want
> to centralise the handling of the different document types...

My "Subversion and Solr" presentation from the last Cocoon GetTogether
might give you ideas for how to handle this, see the link at
http://wiki.apache.org/solr/SolrResources.

Although it does not handle all binary formats out of the box (might
need to write some java glue code to implement new formats), Cocoon is
a good tool for transforming various document formats to XML and
filtering the results to generate the appropriate XML for Solr. I
wouldn't add functionality to Solr for doing this, it's best to keep
things loosely-coupled IMHO.

-Bertrand

Re: Handling disparate data sources in Solr

Posted by Alan Burlison <Al...@sun.com>.
Chris Hostetter wrote:

> : Why won't cdata work?
> 
> because your binary data might contain the byte sequence 0x5D 0x5D 0x3E --
> indicating the end of the CDATA section. CDATA is short for "Character
> DATA" -- you can't put arbitrary binary data in it (or even arbitrary text)
> and be sure that it will work.

Ok, so I have to escape ]]> - if it occurs - if I do that, why won't it 
work?
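
(i.e. the usual trick of closing and re-opening the CDATA section around
each occurrence - a quick sketch:)

String escapeCdata(String s) {
    // end the current CDATA section after "]]" and start a new one
    // containing ">", so the literal "]]>" survives the round trip
    return s.replace("]]>", "]]]]><![CDATA[>");
}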

> For your purposes, if you've got a system that works and does the Document
> conversion for you, then you are probably right: Solr may not be a useful
> addition to your architecture.  Solr doesn't really attempt to solve the
> problem of parsing different kinds of data streams into a unified Document
> model -- it just tries to expose all of the Lucene goodness through an
> easy to use, easy to configure, HTTP interface.  Besides the
> configuration, Solr's other means of being a value add is in its
> IndexReader management, its caching, and its plugin support for mixing
> and matching request handlers, output writers, and field types as easily
> as you can mix and match Analyzers.

Yes, it's all the crunchy goodness that I'm interested in ;-)

> There has been some discussion about adding plugin support for the
> "update" side of things as well -- at a very simple level this could allow
> for messages to be sent via JSON or CSV instead of just XML -- but
> there's no reason a more complex update plugin couldn't read in a binary PDF
> file and parse it into its appropriate fields ... but we aren't
> quite there yet.  Feel free to bring this up on solr-dev if you'd be
> interested in working on it.

Hmm.  That's a possibility.  It all depends on the time tradeoff between 
fixing what we have already to make it reusable versus extending Solr.

-- 
Alan Burlison
--

Re: Handling disparate data sources in Solr

Posted by Chris Hostetter <ho...@fucit.org>.
: Hmm.  Any idea of how much work this involves?  As I said I can put time
: towards this, but I don't know the innards of Solr as well as you and
: the other folks on this list.

I really can't even guess ... i've never even really looked at the
current update code. :)


-Hoss


Re: Handling disparate data sources in Solr

Posted by Alan Burlison <Al...@sun.com>.
Chris Hostetter wrote:

> : 1. The document is already decomposed into fields before the
> : insert/update, but one or more of the fields requires special handling.
> 
> : 2. The document contains both metadata and content.  PDF is a good
> : example of such a document type.
> 
> there's a third big example: multiple documents are combined into a single
> stream of raw data, and you want Solr to extract the individual documents.
> the simplest example of this case being that you want to point Solr at a
> CSV file where each record is a document.

Or a tar file, or a zip file...  Yes, that definitely seems like 
something that should be covered as well.

> : And for both of these you'd need to be able to specify the mapping
> : between the data/metadata in the source document and the corresponding
> : Solr schema fields.  I'm not sure if you'd want this in the
> : solrconfig.xml file or in the indexing request itself.  Doing it in
> : solrconfig.xml means you could change the disposition of the indexed
> : data without changing the clients submitting the content.
> 
> right ... i think that's something that could be controlled on a per
> "parser" basis, much they way RequestHandlers can currently take in a lot
> of options at request time, but can also have default values (or
> invariant values) specified for those options in the solrconfig when they
> are registered.

Agreed.

> : That was the reasoning behind my initial suggestion:
> :
> : | Extend the <doc> and <field> element with the following attributes:
> 
> Right, i was suggesting we take it to the next level, and allow for
> plugins to handle updates that didn't have to have any XML encapsulation
> at all -- the options and the raw data stream could be expressed entirely
> in the HttpServletRequest for the update .. which would still allow us to
> add the type of syntax you are describing to some new "XmlUpdateSource"
> containing the refactored code which currently parses updates in SolrCore.

Hmm.  Any idea of how much work this involves?  As I said I can put time 
towards this, but I don't know the innards of Solr as well as you and 
the other folks on this list.

-- 
Alan Burlison
--

Re: Handling disparate data sources in Solr

Posted by Chris Hostetter <ho...@fucit.org>.
: There are two cases I can think of:
:
: 1. The document is already decomposed into fields before the
: insert/update, but one or more of the fields requires special handling.

: 2. The document contains both metadata and content.  PDF is a good
: example of such a document type.

there's a third big example: multiple documents are combined into a single
stream of raw data, and you want Solr to extract the individual documents.
the simplest example of this case being that you want to point Solr at a
CSV file where each record is a document.
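
(just to make that concrete: the guts of such a parser -- all names
hypothetical, quoting and escaping ignored -- might look something like
this, with the header row naming the schema fields:)

import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class CsvUpdateSketch {
  // each CSV record after the header row becomes one Document
  void add(Reader src) throws IOException {
    BufferedReader in = new BufferedReader(src);
    String[] names = in.readLine().split(",");
    for (String line; (line = in.readLine()) != null; ) {
      String[] values = line.split(",");
      Document doc = new Document();
      for (int i = 0; i < names.length && i < values.length; i++) {
        doc.add(new Field(names[i], values[i],
                          Field.Store.YES, Field.Index.TOKENIZED));
      }
      // ... hand doc off to the update handler ...
    }
  }
}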

: And for both of these you'd need to be able to specify the mapping
: between the data/metadata in the source document and the corresponding
: Solr schema fields.  I'm not sure if you'd want this in the
: solrconfig.xml file or in the indexing request itself.  Doing it in
: solrconfig.xml means you could change the disposition of the indexed
: data without changing the clients submitting the content.

right ... i think that's something that could be controlled on a per
"parser" basis, much they way RequestHandlers can currently take in a lot
of options at request time, but can also have default values (or
invariant values) specified for those options in the solrconfig when they
are registered.

: That was the reasoning behind my initial suggestion:
:
: | Extend the <doc> and <field> element with the following attributes:

Right, i was suggesting we take it to the next level, and allow for
plugins to handle updates that didn't have to have any XML encapsulation
at all -- the options and the raw data stream could be expressed entirely
in the HttpServletRequest for the update .. which would still allow us to
add the type of syntax you are describing to some new "XmlUpdateSource"
containing the refactored code which currently parses updates in SolrCore.


-Hoss


Re: Handling disparate data sources in Solr

Posted by Alan Burlison <Al...@sun.com>.
Chris Hostetter wrote:

> : The design issue for this is to be clear about the schema and how
> : documents are mapped into the schema. If all document types are
> : mapped into the same schema, then one type of query will work
> : for all. If the documents have different schemas (in the search
> : index), then the query needs an expansion specific to each
> : document type.
> 
> Right, the only way to provide a general purpose solution is to make sure
> any out of the box "UpdateParsers" (using the interface names from my
> previous email) can be configured in the solrconfig.xml to map the native
> concepts in the document format to user defined schema fields.
>
> (people writing their own custom UpdateParsers could always hardcode
> their schema fields)
> 
> I don't know anything about PDF structure

http://en.wikipedia.org/wiki/Extensible_Metadata_Platform
http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf

> but using your RFC-2822 email
> as an example, the configuration for an Rfc2822UpdateParser would need to
> be able to specify which Headers map to which fields, and what to do with
> body text -- in theory, it could also be configured with references to
> other UpdateParser instances for dealing with multi-part mime messages

There are two cases I can think of:

1. The document is already decomposed into fields before the 
insert/update, but one or more of the fields requires special handling. 
For example when indexing source code you could get the author, date, 
revision etc from the SCMS, but you might want to process the code 
itself just to extract identifiers and ignore keywords.  You might want 
different handlers for different languages, but for the resulting tokens 
all to be stored in the same field, irrespective of language.

2. The document contains both metadata and content.  PDF is a good 
example of such a document type.

You therefore need to be able to specify two types of preprocessing - 
either at the whole-document level, or at the individual field level. 
And for both of these you'd need to be able to specify the mapping 
between the data/metadata in the source document and the corresponding 
Solr schema fields.  I'm not sure if you'd want this in the 
solrconfig.xml file or in the indexing request itself.  Doing it in 
solrconfig.xml means you could change the disposition of the indexed 
data without changing the clients submitting the content.

That was the reasoning behind my initial suggestion:

| Extend the <doc> and <field> element with the following attributes:
|
| mime-type Mime type of the document, e.g. application/pdf, text/html
| and so on.
|
| encoding Encoding of the document, with base64 being the standard
| implementation.
|
| href The URL of any documents that can be accessed over HTTP, instead
| of embedding them in the indexing request.  The indexer would fetch
| the document using the specified URL.
|
| There would then be entries in the configuration file that map each
| MIME type to a handler that is capable of dealing with that document
| type.

So for case 1 where the source is locally accessible you might have 
something like this:

<add>
   <doc>
     <field name="author">Alan Burlison</field>
     <field name="revision">1.2</field>
     <field name="date">08-Jan-2007</field>
     <field name="source" mime-type"text/java"
       href="file:///source/org/apache/foo/bar.java">
     </field>
   </doc>
</add>

And for case 2 where the file can't be directly accessed you might have 
something like this:

<add>
   <doc encoding="base64" mime-type"application/pdf">
[base64-encoded version of the PDF file]
   </doc>
</add>

-- 
Alan Burlison
--

Re: Handling disparate data sources in Solr

Posted by Chris Hostetter <ho...@fucit.org>.
: The design issue for this is to be clear about the schema and how
: documents are mapped into the schema. If all document types are
: mapped into the same schema, then one type of query will work
: for all. If the documents have different schemas (in the search
: index), then the query needs an expansion specific to each
: document type.

Right, the only way to provide a general purpose solution is to make sure
any out of the box "UpdateParsers" (using the interface names from my
previous email) can be configured in the solrconfig.xml to map the native
concepts in the document format to user defined schema fields.

(people writing their own custom UpdateParsers could always hardcode
their schema fields)

I don't know anything about PDF structure, but using your RFC-2822 email
as an example, the configuration for an Rfc2822UpdateParser would need to
be able to specify which Headers map to which fields, and what to do with
body text -- in theory, it could also be configured with references to
other UpdateParser instances for dealing with multi-part mime messages

(one other good out of the box UpdateParser that i forgot to mention before
would be an XSLTUpdateParser that could take in XML in any format the user
wanted to send, along with the URL of an XSLT to apply to convert it to
the Solr Standard <add><doc> format)
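
(for a hypothetical input format like
<articles><article id="..."><title>...</title></article></articles>,
the stylesheet could be as small as:)

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- map each article element to a Solr <doc> -->
  <xsl:template match="/articles">
    <add>
      <xsl:for-each select="article">
        <doc>
          <field name="id"><xsl:value-of select="@id"/></field>
          <field name="document_title"><xsl:value-of select="title"/></field>
        </doc>
      </xsl:for-each>
    </add>
  </xsl:template>
</xsl:stylesheet>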


-Hoss


Re: Handling disparate data sources in Solr

Posted by Walter Underwood <wu...@netflix.com>.
On 1/7/07 7:24 AM, "Erik Hatcher" <er...@ehatchersolutions.com> wrote:

> The idea of having Solr handle various document types is a good one,
> for sure.  I'm not sure what specifics would need to be implemented,
> but I at least wanted to reply and say it's a good idea!

The design issue for this is to be clear about the schema and how
documents are mapped into the schema. If all document types are
mapped into the same schema, then one type of query will work
for all. If the documents have different schemas (in the search
index), then the query needs an expansion specific to each
document type.

Example: I have RFC-2822 mail messages with "Subject:" and
HTML with "<title>". If I store those in Solr as subject and
title fields, then each query needs to search both fields.
If I put them both in a "document_title" field, then the
query can search one field.
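
In schema.xml terms that's just a couple of copyField rules, along the
lines of this sketch:

 <field name="document_title" type="text" indexed="true" stored="true"/>
 <copyField source="subject" dest="document_title"/>
 <copyField source="title" dest="document_title"/>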


wunder
-- 
Walter Underwood
Search Guru, Netflix



Re: Handling disparate data sources in Solr

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jan 8, 2007, at 5:45 AM, Alan Burlison wrote:

> Erik Hatcher wrote:
>> The Lucene in Action codebase has a DocumentHandler interface that  
>> could be used for this, which has implementations for Word, PDF,  
>> HTML, RTF, and some others.  It's simplistic, so it might not be  
>> of value specifically.
>
> Do you have a pointer to the code?

Sure... http://www.lucenebook.com and "Download source code".  The  
DocumentHandler is in the lia.handlingtypes.framework package.

	Erik



Re: Handling disparate data sources in Solr

Posted by Alan Burlison <Al...@sun.com>.
Erik Hatcher wrote:

> There really is no question of "if" Solr can be made to handle it. :)  

The "if" was a tuits "if", not a technical "if" ;-)

> POSTing an encoded binary document in XML will work, and it certainly 
> will work to have Solr decode it and parse it.

Yes, but the bits aren't there to do this (yet).  And I didn't want to 
do a one-off hack just for our purposes.

> The Lucene in Action codebase has a DocumentHandler interface that could 
> be used for this, which has implementations for Word, PDF, HTML, RTF, 
> and some others.  It's simplistic, so it might not be of value 
> specifically.

Do you have a pointer to the code?

Thanks,

-- 
Alan Burlison
--

Re: Handling disparate data sources in Solr

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jan 8, 2007, at 4:58 AM, Alan Burlison wrote:
> I'm in the process of evaluating what we are going to do with the  
> search functionality for http://opensolaris.org, and at the moment  
> Solr is my first choice to replace what we already have - *if* it  
> can be made to handle disparate data sources.

There really is no question of "if" Solr can be made to handle  
it. :)  POSTing an encoded binary document in XML will work, and it  
certainly will work to have Solr unencode it and parse it.

The Lucene in Action codebase has a DocumentHandler interface that  
could be used for this, which has implementations for Word, PDF,  
HTML, RTF, and some others.  It's simplistic, so it might not be of  
value specifically.
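
From memory, the interface is roughly this shape (check the downloaded
source for the exact signature):

public interface DocumentHandler {
    // parse one raw input stream into a ready-to-index Lucene Document
    Document getDocument(InputStream is) throws DocumentHandlerException;
}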

	Erik


Re: Handling disparate data sources in Solr

Posted by Alan Burlison <Al...@sun.com>.
Chris Hostetter wrote:

> what do you guys think?

I'm going to spend some time today looking at the Solr source and 
matching your suggestions to it, hopefully I'll be more able to give a 
slightly more considered opinion after that ;-)

I'm in the process of evaluating what we are going to do with the search 
functionality for http://opensolaris.org, and at the moment Solr is my 
first choice to replace what we already have - *if* it can be made to 
handle disparate data sources.

If I do decide that we are going to use Solr, I'll be happy to help add 
whatever extra functionality is needed to satisfy our requirements.  We 
need this fairly quickly, so I should be able to put a significant 
amount of time towards getting it done, once a design is fleshed out. 
I'm not a Solr expert (yet! ;-) so I'm grateful for whatever guidance 
the Solr community can give on how best to go about fulfilling our 
requirements.

I'm also wondering if we could use Solr to back-end the OpenGrok 
(http://www.opensolaris.org/os/project/opengrok/) source code search 
engine that we use on opensolaris.org - having a single search index for 
both site content and code might be useful, not least because we get the 
benefits of Solr the index distribution stuff.  OpenGrok already uses 
Lucene as it's back-end, so it should be possible to do this, although I 
haven't dug through the OG codebase yet.

-- 
Alan Burlison
--

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Ryan McKinley <ry...@gmail.com>.
>
> Is there any reason anyone would want/need to run multiple Indexers?
> This would be the equivalent of running DirectUpdateHandler and
> DirectUpdateHandler2 at the same time.  If not, the Indexer could be
> stored in SolrCore and each plugin could talk to it directly.
>

I just realized that the updateHandler already is in SolrCore....

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Ryan McKinley <ry...@gmail.com>.
>
> What do you folks think?
>

I like it!  I'm currently writing an update plugin that is extended
from SolrRequestHandler because it has the default parameter stuff and
is easily configurable.

Do all UpdatePlugins need to implement: <add> <commit> <delete> etc?

In practice, it seems like most plugins will override <add> (perhaps
commit or commit at the end).  If an UpdatePlugin is extended from
DirectUpdateHandler, does each plugin have its own RAM indexer? (I
hope not)

Perhaps there needs to be an 'Indexer' and an 'UpdatePlugin'.  The
indexer would contain all the indexing functionality, and
UpdatePlugins could all talk to the same indexer.

It would be nice to be able to call:

/update?ut=standard (post body: <add>...)
/update?ut=sql&sql.query=...
/update?ut=xls (post body: an uploaded excel spreadsheet)
/update?ut=standard (post body: <commit/>)

and have everything talking to the same RAM index.

Is there any reason anyone would want/need to run multiple Indexers?
This would be the equivalent of running DirectUpdateHandler and
DirectUpdateHandler2 at the same time.  If not, the Indexer could be
stored in SolrCore and each plugin could talk to it directly.

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Alan Burlison <Al...@sun.com>.
Yonik Seeley wrote:

> Brainstorming:
> - for errors, use HTTP error codes instead of putting it in the XML as now.

That doesn't work so well if there are multiple documents to be indexed 
in a single request.

-- 
Alan Burlison
--

Re: Java version for solr development (was Re: Update Plugins)

Posted by Bill Au <bi...@gmail.com>.
I also think it is too early to move to 1.6.  Only Sun has released their
1.6 JVM.

Bill


On 1/17/07, Bertrand Delacretaz <bd...@apache.org> wrote:
>
> On 1/17/07, Thorsten Scherler <th...@apache.org> wrote:
>
> > ...Should I use 1.6 for a patch or above mentioned libs?...
>
> IMHO moving to 1.6 is way too soon, and if it's only to save two jars
> it's not worth it.
>
> -Bertrand
>

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jan 16, 2007, at 3:20 AM, Bertrand Delacretaz wrote:
> On 1/16/07, Ryan McKinley <ry...@gmail.com> wrote:
>
>> ...I think a DocumentParser registry is a good way to isolate this  
>> top level task...
>
> With all this talk about plugins, registries etc., /me can't help
> thinking that this would be a good time to introduce the Spring IoC
> container to manage this stuff.

+1   that, or HiveMind.  It seems a lot of the wheel is being  
reinvented here, when solid plugin solutions already exist.

	Erik



RE: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Chris Hostetter <ho...@fucit.org>.
: > With all this talk about plugins, registries etc., /me can't help
: > thinking that this would be a good time to introduce the Spring IoC
: > container to manage this stuff.

I don't have a lot of familiarity with Spring except for the XML
configuration file used for telling the Spring context what objects you
want it to create on startup, what constructor args to pass them, and
what methods to call and so on -- with an easy ability to tell it to pass
one object you had it construct as a param to another object you are having
it construct.

on the whole, it seems really nice, and eventually using it to replace a
lot of the home-grown configuration in Solr would probably make a lot of
sense ... but i don't think migrating to Spring is necessary as part of
the current push to support more configurable plugins for updates ... Solr
already has a pretty decent set of utilities for allowing class instances
to be specified in the xml config file and have configuration arguments
passed to them on initialization .. it's not as fancy as Spring and it
doesn't support as many features as Spring, but it works well enough that
it should be easy to use with the new plugins we start to add -- switching
to Spring right now would probably only complicate the issues, and
probably wouldn't make adding Update plugins any easier.

equally important: adding a few new types of plugins now probably won't
make it any harder to switch to something like Spring later ... which as i
said, is something i definitely anticipate happening




-Hoss


RE: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by "Cook, Jeryl" <JC...@innodata-isogen.com>.
Sorry for the "flame" , but I've used spring on 2 large projects and it
worked out great.. you should check out some of the GUIs to help manage
the XML configuration files, if that is reason your team thought it was
a nightmare because of the configuration(we broke ours up to help).. 

Jeryl Cook

-----Original Message-----
From: Alan Burlison [mailto:Alan.Burlison@sun.com] 
Sent: Tuesday, January 16, 2007 10:52 AM
To: solr-dev@lucene.apache.org
Subject: Re: Update Plugins (was Re: Handling disparate data sources in
Solr)

Bertrand Delacretaz wrote:

> With all this talk about plugins, registries etc., /me can't help
> thinking that this would be a good time to introduce the Spring IoC
> container to manage this stuff.
> 
> More info at http://www.springframework.org/docs/reference/beans.html
> for people who are not familiar with it. It's very easy to use for
> simple cases like the ones we're talking about.

Please, no.  I work on a big webapp that uses Spring - it's a complete 
nightmare to figure out what's going on.

-- 
Alan Burlison
--

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Alan Burlison <Al...@sun.com>.
Bertrand Delacretaz wrote:

> With all this talk about plugins, registries etc., /me can't help
> thinking that this would be a good time to introduce the Spring IoC
> container to manage this stuff.
> 
> More info at http://www.springframework.org/docs/reference/beans.html
> for people who are not familiar with it. It's very easy to use for
> simple cases like the ones we're talking about.

Please, no.  I work on a big webapp that uses Spring - it's a complete 
nightmare to figure out what's going on.

-- 
Alan Burlison
--

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Bertrand Delacretaz <bd...@apache.org>.
On 1/16/07, Ryan McKinley <ry...@gmail.com> wrote:

> ...I think a DocumentParser registry is a good way to isolate this top level task...

With all this talk about plugins, registries etc., /me can't help
thinking that this would be a good time to introduce the Spring IoC
container to manage this stuff.

More info at http://www.springframework.org/docs/reference/beans.html
for people who are not familiar with it. It's very easy to use for
simple cases like the ones we're talking about.

-Bertrand

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Ryan McKinley <ry...@gmail.com>.
> kind of like a binary stream equivalent to the way analyzers
> can be customized -- is that kind of what you had in mind?
>

exactly.

>
>   interface SolrDocumentParser {
>     public void init(NamedList args);
>     Document parse(SolrParams p, ContentStream content);
>   }
>
>

yes

Re: Java version for solr development (was Re: Update Plugins)

Posted by Bertrand Delacretaz <bd...@apache.org>.
On 1/17/07, Thorsten Scherler <th...@apache.org> wrote:

> ...Should I use 1.6 for a patch or above mentioned libs?...

IMHO moving to 1.6 is way too soon, and if it's only to save two jars
it's not worth it.

-Bertrand

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jan 17, 2007, at 1:41 AM, Chris Hostetter wrote:
> : The number of people writing update plugins will be small compared to
> : the number of users using the external HTTP API (the URL + query
> : parameters, and the relationship URL-wise between different update
> : formats).  My main concern is making *that* as nice and utilitarian as
> : possible, and any plugin stuff is implementation and a secondary
> : concern IMO.
>
> Agreed, but my point was that we should try to design the internal APIs
> independently from the URL structure ... if we have a set of APIs,
> it's easy to come up with a URL structure that will map well (we could
> theoretically have several URL structures using different servlets) but if
> we worry too much about what the URL should look like, we may hamstring
> the model design.

+1

web.xml allows for servlets to be mapped however desired, and clever
use of servlet filters could add in some other URL mapping goodness,
or in the extreme must-have-certain-URLs case there is always
mod_rewrite.

I still think a microcontainer is a good way to go for Solr.  It's
exactly what microcontainers were designed for.  While not Spring-savvy
myself (but I tinkered with HiveMind via Tapestry a while back),
I know enough to reiterate that it's not heavy or horrible for basic
IoC, which is what is being reinvented in a sense.

	Erik



Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Chris Hostetter <ho...@fucit.org>.
: The number of people writing update plugins will be small compared to
: the number of users using the external HTTP API (the URL + query
: parameters, and the relationship URL-wise between different update
: formats).  My main concern is making *that* as nice and utilitarian as
: possible, and any plugin stuff is implementation and a secondary
: concern IMO.

Agreed, but my point was that we should try to design the internal APIs
independently from the URL structure ... if we have a set of APIs,
it's easy to come up with a URL structure that will map well (we could
theoretically have several URL structures using different servlets) but if
we worry too much about what the URL should look like, we may hamstring
the model design.


-Hoss


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Yonik Seeley <yo...@apache.org>.
On 1/15/07, Chris Hostetter <ho...@fucit.org> wrote:
> : The most important issue is to nail down the external HTTP interface.
>
> I'm not sure if i agree with that statement .. i would think that figuring
> out the "model" or how updates should be handled in a generic way, what
> all of the "Plugin" types are, and what their APIs should be is the most
> important issue -- once we have those issues settled we could allways
> write a new "SolrServlet2" that made the URL structure work anyway we
> want.

The number of people writing update plugins will be small compared to
the number of users using the external HTTP API (the URL + query
parameters, and the relationship URL-wise between different update
formats).  My main concern is making *that* as nice and utilitarian as
possible, and any plugin stuff is implementation and a secondary
concern IMO.

-Yonik

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Ryan McKinley <ry...@gmail.com>.
On 1/16/07, Chris Hostetter <ho...@fucit.org> wrote:
>
> : >I left out "micro-plugins" because i don't quite have a good answer
: >yet :)  This may be a place where a custom dispatcher servlet/filter
> : >defined in web.xml is the most appropriate solution.
> :
> : If the issue is munging HTTPServletRequest information, then a proper
> : separation of concerns suggests responsibility should lie with a Servlet
> : Filter, as Ryan suggests.
>
> I'm not making sense of this ... i don't see how the micro-plugins (aka:
> RequestParsers) could be implemented as Filters and still be plugins that
> users could provide ... don't Filters have to be specified in the web.xml

Yes.  I'm suggesting we map a filter to intercept ALL requests, then
see which ones it should handle.

Consider:

public void doFilter(ServletRequest request, ServletResponse response,
                     FilterChain chain) throws IOException, ServletException
{
  if (request instanceof HttpServletRequest) {
    HttpServletRequest req = (HttpServletRequest) request;
    String path = req.getServletPath();

    // if a handler is registered for this path, the filter owns the request
    SolrRequestHandler handler = core.getRequestHandler(path);
    if (handler != null) {
      // ... handle the request with this SolrRequestHandler ...
      return;
    }
  }

  // Otherwise let the webapp handle the request
  chain.doFilter(request, response);
}
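
(the filter itself would be mapped over everything in web.xml, e.g. --
class name hypothetical:)

<filter>
  <filter-name>SolrRequestFilter</filter-name>
  <filter-class>org.example.SolrRequestFilter</filter-class>
</filter>
<filter-mapping>
  <filter-name>SolrRequestFilter</filter-name>
  <url-pattern>/*</url-pattern>
</filter-mapping>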


> ... is there some programmatic way a Servlet or Filter can register other
> Servlets/Filters dynamically when the application is initialized? ... if
> users have to extract the solr.war and modify the web.xml to add a
> RequestParser they've written, that doesn't seem like much of a plugin :)
>

You would not need to extract the war, just change the registered handler name.


ryan

Re: Java version for solr development (was Re: Update Plugins)

Posted by Walter Underwood <wu...@netflix.com>.
On 1/16/07 8:03 PM, "Yonik Seeley" <yo...@apache.org> wrote:

> I think it's a bit soon to move to 1.6 - I don't know how many
> platforms it's available for yet.

It is still in "early release" from IBM for their PowerPC
servers, so requiring 1.6 would be a serious problem for us.

wunder
-- 
Walter Underwood
Search Guru, Netflix



Re: Java version for solr development (was Re: Update Plugins)

Posted by Yonik Seeley <yo...@apache.org>.
On 1/16/07, Thorsten Scherler <th...@apache.org> wrote:
> I am on 1.5 ATM and using
> |-- stax-1.2.0-dev.jar
> `-- stax-utils.jar

I don't know where those jars are from, but I guess one would need the
stax API jar, and the implementation (woodstox I would think) jar.
That's two jars instead of one, but they could go away with a move to Java6.
The API is likely to have a much longer lifetime too.

> Two more dependencies. Setting min version
>  <!-- Java Version we are compatible with -->
>   <property name="java.compat.version" value="1.6" />
> would get rid of this.
>
> Should I use 1.6 for a patch or above mentioned libs?

I think it's a bit soon to move to 1.6 - I don't know how many
platforms it's available for yet.

-Yonik

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Alan Burlison <Al...@sun.com>.
Ryan McKinley wrote:

> In addition, consider the case where you want to index an SVN
> repository.  Yes, this could be done in a SolrRequestParser that logs in
> and returns the files as a stream iterator.  But this seems like more
> 'work' than the RequestParser is supposed to do.  Not to mention you
> would need to augment the Document with svn-specific attributes.

This is indeed one of the things I'd like to do - use Solr as a back-end
for OpenGrok (http://www.opensolaris.org/os/project/opengrok/)

-- 
Alan Burlison
--


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Alan Burlison <Al...@sun.com>.
Chris Hostetter wrote:

> i'm totally on board now ... the RequestParser decides where the streams
> come from if any (post body, file upload, local file, remote url, etc...);
> the RequestHandler decides what it wants to do with those streams, and has
> a library of DocumentProcessors it can pick from to help it parse them if
> it wants to, then it takes whatever actions it wants, and puts the
> response information in the existing Solr(Query)Response class, which the
> core hands off to any of the various OutputWriters to format according to
> the users wishes.

+1

-- 
Alan Burlison
--

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Chris Hostetter <ho...@fucit.org>.
: > : In addition to RequestProcessors, maybe there should be a general
: > : DocumentProcessor

: > : interface SolrDocumentParser
: > : {
: > :   Document parse(ContentStream content);
: > : }

: > what else would the RequestProcessor do if it was delegating all of the
: > parsing to something else?

: Parsing is just one task that a RequestProcessor may do.  It is the
: entry point for all kinds of stuff: searching, admin tasks, augment
: search results with SQL queries, writing uploaded files to the file
: system.  This is where people will do whatever suits their fancy.

ah ... i see what you mean.  so DocumentProcessors would be reusable
classes that RequestHandlers/RequestProcessors could use to parse streams
-- but instead of needing to hardcode class dependencies in the
RequestHandler on specific DocumentProcessors, the RequestHandler could do
a "lookup" on the mime/type of the stream (or any other key it wanted, i
suppose) to parse the stream ... so you could have a
SimpleHtmlDocumentProcessor that you use, and then one day you replace it
with a ComplexHtmlDocumentProcessor which you probably have to configure a
bit differently but you don't have to recompile your RequestHandler ...
kind of like a binary stream equivalent to the way analyzers
can be customized -- is that kind of what you had in mind?

(i was confused and thinking that picking a DocumentProcessor would be
done by the core independent of picking the RequestHandler --- just like
the OutputWriter is)

: In addition, consider the case where you want to index an SVN
: repository.  Yes, this could be done in a SolrRequestParser that logs in
: and returns the files as a stream iterator.  But this seems like more
: 'work' than the RequestParser is supposed to do.  Not to mention you
: would need to augment the Document with svn-specific attributes.
:
: Parsing a PDF file from svn should (be able to) use the same parser as
: if it were uploaded via HTTP POST.

i'm totally on board now ... the RequestParser decides where the streams
come from if any (post body, file upload, local file, remote url, etc...);
the RequestHandler decides what it wants to do with those streams, and has
a library of DocumentProcessors it can pick from to help it parse them if
it wants to, then it takes whatever actions it wants, and puts the
response information in the existing Solr(Query)Response class, which the
core hands off to any of the various OutputWriters to format according to
the users wishes.

The DocumentProcessors are the ones that are really going to need a lot of
configuration telling them how to map the chunks of data from the stream
to fields in the schema -- but in the same way that OutputWriters get the
request after the RequestHandler has had a chance to wrap the SolrParams,
it probably makes sense to let the request handler override configuration
for the DocumentProcessors as well (so i can say "normally i want the
HtmlDocumentProcessor to map these HTML elements to these schema fields
... but i have one type of HTML doc that breaks the rules, so i'll use a
separate RequestHandler to index them, and it will override some of those
field mappings...")

  interface SolrDocumentParser {
    public void init(NamedList args);
    Document parse(SolrParams p, ContentStream content);
  }
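
(and the registration in solrconfig.xml might look something like this --
element and parameter names purely hypothetical:)

  <documentParser mimeType="text/html" class="org.example.HtmlDocumentParser">
    <lst name="defaults">
      <str name="titleField">document_title</str>
    </lst>
  </documentParser>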



-Hoss


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Ryan McKinley <ry...@gmail.com>.
>
> : In addition to RequestProcessors, maybe there should be a general
> : DocumentProcessor
> :
> : interface SolrDocumentParser
> : {
> :   Document parse(ContentStream content);
> : }
> :
> : solrconfig could register "text/html" -> HtmlDocumentParser, and
> : RequestProcessors could share the same parser.
>
> what else would the RequestProcessor do if it was delegating all of the
> parsing to something else?
>
>

Parsing is just one task that a RequestProcessor may do.  It is the
entry point for all kinds of stuff: searching, admin tasks, augment
search results with SQL queries, writing uploaded files to the file
system.  This is where people will do whatever suits their fancy.

RequestHandler would probably be better named RequestProcessor, but I
think we should choose a name that can live peacefully with existing
RequestHandler code.

I imagine there will be a standard 'Processor' that gets a list of streams
and processes them into Documents.  Since the way these documents are
parsed depends totally on the schema, we will need some way to make
this user-configurable.

In addition, consider the case where you want to index an SVN
repository.  Yes, this could be done in a SolrRequestParser that logs in
and returns the files as a stream iterator.  But this seems like more
'work' than the RequestParser is supposed to do.  Not to mention you
would need to augment the Document with svn-specific attributes.

Parsing a PDF file from svn should (be able to) use the same parser as
if it were uploaded via HTTP POST.

I think a DocumentParser registry is a good way to isolate this top level task.

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Chris Hostetter <ho...@fucit.org>.
: Iterator<ContentStream> getContentStreams();
:
: Consider the case where you iterate through a local file system.

right, a fixed size in-memory array can be iterated, but an unbounded
stream of objects from an external source can't always be read into an
array effectively -- so when in doubt go with the Iterator (or my
favorite:  Iterable)

: In addition to RequestProcessors, maybe there should be a general
: DocumentProcessor
:
: interface SolrDocumentParser
: {
:   Document parse(ContentStream content);
: }
:
: solrconfig could register "text/html" -> HtmlDocumentParser, and
: RequestProcessors could share the same parser.

what else would the RequestProcessor do if it was delegating all of the
parsing to something else?



-Hoss


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Ryan McKinley <ry...@gmail.com>.
>
> interface SolrRequest
> {
>   SolrParams getParams();
>   ContentStream[] getContentStreams(); // Iterator?
>   long getStartTime();
> }
>

correction:  this should be:

Iterator<ContentStream> getContentStreams();

Consider the case where you iterate through a local file system.

----------

In addition to RequestProcessors, maybe there should be a general
DocumentProcessor

interface SolrDocumentParser
{
  Document parse(ContentStream content);
}

solrconfig could register "text/html" -> HtmlDocumentParser, and
RequestProcessors could share the same parser.
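
As a rough illustration, an implementation might look like this --
everything here is a sketch against the hypothetical interfaces above,
with crude tag stripping standing in for a real HTML parser:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class HtmlDocumentParser implements SolrDocumentParser {
  public Document parse(ContentStream content) {
    try {
      // slurp the stream into a string
      StringBuilder sb = new StringBuilder();
      BufferedReader in = new BufferedReader(
          new InputStreamReader(content.getStream()));
      for (String line; (line = in.readLine()) != null; ) {
        sb.append(line).append('\n');
      }
      // strip markup; a real parser would handle entities, scripts etc.
      String text = sb.toString().replaceAll("<[^>]+>", " ");
      Document doc = new Document();
      doc.add(new Field("content", text,
                        Field.Store.YES, Field.Index.TOKENIZED));
      return doc;
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }
}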

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Ryan McKinley <ry...@gmail.com>.
> data and wrote it out in the current update response format .. so the
> current SolrUpdateServlet could be completley replaced with a simple url
> mapping...
>
>    /update --> /select?qt=xmlupdate&wt=legacyxmlupdate
>

Using the filter method above, it could (and i think should) be mapped to:
/update

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by "J.J. Larrea" <jj...@panix.com>.
At 11:48 PM -0800 1/16/07, Chris Hostetter wrote:
>yeah ... once we have a RequestHandler doing that work, and populating a
>SolrQueryResponse with its result info, it
>would probably be pretty trivial to make an extremely bare-bones
>LegacyUpdateOutputWriter that only expected that simple amount of response
>data and wrote it out in the current update response format .. so the
>current SolrUpdateServlet could be completely replaced with a simple url
>mapping...
>
>   /update --> /select?qt=xmlupdate&wt=legacyxmlupdate

Yah!  But in my vision it would be

    /update -> qt=update

because pathInfo is "update".  There's no need to remap anything in the URL, the existing SolrServlet is ready for dispatch once it:
  - Prepares request params into SolrParams
  - Sets params("qt") to pathInfo
  - Somehow (perhaps with StreamIterator) prepares streams for RequestParser use

I'm still trying to conceptually maintain a separation of concerns between handling the details of HTTP (servlet-layer) and handling different payload encodings (a different layer, one I believe can be invoked after config is read).

The following is "vision" more than "proposal" or "suggestion"...

    <requestHandler name="update" class="lets.write.this.UpdateRequestHandler">
	<lst name="invariants">
	    <str name="wt">legacyxml</str>
	</lst>
	<lst name="defaults">
	    <!-- rp matches queryRequestParser -->
	    <str name="rp">xml</str>
	</lst>
    </requestHandler>

    <!-- only if standard responseWriter is not up to the task -->
    <queryResponseWriter name="legacyxml"
	class="do.we.really.need.LegacyUpdateOutputWRiter"/>

    <queryRequestParser name="xml" class="solr.XMLStreamRequestParser"/>

    <queryRequestParser name="json" class="solr.JSONStreamRequestParser"/>

So when incoming URL comes in:

    /update?rp=json

the pipeline which is established is:

    SolrServlet ->
	solr.JSONStreamRequestParser
	    |
	    |- request data carrier e.g. SolrQueryRequest
	    |
	lets.write.this.UpdateRequestHandler
	    |
	    |- response data carrier e.g. SolrQueryResponse
	    |
	do.we.really.need.LegacyUpdateOutputWriter

I expect this is all fairly straightforward, except for one sticky question:

Is there a "universal" format which can efficiently (e.g. lazily, for stream input) convey all kinds of different request body encodings, such that the RequestHandler has no idea how it was dispatched?

Something to think about...

- J.J.

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Chris Hostetter <ho...@fucit.org>.
: > - Revise the XML-based update code (broken out of SolrCore into a
: > RequestHandler) to use all the above.
:
: +++1, that's been needed forever.

yeah ... once we have a RequestHandler doing that work, and populating a
SolrQueryResponse with its result info, it
would probably be pretty trivial to make an extremely bare-bones
LegacyUpdateOutputWriter that only expected that simple amount of response
data and wrote it out in the current update response format .. so the
current SolrUpdateServlet could be completely replaced with a simple url
mapping...

   /update --> /select?qt=xmlupdate&wt=legacyxmlupdate



-Hoss


Java version for solr development (was Re: Update Plugins)

Posted by Thorsten Scherler <th...@apache.org>.
On Tue, 2007-01-16 at 15:49 -0500, Yonik Seeley wrote:
> On 1/16/07, J.J. Larrea <jj...@panix.com> wrote:
> > - Revise the XML-based update code (broken out of SolrCore into a RequestHandler) to use all the above.
> 
> +++1, that's been needed forever.
> If one has the time, I'd also advocate moving to StAX (via woodstox
> for Java5, but it's built into Java6).

I was about to have a look at this. Seeing this comment makes me think.

I am on 1.5 ATM and using 
|-- stax-1.2.0-dev.jar
`-- stax-utils.jar

Two more dependencies. Setting min version 
 <!-- Java Version we are compatible with -->
  <property name="java.compat.version" value="1.6" />
would get rid of this.

Should I use 1.6 for a patch or above mentioned libs?

wdyt?

salu2
-- 
thorsten

"Together we stand, divided we fall!" 
Hey you (Pink Floyd)




Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Yonik Seeley <yo...@apache.org>.
On 1/16/07, J.J. Larrea <jj...@panix.com> wrote:
> - Revise the XML-based update code (broken out of SolrCore into a RequestHandler) to use all the above.

+++1, that's been needed forever.
If one has the time, I'd also advocate moving to StAX (via woodstox
for Java5, but it's built into Java6).

-Yonik

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Yonik Seeley <yo...@apache.org>.
On 1/16/07, J.J. Larrea <jj...@panix.com> wrote:
> >POST:
> > if( multipart ) {
> >  read all form fields into parameter map.
>
> This should use the same req.getParameterMap as for GET, which Servlet 2.4 says is supposed to be handled automatically by the servlet container if the payload is application/x-www-form-urlencoded; in that case the input stream should be null.

Unfortunately, curl puts application/x-www-form-urlencoded in there by
default.  Our current implementation of updates always ignores that
and treats the stream as binary.
An alternative for non-multipart posts could check the URL for args,
and if they are there, treat the body as the input instead of params.

$ curl http://localhost:5000/a/b?foo=bar --data-binary "hi there"

$ nc -l -p 5000
POST /a/b?foo=bar HTTP/1.1
User-Agent: curl/7.15.4 (i686-pc-cygwin) libcurl/7.15.4 OpenSSL/0.9.8d zlib/1.2.3
Host: localhost:5000
Accept: */*
Content-Length: 8
Content-Type: application/x-www-form-urlencoded

hi there

-Yonik

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Chris Hostetter <ho...@fucit.org>.
: >I left out "micro-plugins" because i don't quite have a good answer
: >yet :)  This may be a place where a custom dispatcher servlet/filter
: >defined in web.xml is the most appropriate solution.
:
: If the issue is munging HTTPServletRequest information, then a proper
: separation of concerns suggests responsibility should lie with a Servlet
: Filter, as Ryan suggests.

I'm not making sense of this ... i don't see how the micro-plugins (aka:
RequestParsers) could be implemented as Filters and still be plugins that
users could provide ... don't Filters have to be specified in the web.xml
... is there some programmatic way a Servlet or Filter can register other
Servlets/Filters dynamically when the application is initialized? ... if
users have to extract the solr.war and modify the web.xml to add a
RequestParser they've written, that doesn't seem like much of a plugin :)

In general i'm not too worried about what the URL structure looks like ...
i agree it makes the most sense for the RequestParser to be determined
using the path, but beyond that i don't think it matters much -- the
existing servlet could stay around as is with a hardcoded use of a
"DefaultRequestParser" that doesn't provide any streams and gets the
params from HttpServletRequest while a new Servlet could get the qt and wt
from the path info as well.




-Hoss


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by "J.J. Larrea" <jj...@panix.com>.
I'm in frantic deadline mode so I'm just going to throw in some (hopefully) short comments...

At 11:02 PM -0800 1/15/07, Ryan McKinley wrote:
>>the one thing that still seems missing is those "micro-plugins" i was
>> [SNIP]
>>
>>  interface SolrRequestParser {
>>     SolrRequest process( HttpServletRequest req );
>>  }
>>
>
>
>I left out "micro-plugins" because i don't quite have a good answer
>yet :)  This may be a place where a custom dispatcher servlet/filter
>defined in web.xml is the most appropriate solution.

If the issue is munging HTTPServletRequest information, then a proper separation of concerns suggests responsibility should lie with a Servlet Filter, as Ryan suggests.

For example, while the Servlet 2.4 spec doesn't have specifications for how the servlet container can/should "burst" a multipart-MIME payload into separate files or streams, there are a number of 3rd party Filters which do this.

The Iterator<ContentStream> is a great idea because if each stream is read to completion before the next is opened it doesn't impose any limitation on individual stream length and doesn't require disk buffering.

(Of course some handlers may require access to more than one stream at a time; each time next() is called on the iterator before the current stream is closed, the remainder of that stream will have to be buffered in memory or on disk, depending on the part length.  Nonetheless that detail can be entirely hidden from the handler, as it should be.  I am not sure if any available ServletFilter implementations work this way, but it's certainly doable.)

But that detail is irrelevant for now; as I suggest below, using this API lets one immediately implement it with a single next() value: the entire POST stream.  That would answer the needs of the existing update request handling code, but establish an API to handle multi-part.  Whenever someone wants to write a multi-stream handler, they can write or find a better Iterator<ContentStream> implementation, which would best be cast as a ServletFilter.

>I like the SolrRequestParser suggestion.

Me too.  It answers a hole in my vision for how this can all fit together.

>Consider:
>qt='RequestHandler'
>wt='ResponseWriter'
>rp='RequestParser ' (rb='SolrBuilder'?)
>
>To avoid possible POST read-ahead stream munging: qt, wt, and rp
>should be defined by the URL, not parameters.  (We can add special
>logic to allow /query?qt=xxx)
>
>For qt, I like J.J. Larrea's suggestion on SOLR-104 to let people
>define arbitrary path mapping for qt.
>
>We could append 'wt', 'rb', and arbitrary arbitrary text to the
>registered path, something like
> /registered/path/wt:json/rb:standard/more/stuff/in/the/path?params...
>
>(any other syntax ideas?)

No need for new syntax, I think.  The pathInfo or qt or other source resolves to a requestHandler CONFIG name.  The handler config is read to determine the handler class name.  It also can be consulted (with URL or form-POST params overriding if allowed by the  config) to decide which RequestParser to invoke BEFORE IT IS CALLED and which ResponseWriter to invoke AFTER.  Once those objects are set up, the request body gets executed.

Handler config inheritance (as I proposed in SOLR-104 point #2) would greatly simplify, for example, creating a dozen query handlers which used a particular invariant combination of qt, wt, and rp

>The 'standard' RequestParser would:
>GET:
> fill up SolrParams directly with req.getParameterMap()
>if there is a 'post' parameter (post=XXX)
>  return a stream with XXX as its content
>else
>  empty iterator.
>Perhaps add a standard way to reference a remote URI stream.
>
>POST:
> if( multipart ) {
>  read all form fields into parameter map.

This should use the same req.getParameterMap as for GET, which Servlet 2.4 says is supposed to be handled automatically by the servlet container if the payload is application/x-www-form-urlencoded; in that case the input stream should be null.

>  return an iterator over the collection of files

Collection of streams, per Hoss.

>}
>else {
>  no parameters? parse parameters from the URL? /name:value/
>  return the body stream

As above, this introduces unneeded complexity and should be avoided.

>}
>DEL:
> throw unsupported exception?
>
>
>Maybe each RequestHandler could have a default RequestParser.  If we
>limited the 'arbitrary path' to one level, this could be used to
>generate more RESTful URLs. Consider:
>
>/myadder/1111/2222/3333/
>
>/myadder maps to MyCustomHandler and that gives you
>MyCustomRequestBuilder that maps /1111/2222/3333 to SolrParams


I think these are best left for an extra-SOLR layer, especially since SOLR URLs are meant for interprogram communication and not direct use by non-developer end users.  For example, for my org's website I have hundreds of Apache mod_rewrite rules which do URL munging such as
	/journals/abc/7/3/192a.pdf
into
	/journalroot/index.cfm?journal=abc&volume=7&issue=3
		&page=192&seq=a&format=pdf

Or someone could custom-code a subclass of SolrServlet which handles application-specific URL requirements.  But the base implementation should be as simple as possible - or perhaps more accurately, complex only where complexity is really called for: query caching, faceting, and the like.

>>one last thought: while the interfaces you outlined would make a lot
>>of sense if we were starting from scratch, there are probably several
>>cases where not having those exact names/APIs doesn't really hurt, and
>>would allow backwards compatibility with more of the current code (and
>>current SolrRequestHandler plugin people have written) ... just something
>>we should keep in mind: we don't want to go hog wild renaming a lot of
>>stuff and alienating our existing "plugin" user base. (nor do we want to
>>make a bunch of unnecessary config file format changes)
>>
>
>I totally understand and agree.
>
>Perhaps the best approach is to offer a SolrRequestProcessor framework
>that can sit next to the existing SolrRequestHandler without affecting
>it much (if at all).  For what i have suggested, i *think* it could
>all be done with simple additions to solrschema.xml that would still
>work on an unedited 1.1.0 solrconfig.xml

Yah

>
>If we use a servletFilter for the dispatcher, this can sit next to the
>current /query?xxx servlet without problem.  When the
>SolrRequestProcessor framework is rock solid, we would @Deprecate
>SolrRequestHandler and change the default solrconfig.xml to map /query
>to the new framework.

I think there are some small to moderate steps which should be done within the SOLR 1.x (x < 5) framework, where one wants to be non-API-breaking as much as possible, and some bigger improvements which should be contemplated for an API-incompatible release.

For the short term, I think the following make sense:

- Refactor the request handlers (based in part on what Ryan has already done in SOLR-102/20) to unify query and update requests.

- Move scattered functionality to UpdateCommand and xxxUpdateCommand implementations per my comments on SOLR-104 (point #3)

- It would make sense to establish the Iterator<ContentStream> API now, but initially only provide a single stream: the raw POST body when not form-urlencoded.

- Revise the XML-based update code (broken out of SolrCore into a RequestHandler) to use all the above.

- To keep existing configs unmodified, we would want to continue to use qt as the arg even for update.

- Add handler config inheritance and structure the sample solrconfig to suggest structuring handlers as a hierarchy.

- Change the servlet (or add a filter) to use pathInfo to select the handler. For the moment, the servlet could simply take any pathInfo and store it in qt.  So
	/<base>?qt=standard
  and	/<base>/standard
are identical, and
	/<base>/query/products/instock would be the same as
	/<base>?qt=query/products/instock
and match a request handler name="query/products/instock"

- Provide config examples which exercise the new mechanisms.

This would have immediate benefits for plugin writers (both query and update), config file writers (shorter files, request namespace structuring), command issuers (having the request part of the pathInfo rather than qt seems to make sense to everyone), and so forth.

With that done, as a second step, having Request Parsing plugins makes great sense.  I'll save further commentary on that for another email.

- J.J.

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Ryan McKinley <ry...@gmail.com>.
>
> the one thing that still seems missing is those "micro-plugins" i was
>  [SNIP]
>
>   interface SolrRequestParser {
>      SolrRequest process( HttpServletRequest req );
>   }
>


I left out "micro-plugins" because i don't quite have a good answer
yet :)  This may be a place where a custom dispatcher servlet/filter
defined in web.xml is the most appropriate solution.

I like the SolrRequestParser suggestion.

Consider:
qt='RequestHandler'
wt='ResponseWriter'
rp='RequestParser ' (rb='SolrBuilder'?)

To avoid possible POST read-ahead stream munging: qt, wt, and rp
should be defined by the URL, not parameters.  (We can add special
logic to allow /query?qt=xxx)

For qt, I like J.J. Larrea's suggestion on SOLR-104 to let people
define arbitrary path mapping for qt.

We could append 'wt', 'rb', and arbitrary text to the
registered path, something like
  /registered/path/wt:json/rb:standard/more/stuff/in/the/path?params...

(any other syntax ideas?)


The 'standard' RequestParser would:
GET:
  fill up SolrParams directly with req.getParameterMap()
 if there is a 'post' parameter (post=XXX)
   return a stream with XXX as its content
 else
   empty iterator.
 Perhaps add a standard way to reference a remote URI stream.

POST:
  if( multipart ) {
   read all form fields into parameter map.
   return an iterator over the collection of files
 }
 else {
   no parameters? parse parameters from the URL? /name:value/
   return the body stream
 }
DEL:
  throw unsupported exception?
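
A rough Java rendering of that standard parser (a sketch only: the SolrRequest/ContentStream types are the interfaces proposed earlier in this thread, and the abstract makeStream/parseMultipart/makeRequest helpers are hypothetical names):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.servlet.http.HttpServletRequest;

public abstract class StandardRequestParser {

  // hypothetical helpers, left abstract in this sketch
  protected abstract ContentStream makeStream(String name, String type, Object content);
  protected abstract void parseMultipart(HttpServletRequest req, Map params, List streams) throws Exception;
  protected abstract SolrRequest makeRequest(Map params, List streams);

  public SolrRequest parse(HttpServletRequest req) throws Exception {
    Map params = new HashMap(req.getParameterMap());
    List streams = new ArrayList();  // of ContentStream

    String method = req.getMethod();
    if ("GET".equals(method)) {
      String body = req.getParameter("post");
      if (body != null) {
        // treat the value of the 'post' parameter itself as the content
        streams.add(makeStream("post", "text/plain", body));
      }
    }
    else if ("POST".equals(method)) {
      String ct = req.getContentType();
      if (ct != null && ct.startsWith("multipart/form-data")) {
        // read form fields into params; each uploaded file becomes a stream
        parseMultipart(req, params, streams);
      }
      else {
        // the raw post body is the single stream
        streams.add(makeStream("body", ct, req.getInputStream()));
      }
    }
    else {
      throw new UnsupportedOperationException(method);  // e.g. DELETE
    }
    return makeRequest(params, streams);
  }
}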


Maybe each RequestHandler could have a default RequestParser.  If we
limited the 'arbitrary path' to one level, this could be used to
generate more RESTful URLs. Consider:

/myadder/1111/2222/3333/

/myadder maps to MyCustomHandler and that gives you
MyCustomRequestBuilder that maps /1111/2222/3333 to SolrParams


> :
> : Thoughts?
>
> one last thought: while the interfaces you outlined would make a lot
> of sense if we were starting from scratch, there are probably several
> cases where not having those exact names/APIs doesn't really hurt, and
> would allow backwards compatibility with more of the current code (and
> current SolrRequestHandler plugin people have written) ... just something
> we should keep in mind: we don't want to go hog wild renaming a lot of
> stuff and alienating our existing "plugin" user base. (nor do we want to
> make a bunch of unnecessary config file format changes)
>

I totally understand and agree.

Perhaps the best approach is to offer a SolrRequestProcessor framework
that can sit next to the existing SolrRequestHandler without affecting
it much (if at all).  For what i have suggested, i *think* it could
all be done with simple additions to solrschema.xml that would still
work on an unedited 1.1.0 solrconfig.xml

If we use a servletFilter for the dispatcher, this can sit next to the
current /query?xxx servlet without problem.  When the
SolrRequestProcessor framework is rock solid, we would @Deprecate
SolrRequestHandler and change the default solrconfig.xml to map /query
to the new framework.

The stuff I *DO* think should get refactored/deprecated ASAP is to
extract the constants from the functionality in SolrParams.  While we
are at it, it may be good to restructure the code to something like:
  http://issues.apache.org/jira/browse/SOLR-20#action_12464648


ryan

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Chris Hostetter <ho...@fucit.org>.
: > (the trick being that the servlet would need to parse the "st" info out
: > of the URL (either from the path or from the QueryString) directly without
: > using any of the HttpServletRequest.getParameter*() methods...
:
: I haven't followed all of the discussion, but wouldn't it be easier to
: use the request path, instead of parameters, to select these
: RequestParsers?

absolutely (hence my comment "either from the path or from the
QueryString") ... my point is just that if we go this route, any servlets
Solr has (there's no reason we can't have several -- changing the URL
structure can be orthogonal to adding update plugins) have to be careful
about dealing with the request to determine the plugin to use.




-Hoss


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Bertrand Delacretaz <bd...@apache.org>.
On 1/16/07, Chris Hostetter <ho...@fucit.org> wrote:

> ....  interface SolrRequestParser {
>      SolrRequest process( HttpServletRequest req );
>   }
>
> (the trick being that the servlet would need to parse the "st" info out
> of the URL (either from the path or from the QueryString) directly without
> using any of the HttpServletRequest.getParameter*() methods...

I haven't followed all of the discussion, but wouldn't it be easier to
use the request path, instead of parameters, to select these
RequestParsers?

i.e. solr/update/pdf-parser, solr/update/hssf-parser,
solr/update/my-custom-parser, etc.

-Bertrand

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Chris Hostetter <ho...@fucit.org>.
:
: I hate to inundate you with more code, but it seems like the best way
: to describe a possible interface.

...

the one thing that still seems missing is those "micro-plugins" i was
talking about that can act independent of the SolrRequestProcessor used to
decide where the data streams come from.  if you consider the current
query request handling model, there's "qt" that picks a SolrRequestHandler
(what you've called SolrRequestProcessor), and "wt" which independently
determines the QueryResponseWriter (aka: SolrResponseWriter) .. i think we
need an "st" (stream type) that the servlet uses to pick a
"SolrRequestParser" to decide how to generate the SolrRequest and it's
underlying ContentStreams....

  interface SolrRequestParser {
     SolrRequest process( HttpServletRequest req );
  }

(the trick being that the servlet would need to parse the "st" info out
of the URL (either from the path or from the QueryString) directly without
using any of the HttpServletRequest.getParameter*() methods which might
"read ahead" into the ServletInputStream)

: interface SolrRequest
: {
:   SolrParams getParams();
:   ContentStream[] getContentStreams(); // Iterator?
:   long getStartTime();
: }

I'm not understanding why that wouldn't make sense as an
Iterable<ContentStream> ... then it could be an array if the
SolrRequestParser wanted, or it could be something more lazy-loaded.

:
: Thoughts?

one last thought: while the interfaces you outlined would make a lot
of sense if we were starting from scratch, there are probably several
cases where not having those exact names/APIs doesn't really hurt, and
would allow backwards compatibility with more of the current code (and
current SolrRequestHandler plugin people have written) ... just something
we should keep in mind: we don't want to go hog wild renaming a lot of
stuff and alienating our existing "plugin" user base. (nor do we want to
make a bunch of unnecessary config file format changes)



-Hoss


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Ryan McKinley <ry...@gmail.com>.
On 1/15/07, Chris Hostetter <ho...@fucit.org> wrote:
>
> : The most important issue is to nail down the external HTTP interface.
>
> I'm not sure if i agree with that statement .. i would think that figuring
> out the "model" or how updates should be handled in a generic way, what
> all of the "Plugin" types are, and what their APIs should be is the most
> important issue -- once we have those issues settled we could always
> write a new "SolrServlet2" that made the URL structure work any way we
> want.
>
>
>
> -Hoss
>

I hate to inundate you with more code, but it seems like the best way
to describe a possible interface.

//-----------------------------------------------

interface ContentStream
{
  String getName();
  String getContentType();
  InputStream getStream();
}

interface SolrParams
{
  String getParam( String name );
  String[] getParams( String name );
}

//-----------------------------

interface SolrRequest
{
  SolrParams getParams();
  ContentStream[] getContentStreams(); // Iterator?
  long getStartTime();
}

interface SolrResponse
{
  int getStatus(); // ???
  NamedList getProps(); // ???
}

//-----------------------------

interface SolrRequestProcessor
{
  SolrResponse process( SolrRequest req );
  SolrResponseWriter getWriter( SolrRequest req ); // default
}

interface SolrResponseWriter
{
  void write(Writer writer, SolrRequest request, SolrResponse response);
  String getContentType(SolrRequest request, SolrResponse response);
}

//-----------------------------

Then a servlet (or filter) could be in charge of parsing URL/params
into a request.  It would pick a Processor and send the output to a
writer.  If someone wanted a custom URL scheme, they would override the
servlet/filter.
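
To make the dispatch flow concrete, the core of such a servlet's service method might look like this (a sketch; parseRequest, resolveProcessor and resolveWriter are hypothetical helpers standing in for whatever URL/param scheme gets chosen):

// Sketch of the dispatch flow using the interfaces above.
public void service(HttpServletRequest req, HttpServletResponse res)
    throws java.io.IOException {
  SolrRequest solrReq = parseRequest(req);            // URL/params -> SolrRequest
  SolrRequestProcessor proc = resolveProcessor(req);  // e.g. from the path or 'qt'
  SolrResponse solrRes = proc.process(solrReq);

  SolrResponseWriter writer = resolveWriter(req);     // e.g. from 'wt'
  if (writer == null) {
    writer = proc.getWriter(solrReq);                 // fall back to the default
  }
  res.setContentType(writer.getContentType(solrReq, solrRes));
  writer.write(res.getWriter(), solrReq, solrRes);
}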

Perhaps SolrRequest should have an object for solrCore.  It would be
better if it does not need to go to the static
SolrCore.getUpdateHandler().

I am proposing ContentStream[] getContentStreams() because it would be
simpler than an iterator.  In the case of multipart upload, if you
offered an API closer to:
http://jakarta.apache.org/commons/fileupload/streaming.html
you would not have any parameters until after you read each Item and
convert the form fields to parameters.
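
For reference, that streaming API looks roughly like this (based on the commons-fileupload streaming docs linked above; 'params' is an assumed parameter map, and exact signatures should be checked against the library):

import java.io.InputStream;
import org.apache.commons.fileupload.FileItemIterator;
import org.apache.commons.fileupload.FileItemStream;
import org.apache.commons.fileupload.servlet.ServletFileUpload;
import org.apache.commons.fileupload.util.Streams;

// Form fields only become available as the item iterator advances,
// so parameters and file streams arrive interleaved.
ServletFileUpload upload = new ServletFileUpload();
FileItemIterator iter = upload.getItemIterator(request);
while (iter.hasNext()) {
  FileItemStream item = iter.next();
  if (item.isFormField()) {
    // a form field -- becomes a parameter, but only once we reach it
    params.put(item.getFieldName(), Streams.asString(item.openStream()));
  } else {
    // an uploaded file -- would become one ContentStream, and must be
    // consumed before advancing the iterator
    InputStream stream = item.openStream();
  }
}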

Thoughts?

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Chris Hostetter <ho...@fucit.org>.
: The most important issue is to nail down the external HTTP interface.

I'm not sure if i agree with that statement .. i would think that figuring
out the "model" or how updates should be handled in a generic way, what
all of the "Plugin" types are, and what their APIs should be is the most
important issue -- once we have those issues settled we could always
write a new "SolrServlet2" that made the URL structure work any way we
want.



-Hoss


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Ryan McKinley <ry...@gmail.com>.
>
> Cool.  I think i need more examples... concrete is good :-)
>
> I don't quite grok your format below... is it one line or two?
> /path/defined/in/solrconfig:parser?params
> /${handler}:${parser}
>
> Is that simply
>
> /${handler}:${parser}?params
>

yes.  the ${} is just to show what is extracted from the request URI,
not a specific example

Imagine you have a CsvUpdateHandler defined in solrconfig.xml with a
"name"="my/update/csv".

The standard RequestParser could extract the parameters and
Iterable<ContentStream> for each of the following requests:

POST: /my/update/csv/?separator=,&fields=foo,bar,baz
(body) "10,20,30"

POST:/my/update/csv/
multipart post with 5 files and 6 form fields
(unlike the previous example, the handler would get 5 input streams
rather than 1)

GET: /my/update/csv/?post.remoteURL=http://..&separator=,&fields=foo,bar,baz&...
fill the stream with the content from a remote URL

GET: /my/update/csv/?post.body=bodycontent,&fields=foo,bar,baz&...
use 'bodycontent' as the input stream.  (note, this does not make much
sense for csv, but is a useful example)

POST: /my/update/csv:remoteurls/?separator=,&fields=foo,bar,baz
(body) http://url1,http://url2,http:/url3...
In this case we would use a custom RequestParser ("remoteurls") that
would read the post body and convert it to a stream of content urls.

- - - - - - -

The URL path (everything before the ':') would be entirely defined and
configured by solrconfig.xml.  A filter would see if the request path
matches a registered handler - if not it will pass it up the filter
chain.  This would allow custom filters and servlets to co-exist in
the top level URL path.  Consider:

solrconfig.xml
  <handler name="delete" class="DeleteHandler" />

web.xml:
  <servlet-mapping>
    <servlet-name>MyRestfulDelete</servlet-name>
    <url-pattern>/mydelete/*</url-pattern>
  </servlet-mapping>

POST: /delete?id=AAA   would be sent to DeleteHandler
POST: /mydelete/AAA/ would be sent to MyRestfulDelete

Alternativly, you could have:


solrconfig.xml
  <handler name="standard/delete" class="DeleteHandler" />

web.xml:
  <servlet-mapping>
    <servlet-name>MyRestfulDelete</servlet-name>
    <url-pattern>/delete/*</url-pattern>
  </servlet-mapping>

POST: /standard/delete?id=AAA   would be sent to DeleteHandler
POST: /delete/AAA/ would be sent to MyRestfulDelete

I am suggesting we do not try to have the default request servlet/filter
support extracting parameters from the URL.  I think this is a
reasonable tradeoff to be able to have the request path easily user
configurable using the *existing* plugin configuration.

- - - - - - - -

In a previous email, you mentioned changing the URL structure.  With
this proposal, we would continue to support:
/select?wt=XXX

for the Csv example, you would also be able to call:
GET: /select?qt=/my/update/csv/&post.remoteURL=http://..&sepa...

ryan

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Yonik Seeley <yo...@apache.org>.
On 1/18/07, Ryan McKinley <ry...@gmail.com> wrote:
> On 1/18/07, Yonik Seeley <yo...@apache.org> wrote:
> > On 1/18/07, Ryan McKinley <ry...@gmail.com> wrote:
> > > Yes, this proposal would fix the URL structure to be
> > > /path/defined/in/solrconfig:parser?params
> > > /${handler}:${parser}
> > >
> > > I *think* this handles most cases cleanly and simply.  The
> > > only exception is where you want to extract variables from the URL
> > > path.
> >
> > But that's not a hypothetical case, extracting variables from the URL
> > path is something I need now (to add metadata about the data in the
> > raw post body, like the CSV separator).
> >
> > POST to http://localhost:8983/solr/csv?separator=,&fields=foo,bar,baz
> > with a body of "10,20,30"
> >
>
> Sorry, by "in the URL" I mean "in the URL path." The RequestParser can
> extract whatever it likes from getQueryString()
>
> The url you list above could absolutely be handled with the proposed
> format.

Cool.  I think i need more examples... concrete is good :-)

I don't quite grok your format below... is it one line or two?
/path/defined/in/solrconfig:parser?params
/${handler}:${parser}

Is that simply

/${handler}:${parser}?params

Or is it all one line where you actually have params twice?

-Yonik

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Ryan McKinley <ry...@gmail.com>.
On 1/18/07, Yonik Seeley <yo...@apache.org> wrote:
> On 1/18/07, Ryan McKinley <ry...@gmail.com> wrote:
> > Yes, this proposal would fix the URL structure to be
> > /path/defined/in/solrconfig:parser?params
> > /${handler}:${parser}
> >
> > I *think* this handles most cases cleanly and simply.  The
> > only exception is where you want to extract variables from the URL
> > path.
>
> But that's not a hypothetical case, extracting variables from the URL
> path is something I need now (to add metadata about the data in the
> raw post body, like the CSV separator).
>
> POST to http://localhost:8983/solr/csv?separator=,&fields=foo,bar,baz
> with a body of "10,20,30"
>

Sorry, by "in the URL" I mean "in the URL path." The RequestParser can
extract whatever it likes from getQueryString()

The url you list above could absolutely be handled with the proposed
format.  The thing that could not be handled is:
http://localhost:8983/solr/csv/foo/bar/baz/
with body "10,20,30"


> > There are plenty of ways to rewrite RESTful urls into a
> > path+params structure.  If someone absolutely needs RESTful urls, it
> > can easily be implemented with a new Filter/Servlet that picks the
> > 'handler' and directly creates a SolrRequest from the URL path.
>
> While being able to customize something is good, having really good
> defaults is better IMO :-)  We should also be focused on exactly what
> we want our standard update URLs to look like in parallel with the
> design of how to support them.
>

again, i totally agree.  My point is that I don't think we need to
make the dispatch filter handle *all* possible ways someone may want
to structure their request.  It should offer the best defaults
possible.  If that is not sufficient, someone can extend it.

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Chris Hostetter <ho...@fucit.org>.
: > This would give people a relatively easy way to implement 'restful'
: > URLs if they need to.  (but they would have to edit web.xml)
:
: A handler could alternately get the rest of the path (absent params), right?

only if the RequestParser adds it to the SolrRequest as a SolrParam.

: > Unit tests should be handled by execute( handler, req, res )
:
: How does the unit test get the handler?

i think ryans point is that when testing a handler, you should know which
handler you are testing, so construct it and execute it directly.

: > I am proposing we have a single interface to do this:
: >   SolrRequest r = RequestParser.parse( HttpServletRequest  )
:
: That's currently what new SolrServletRequest(HttpServletRequest) does.
: We just need to figure out how to get InputStreams, Readers, etc.

we start by adding "Iterable<ContentStream> getStreams()" to the
SolrRequest interface, with a setter on all of the Impls that's not part
of the interface.  then i suspect what we'll see is two classes that look
like this..

  public class NoStreamRequestParser implements RequestParser {
    public SolrRequest parse(HttpServletRequest req) {
      return new SolrServletRequest(req);
    }
  }
  public class RawPostStreamRequestParser extends NoStreamRequestParser {
    public SolrRequest parse(HttpServletRequest req) {
      // wrap the raw POST body (makeContentStream and SingleItemCollection
      // are hypothetical helpers in this sketch)
      ContentStream c = makeContentStream(req.getInputStream());
      SolrServletRequest s = (SolrServletRequest) super.parse(req);
      s.setStreams(new SingleItemCollection(c));
      return s;
    }
  }

: So, the hander needs to be able to get an InputStream, and HTTP headers.
: Other plugins (CSV) will ask for a Reader and expect the details to be
: ironed out for it.
:
: Method1: come up with ways to expose all this info through an
: interface... a "headers" object could be added to the SolrRequest
: context (see getContext())

this is why Ryan and i have been talking in terms of a "ContentStream"
interface instead of just "InputStream" .. at some point we talked about
the ContentStream having getters for mime type, and charset that might be
null if unknown.
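
i.e. something roughly like this (a sketch of that earlier idea; getCharset() is the hypothetical addition):

interface ContentStream {
  String getName();          // e.g. form field name or file name
  String getContentType();   // mime type, or null if unknown
  String getCharset();       // null if unknown
  InputStream getStream();
}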


-Hoss


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Ryan McKinley <ry...@gmail.com>.
On 1/19/07, Chris Hostetter <ho...@fucit.org> wrote:
>
> : First Ryan, thank you for your patience on this *very* long hash
>
> I could not agree more ... as i was leaving work this afternoon, it
> occurred to me "I really hope Ryan realizes i like all of his ideas, i'm
> just wondering if they can be better" -- most people I work with don't
> have the stamina to deal with my design reviews :)
>

Thank you both!  This is the first time I've taken the time and effort
to contribute to an open source project.  I'm learning the
pace/etiquette etc as I go along :)   Honestly your critique is
refreshing - I'm used to working alone or directing others.

I *think* we are close to something we will all be happy with.

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Chris Hostetter <ho...@fucit.org>.
: First Ryan, thank you for your patience on this *very* long hash

I could not agree more ... as i was leaving work this afternoon, it
occurred to me "I really hope Ryan realizes i like all of his ideas, i'm
just wondering if they can be better" -- most people I work with don't
have the stamina to deal with my design reviews :)

What occurred to me as i was *getting* home was that since I seem to be the
only one that's (overly) worried about the RequestParser/HTTP abstraction
-- and since i haven't managed to convince Ryan after all of my badgering
-- it's probably just me being paranoid.

I think in general, the approach you've outlined should work great -- i'll
reply to some of your more recent comments directly.



-Hoss


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Yonik Seeley <yo...@apache.org>.
First Ryan, thank you for your patience on this *very* long hash
session.  Most wouldn't last that long unless it were a flame war ;-)
And thanks to Hoss, who seems to have the highest read+response
bandwidth of anyone I've ever seen (I'll admit I've only been
selectively reading this thread, with good intentions of coming back
to it).

On 1/19/07, Ryan McKinley <ry...@gmail.com> wrote:
> It would not be the most 'pluggable' of plugins, but I am still having
> trouble imagining anything beyond a single default RequestParser.
> Assuming anything doing *really* complex ways of extracting
> ContentStreams will do it in the Handler, not the request parser.

Agreed... streams not covered by the default parser will most easily be
handled by the custom handler opening them itself.

> This would give people a relatively easy way to implement 'restful'
> URLs if they need to.  (but they would have to edit web.xml)

A handler could alternately get the rest of the path (absent params), right?

> Correct, SolrCore should not care what the request path is.  That is
> why I want to deprecate the execute( ) function that assumes the
> handler is defined by 'qt'
>
> Unit tests should be handled by execute( handler, req, res )

How does the unit test get the handler?

> If I had my druthers, It would be:
>   res = handler.execute( req )
> but that is too big of a leap for now :)

Yep... esp since the response writers now need the request for
parameters, for the searcher (streaming docs, etc).

> You guys made a lot of good
> choices and solr is an amazing platform for it.

I just wish I had known Lucene when I *started* Sol(a)r ;-)

> I am proposing we have a single interface to do this:
>   SolrRequest r = RequestParser.parse( HttpServletRequest  )

That's currently what new SolrServletRequest(HttpServletRequest) does.
We just need to figure out how to get InputStreams, Readers, etc.

> I agree.  This is why i suggest the RequestParsers is not a core part
> of the API, just a helper class for Servlets and Filters.

Sounds good as a practical starting point to me.  If we need more
in the future, we can add it then.

USECASE: The XML update plugin using the woodstox XML parser:
Woodstox docs say to give the parser an InputStream (with char
encoding, if available) for best performance.  This is also preferable
since if the charset isn't specified, the parser can try to snoop it from
the stream.

So, the handler needs to be able to get an InputStream, and HTTP headers.
Other plugins (CSV) will ask for a Reader and expect the details to be
ironed out for it.

Method1: come up with ways to expose all this info through an
interface... a "headers" object could be added to the SolrRequest
context (see getContext())
Method2: consider it a more special case, have an XML update servlet
that puts that info into the SolrRequest (perhaps via the context
again)
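
Method1 could be as small as this (a sketch, assuming SolrRequest exposes a mutable context Map the way SolrQueryRequest.getContext() does today; the "httpHeaders" key is made up here):

// Stash the HTTP headers in the request context so a handler can get
// at them without a servlet dependency.
java.util.Map headers = new java.util.HashMap();
java.util.Enumeration names = httpReq.getHeaderNames();
while (names.hasMoreElements()) {
  String name = (String) names.nextElement();
  headers.put(name, httpReq.getHeader(name));
}
solrReq.getContext().put("httpHeaders", headers);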

-Yonik

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Ryan McKinley <ry...@gmail.com>.
> >
> > I'm not sure what "it" is in the above sentence ... i believe from the
> > context of the rest of the message you are referring to
> > using a ServletFilter instead of a Servlet -- i honestly have no opinion
> > about that either way.
>
> I thought a filter required you to open up the WAR file and change
> web.xml, or am I misunderstanding?
>

If your question is do you need to edit web.xml to change the URL it
will apply to, my suggestion is to map /* to the DispatchFilter and
have it decide whether or not to handle the requests.  With a filter,
you can handle the request directly or pass it up the chain.  This
would allow us to have the URL structures defined by solrconfig.xml
(without a need to edit web.xml)
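
The heart of that filter would be something like this (a sketch; getRequestHandler stands in for the solrconfig.xml lookup and execute for the actual request handling):

import java.io.IOException;
import javax.servlet.*;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Handle the request if its path matches a registered handler,
// otherwise pass it up the chain untouched.
public void doFilter(ServletRequest request, ServletResponse response,
                     FilterChain chain) throws IOException, ServletException {
  HttpServletRequest req = (HttpServletRequest) request;
  String path = req.getServletPath();            // e.g. "/my/update/csv"
  SolrRequestHandler handler = getRequestHandler(path);
  if (handler != null) {
    execute(handler, req, (HttpServletResponse) response);
  } else {
    chain.doFilter(request, response);           // not ours -- pass it along
  }
}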

If your question is about configuring the RequestParser,  Yes, you
would need to edit web.xml

My (our?) reasons for suggesting this are
1) I think we only have one RequestParser that will handle all normal
requests.  Unless you have extremely specialized needs, this is not
something you would change.
2) Since the RequestParser is tied so closely to HttpServletRequest
and your desired URL structure, it seems appropriate to configure it
in web.xml.  A RequestParser is just a utility class for
servlets/filters
3) We don't want to add RequestParser to 'core' unless it really needs
to be a pluggable interface.  I don't see the need for it just yet.

ryan

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Chris Hostetter <ho...@fucit.org>.
: To be clear, (with the current implementation in SOLR-104) you would
: have to put this in your solrconfig.xml
:
: <requestHandler name="/select" class="solr.StandardRequestHandler">
:
: Notice the preceding '/'.  I think this is a strong indication that
: someone *wants* /select to behave distinctly.

crap ... i totally misread that ... so if people have a requestHandler
registered with a name that doesn't start with a slash, they can't use the
new URL structure and they have to use the old one.

DAMN! ... that is slick dude ... okay, i agree with you, the odds of that
causing problems are pretty fucking low.

I'm still hung up on this "parse" logic thing ... i really think it needs
to be in the path .. or at the very least, there needs to be a way to
specify it in the path to force one behavior or another, and if it's not
in the path then we can guess based on the Content-Type.

Putting it in a query arg would make getting it without contaminating the
POST body kludgy, putting it at the start of the path doesn't work well
for supporting a default if it isn't there, and putting it at the end of
the PATH messes up the nice work you've done letting RequestHandlers have
extra path info for encoding info they want.

Hmmmm...

What if we did soemthing like this...

   /exec/handler/name:extra/path?param1=val1
   /raw/handler/name:extra/path?param1=val1
   /url/handler/name:extra/path?param1=val1&url=...&url=...
   /file/handler/name:extra/path?param1=val1&file=...&file=...

where "exec" means guess based on the Content-TYpe, "raw" means use the
POST body as a single stream regardless of Content-Type, etc...

thoughts?


-Hoss


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Ryan McKinley <ry...@gmail.com>.
>
> easy thing to deal with just by scoping the URLs .. put something,
> ANYTHING, in front of these urls, that isn't "select" or "update" and

I'll let you and Yonik decide this one.  I'm fine either way, but I
really don't see a problem letting people easily override URLs.  I
actually think it is a good thing.


>
> consider the case where a user today has this in his solrconfig...
>
>   <requestHandler name="select" class="solr.StandardRequestHandler">
>

To be clear, (with the current implementation in SOLR-104) you would
have to put this in your solrconfig.xml

<requestHandler name="/select" class="solr.StandardRequestHandler">

Notice the preceding '/'.  I think this is a strong indication that
someone *wants* /select to behave distinctly.

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Yonik Seeley <yo...@apache.org>.
On 1/19/07, Ryan McKinley <ry...@gmail.com> wrote:
> All that said, this could just as cleanly map everything to:
>   /solr/dispatch/update/xml
>   /solr/cmd/update/xml
>   /solr/handle/update/xml
>   /solr/do/update/xml
>
> thoughts?

That was my original assumption (because I was thinking of using
servlets, not a filter),
but I see little advantage to scoping under additional path elements.
I also agree with the other points you make.

-Yonik

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Chris Hostetter <ho...@fucit.org>.
: > A user should be confident that they can pick anyname they possily want
: > for their plugin, and it won't collide with any future addition we might
: > add to Solr.
:
: But that doesn't seem possible unless we make user plugins
: second-class citizens by scoping them differently.  In the event there
: is a collision in the future, the user could rename one of the
: plugins.

when it comes to URLs, our plugins currently are second class citizens --
plugin names appear in the "qt" or "wt" params -- users can pick any names
they want and they are totally legal, they don't have to worry about any
possibility that a name they pick will collide with a path we have mapped
to a servlet.

Users shouldn't have to change the names of requestHandlers just because
Solr adds a new feature with the same name -- changing a requestHandler
name could be a heavy burden for a Solr user to make depending on how many
clients *they* have using that requestHandler with that name.  i wouldn't
make a big deal out of this if it was unavoidable -- but it is such an
easy thing to deal with just by scoping the URLs .. put something,
ANYTHING, in front of these urls, that isn't "select" or "update" and
then put the requestHandler name and we've now protected ourself and our
users.

consider the case where a user today has this in his solrconfig...

  <requestHandler name="select" class="solr.StandardRequestHandler">

..with the URL structure you guys are talking about, with the
DispatchFilter matching on /* and interpreting the first part of the path
as a possible requestHandler name, that user can't upgrade Solr
because he's relying on the old "/select?qt=select" style URLs to
work ... he has to change the name of his requestHandler and all of his
clients, then upgrade, then change all of his clients again to take
advantage of the new URL structure (and the new features it provides for
updates)



-Hoss


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Yonik Seeley <yo...@apache.org>.
On 1/20/07, Chris Hostetter <ho...@fucit.org> wrote:
> the thing about Solr, is there really aren't a lot of "defaults" in the
> sense you mean ... there is just an example -- people might copy the
> example, but if they don't have something in their solrconfig, most things
> just aren't there....

I expect that most users will fall into that category though.  A
minority use custom request handlers and I expect a vast minority to
use custom update handlers.

> A user should be confident that they can pick anyname they possily want
> for their plugin, and it won't collide with any future addition we might
> add to Solr.

But that doesn't seem possible unless we make user plugins
second-class citizens by scoping them differently.  In the event there
is a collision in the future, the user could rename one of the
plugins.

The same type of collision can happen today with our current request
handler framework, but I don't think it's worth uglifying URLs over.
It will be very rare and there are ways to easily work around it.

-Yonik

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Ryan McKinley <ry...@gmail.com>.
>
> i would really feel a lot happier with something like these that you
> mentioned...
>

If it will make you happier, then I think it's a good idea!  (even if i
don't see it as a Problem)

> :   /solr/dispatch/update/xml
> :   /solr/cmd/update/xml
> :   /solr/handle/update/xml
> :   /solr/do/update/xml
>
> http://${host}:${port}/${context}/do/${parser}/${handler/with/optional/slashes}?${params}
>

(assuming the number of parsers is <3 and solr.war would only have 1), how about:

http://${host}:${port}/${context}/${parser}/${handler/with/optional/slashes}?${params}

Thoughts on the default parser name?  'do' gives me the struts heebie-jeebies :)

>
> we can still handle...
>
> http://${host}:${port}/${context}/select/?qt=${handler}&${params}
>
> ..with a really simple ServletFilter (that has no risk of collision with
> the new URL structure, so it can go anywhere in the FilterChain)
>

yes.  likewise with /update

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Chris Hostetter <ho...@fucit.org>.

: > then all is fine and dandy ... but what happens if someone tries to
: > configure a plugin with the name "admin" ... now all of the existing admin

: that is exactly what you would expect to happen if you map a handler
: to /admin.  The person configuring solrconfig.xml is saying "Hey, use
: this instead of the default /admin.  I want mine to make sure you are
: logged in using my custom authentication method."  In addition, It may
: be reasonable (sometime in the future) to implement /admin as a
: RequestHandler.  This could be a clean way to address SOLR-58  (xml
: with stylesheets, or JSON, etc...)

yeah i guess that wouldn't be too horrible ... i think what i was trying
to point out was that if we rolled out these super simple urls containing
just the plugin name and someone did register a plugin overriding the
admin pages, we'd screw them over later when we did get around to
replacing the admin pages with a plugin, if we added it as a special
override ServletFilter mapping

: > also: what happens a year from now when we add some completely new
: > Servlet/ServletFilter to Solr, and want to give it a unique URL...
: >
: >   http://host:9999/solr/bar/

: obviously, I think the default solr settings should be prudent about
: selecting URLs.  The standard configuration should probably map most
: things to /select/xxx or /update/xxx.

the thing about Solr, is there really aren't a lot of "defaults" in the
sense you mean ... there is just an example -- people might copy the
example, but if they don't have something in their solrconfig, most things
just aren't there....

: > ...we could put it earlier in the processing chain before the existing
: > ServletFilter, but then we break any users that have registered a plugin
: > with the name "bar".
:
: Even if we move this to have a prefix path, we run into the exact same
: issue when sometime down the line solr has a default handler mapped to
: 'bar'

the point i was trying to make is that the "namespaces" that Solr uses
should be unique -- the piece of the URL path that is used to pick the
Servlet or filter for dispatching the request, should be uniquely
distinguishable from the piece of the URL that is used to lookup a plugin.
A user should be confident that they can pick anyname they possily want
for their plugin, and it won't collide with any future addition we might
add to Solr.

if the new and improved solr URLs (minus host:port/context) are
just /${plugin}/... with a dispatcher that matches on any URL and checks
that path for a plugin matching that name, then we have no way of ever
adding any other URL for a new feature in the future without running the
risk that whatever base path we pick for that new feature's URLs, we might
screw over a user who just so happened to pick that feature's name when
registering a plugin -- either because we put the new feature earlier in
the FilterChain and it circumvents requests the user expects to go to that
plugin, or because we put that feature later in the FilterChain and that
user doesn't get to take advantage of it unless he changes the name he
registered the plugin with (and changes all of his clients)

i would really feel a lot happier with something like these that you
mentioned...

:   /solr/dispatch/update/xml
:   /solr/cmd/update/xml
:   /solr/handle/update/xml
:   /solr/do/update/xml

http://${host}:${port}/${context}/do/${parser}/${handler/with/optional/slashes}?${params}

....sounds great to me... just as long as we have some constant prefix in
there so that later on we can use something else.

we can still handle...

http://${host}:${port}/${context}/select/?qt=${handler}&${params}

..with a really simple ServletFilter (that has no risk of collision with
the new URL structure, so it can go anywhere in the FilterChain)



-Hoss


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Ryan McKinley <ry...@gmail.com>.
> then all is fine and dandy ... but what happens if someone tries to
> configure a plugin with the name "admin" ... now all of the existing admin
> pages break.
>

that is exactly what you would expect to happen if you map a handler
to /admin.  The person configuring solrconfig.xml is saying "Hey, use
this instead of the default /admin.  I want mine to make sure you are
logged in using my custom authentication method."  In addition, it may
be reasonable (sometime in the future) to implement /admin as a
RequestHandler.  This could be a clean way to address SOLR-58  (xml
with stylesheets, or JSON, etc...)


> also: what happens a year from now when we add some completely new
> Servlet/ServletFilter to Solr, and want to give it a unique URL...
>
>   http://host:9999/solr/bar/
>

obviously, I think the default solr settings should be prudent about
selecting URLs.  The standard configuration should probably map most
things to /select/xxx or /update/xxx.

> ...we could put it earlier in the processing chain before the existing
> ServletFilter, but then we break any users that have registered a plugin
> with the name "bar".

Even if we move this to have a prefix path, we run into the exact same
issue when sometime down the line solr has a default handler mapped to
'bar'

/solr/dispatcher/bar

But, if it ever becomes a problem, we can add an "excludes" pattern to
the filter-config that would skip processing even if it maps to a
known handler.

>
> more short term: if there is no prefix that the ServletFilter requires,
> then supporting the legacy "http://host:9999/solr/update" and
> "http://host:9999/solr/select" URLs becomes harder,

I don't think /update or /select need to be legacy URLs.  They can
(and should) continue to work as they currently do using a new framework.

The reason I was suggesting that the Handler interface adds support to
ask for the default RequestParser and/or ResponseWriter is to support
this exact issue.  (However in the case of path="/select" the filter
would need to get the handler from ?qt=xxx)

- - - - -

All that said, this could just as cleanly map everything to:
  /solr/dispatch/update/xml
  /solr/cmd/update/xml
  /solr/handle/update/xml
  /solr/do/update/xml

thoughts?

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Chris Hostetter <ho...@fucit.org>.
: > On 1/19/07, Chris Hostetter <ho...@fucit.org> wrote:
: > > whoa ... hold on a minute, even if we use a ServletFilter do do all of the
: > > dispatching instead of a Servlet we still need a base path right?

: > I thought that's what the filter gave you... the ability to filter any
: > URL to the /solr webapp, and Ryan was doing a lookup on the next
: > element for a request handler.

: yes, this is the beauty of a Filter.  It *can* process the request
: and/or it can pass it along.  There is no problem at all with mapping
: a filter to all requests and a servlet to some paths.  The filter will
: only handle paths declared in solrconfig.xml everything else will be
: handled however it is defined in web.xml

sorry ... i know that a ServletFilter can look at a request, choose to
process it, or choose to ignore it ... my point was that if we use a
Filter, we should still put in that filter logic to only look at requests
starting with a fixed prefix.

consider this URL...

  http://host:9999/solr/foo/

...where "solr" is the webapp name as usual.

if the filter matches on "/*" and then does a lookup in the solrconfig for
"foo" to find the Plugin to use for that request, and ignores the request
and passes it down the chain if one isn't configured with the name "foo"
then all is fine and dandy ... but what happens if someone tries to
configure a plugin with the name "admin" ... now all of the existing admin
pages break.

also: what happens a year from now when we add some completely new
Servlet/ServletFilter to Solr, and want to give it a unique URL...

  http://host:9999/solr/bar/

...we could put it earlier in the processing chain before the existing
ServletFilter, but then we break any users that have registered a plugin
with the name "bar".

more short term: if there is no prefix that the ServletFilter requires,
then supporting the legacy "http://host:9999/solr/update" and
"http://host:9999/solr/select" URLs becomes harder, because how do we
safely tell if the remote client is expecting the legacy behavior of those
URLs, or if we are trying to support some plugin configured using the
names "select" and "update" ?


-Hoss


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Ryan McKinley <ry...@gmail.com>.
On 1/19/07, Yonik Seeley <yo...@apache.org> wrote:
> On 1/19/07, Chris Hostetter <ho...@fucit.org> wrote:
> > whoa ... hold on a minute, even if we use a ServletFilter do do all of the
> > dispatching instead of a Servlet we still need a base path right?
>
> I thought that's what the filter gave you... the ability to filter any
> URL to the /solr webapp, and Ryan was doing a lookup on the next
> element for a request handler.
>

yes, this is the beauty of a Filter.  It *can* process the request
and/or it can pass it along.  There is no problem at all with mapping
a filter to all requests and a servlet to some paths.  The filter will
only handle paths declared in solrconfig.xml; everything else will be
handled however it is defined in web.xml

(As a sidenote, wicket 2.0 replaces their dispatch servlet with a
filter - it makes it MUCH easier to have their app co-exist with other
things in a shared URL structure.)

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Yonik Seeley <yo...@apache.org>.
On 1/20/07, Ryan McKinley <ry...@gmail.com> wrote:
> >
> > what!? .. really????? ... you don't think the ones i mentioned before are
> > things we should support out of the box?
> >
> >   - no stream parser (needed for simple GETs)
> >   - single stream from raw post body (needed for current updates
> >   - multiple streams from multipart mime in post body (needed for SOLR-85)
> >   - multiple streams from files specified in params (needed for SOLR-66)
> >   - multiple streams from remote URL specified in params
> >
>
> I have imagined the single default parser handles *all* the cases you
> just mentioned.

Yes, this is what I had envisioned.
And if we come up with another cool standard one, we can add it and
all the current/older handlers get that additional behavior for free.

-Yonik

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Chris Hostetter <ho...@fucit.org>.
: >   throw new SolrException( 400, "missing parameter: "+p );
: >
: > This will return 400 with a message "missing parameter: " + p.
: >
: > Exceptions or SolrExceptions with code=500 || code<100 are sent to
: > client with status code 500 and a full stack trace.
:
: That all seems ideal to me, but there had been talk in the past about
: formatted responses on errors.  Given that even update handlers can
: return full responses, I don't see the point of formatted (XML,etc)
: response bodies when an exception is thrown.

I can't find the thread at the moment, but as I recall, there was once
some consensus that while errors should definitely be returned with
appropriate HTTP status codes, and the exception message should be
included in the status line, the QueryResponseWriter would be given an
opportunity to format the Exception -- the rationale being that all
clients should check the HTTP status code, and if it's not 2xx, then they
should use the status message for simple error reporting, but if they want
more details they can check the Content-Type of the response and if it
matches what they were expecting, they can get the detailed error info
from it.

So if you are writing a python client and expecting python back, the stack
trace will be formatted in python so you can easily parse it ... if you are
expecting XML back, the stack trace will be formatted in XML, etc...

i think the only time the dispatcher should return an html (or plain text)
error page is if it encounters an exception before it can extract the
writer to use from the request params, or if the exception is in the
ResponseWriter itself.

This would be one reason to leave getException() in the SolrQueryResponse
interface ... it lets us keep the API the same for ResponseWriters (no
need to add a new writeErrorPage(Exception) method) ... another advantage
to keeping that encapsulation is it gives the ResponseWriters the ability
to generate pages which contain the partial results from the
RequestHandler (prior to encountering an exception) as well the Exception
itself.


-Hoss


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jan 21, 2007, at 2:39 PM, Yonik Seeley wrote:

> On 1/21/07, Ryan McKinley <ry...@gmail.com> wrote:
>> >
>> > So is everyone happy with the way that errors are currently  
>> reported?
>> > If not, now (or right after this is committed), is the time to  
>> change
>> > that.  /solr/select?qt=myhandler  should be backward  
>> compatible, but
>> > /solr/myhandler doesn't need to be.  Same for the update stuff.
>> >
>>
>> In SOLR-104, all exceptions are passed to the client as HTTP Status
>> codes with the message.  If you write:
>>
>>   throw new SolrException( 400, "missing parameter: "+p );
>>
>> This will return 400 with a message "missing parameter: " + p.
>>
>> Exceptions or SolrExceptions with code=500 || code<100 are sent to
>> client with status code 500 and a full stack trace.
>
> That all seems ideal to me, but there had been talk in the past about
> formatted responses on errors.  Given that even update handlers can
> return full responses, I don't see the point of formatted (XML,etc)
> response bodies when an exception is thrown.
> Just making sure there's a consensus.

Being able to check the HTTP status code to determine if there is an  
error, rather than having to parse XML and get a Solr-specific status  
code seems best for the Ruby work we're doing.  I'll confer with the  
others working on it and report back if they have any suggestions for  
improvement also.

	Erik


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Yonik Seeley <yo...@apache.org>.
On 1/21/07, Ryan McKinley <ry...@gmail.com> wrote:
> >
> > So is everyone happy with the way that errors are currently reported?
> > If not, now (or right after this is committed), is the time to change
> > that.  /solr/select?qt=myhandler  should be backward compatible, but
> > /solr/myhandler doesn't need to be.  Same for the update stuff.
> >
>
> In SOLR-104, all exceptions are passed to the client as HTTP Status
> codes with the message.  If you write:
>
>   throw new SolrException( 400, "missing parameter: "+p );
>
> This will return 400 with a message "missing parameter: " + p.
>
> Exceptions or SolrExceptions with code=500 || code<100 are sent to
> client with status code 500 and a full stack trace.

That all seems ideal to me, but there had been talk in the past about
formatted responses on errors.  Given that even update handlers can
return full responses, I don't see the point of formatted (XML,etc)
response bodies when an exception is thrown.
Just making sure there's a consensus.

-Yonik

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Ryan McKinley <ry...@gmail.com>.
>
> So is everyone happy with the way that errors are currently reported?
> If not, now (or right after this is committed), is the time to change
> that.  /solr/select?qt=myhandler  should be backward compatible, but
> /solr/myhandler doesn't need to be.  Same for the update stuff.
>

In SOLR-104, all exceptions are passed to the client as HTTP Status
codes with the message.  If you write:

  throw new SolrException( 400, "missing parameter: "+p );

This will return 400 with a message "missing parameter: " + p.

Exceptions or SolrExceptions with code=500 || code<100 are sent to
client with status code 500 and a full stack trace.

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Chris Hostetter <ho...@fucit.org>.
: > >  3) there's a comment in RequestHandlerBase.init about "indexOf" that
: > > comes from the existing impl in DismaxRequestHandler -- but doesn't match
: > > the new code ... i also wasn't certain that the change you made matches

: > I just copied the code from DismaxRequestHandler and made sure it
: > passes the tests.  I don't totally understand what that case is doing.
:
: The first iteration of dismax (before we did generic defaults,
: invariants, etc for request handlers) took defaults directly from the
: init params, and that is what that case is checking for and

bingo .. the reason it jumped out at me in your patch, is that the comment
still referred to indexOf, but the code didn't ... it might be functionally
equivalent, i just wasn't sure when i did my quick read.

there's mention in the comment that indexOf is used so that <null
name="defaults" /> can indicate that you don't want all the init params as
defaults, but you don't actually want defaults either -- but there
doesn't seem to be a test for that case.

you can see support for the legacy defaults syntax in
src/test/test-files/solr/conf/solrconfig.xml if you grep for
dismaxOldStyleDefaults



-Hoss


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Yonik Seeley <yo...@apache.org>.
On 1/21/07, Ryan McKinley <ry...@gmail.com> wrote:
> >
> > I don't think i'll have time to look at your new patch today, design wise
> > i think you are right, but there was still stuff that needed to be
> > refactored out of core.update and into the UpdateHandler wasn't there?
> >
>
> Yes, I avoided doing that in an effort to minimize refactoring and
> focus just on adding ContentStreams to RequestHandlers.

Sounds like a good idea.  It's easier to review and process in smaller
steps if practical.

> I just posted (yet another) update to SOLR-104.  This one moves the
> core.update logic into UpdateRequestHandler, and adds some glue to make
> old requests behave as they used to.

Cool!

> I also deprecated the exception in SolrQueryResponse.  Handlers should
> throw the exception, not put it in the response.  (If you want error
> messages, put that in the response, not the exception)

Agreed.  I can't for the life of me remember *why* I did that.
I think it was because I thought ResponseHandlers might format the exception.

> >  3) there's a comment in RequestHandlerBase.init about "indexOf" that
> > comes from the existing impl in DismaxRequestHandler -- but doesn't match
> > the new code ... i also wasn't certain that the change you made matches
> > the old semantics for dismax (i don't think we have a unit test for that
> > case)
>
> When you get a chance to look at the patch, can you investigate this?
> I just copied the code from DismaxRequestHandler and made sure it
> passes the tests.  I don't totally understand what that case is doing.

The first iteration of dismax (before we did generic defaults,
invariants, etc for request handlers) took defaults directly from the
init params, and that is what that case is checking for and
replicating.... if there isn't a "defaults" in the list, it assumes
the entire list is defaults.

It's only needed for dismax since other handlers didn't support
"defaults" until later.

-Yonik

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Ryan McKinley <ry...@gmail.com>.
>
> I don't think i'll have time to look at your new patch today, design wise
> i think you are right, but there was still stuff that needed to be
> refactored out of core.update and into the UpdateHandler wasn't there?
>

Yes, I avoided doing that in an effort to minimize refactoring and
focus just on adding ContentStreams to RequestHandlers.

I just posted (yet another) update to SOLR-104.  This one moves the
core.update logic into UpdateRequestHandler, and adds some glue to make
old requests behave as they used to.

I also deprecated the exception in SolrQueryResponse.  Handlers should
throw the exception, not put it in the response.  (If you want error
messages, put that in the response, not the exception)

It still needs some cleanup and some idea what data/messages should be
returned in the SolrResponse.

The bottom of http://localhost:8983/solr/test.html has a form calling
/update2 with posted XML so you can see the output


> a couple of minor comments i had when i read the last patch (but didn't
> mention since i was focusing on design issues) ...
>
>  1) why rename the servlets "Legacy*" instead of just marking them deprecated?

In the new version, I got rid of both Servlets and am handling the
'legacy' cases explicitly in the dispatch filter.  This minimizes the
duplicated code and keeps things consisten.


>  2) getSourceId and getSoure need to be left in the concrete Handlers so
> they get illed in with the correct file version info on checkout.

done.

>  3) there's a comment in RequestHandlerBase.init about "indexOf" that
> comes form the existing impl in DismaxRequestHandler -- but doesn't match
> the new code ... i also wasn't certain that the change you made matches
> the old semantics for dismax (i don't think we have a unit test for that
> case)

When you get a chance to look at the patch, can you investigate this.
I just copied the code from DismaxRequestHandler and made sure it
passes the tests.  I don't totally understand what that case is doing.


>  4) ContentStream.getFieldName() would proabably be more general as
> ContentStream.getSourceInfo() ...

done.

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Yonik Seeley <yo...@apache.org>.
On 1/21/07, Chris Hostetter <ho...@fucit.org> wrote:
> : > The bugaboo is if the POST data is NOT in fact
> : > application/x-www-form-urlencoded but the user agent says it is -- as
> : > both of you have indicated can be the case when using curl.  Could that
> : > be why Yonik thought POST params was broken?
> :
> : Correct.  That's the format that post.sh in the example sends
> : (application/x-www-form-urlencoded) and we ignore it in the update
> : handler and always treat the body as binary.
> :
> : Now if you wanted to add some query args to what we already have, you
> : can't use getParameterMap().
>
> I think i mentioned this before, but I think what we should do is make the
> stream "guessing" code in the Dispatcher/RequestBuilder very strict, and
> make it's decisison about how to treat the post body entirely based on the
> Content-Type ... meanwhile the existing (eventually know as "old") way of
> doing updates via "/update" to the UpdateServlet can be more lax, and
> assume everything is a raw POST of XML.
>
> we can change post.sh to spcify XML as the Content-Type by default,
> modify the example schema to have other update handlers registered with
> names like "/update/csv" and eventually add an "/update/xml" encouraging
> people to use it if they want to send updates as xml dcouments, regardless
> of wehter htey want to POST them raw, uplodae them, or identify them by
> filename -- as long as they are explicit about their content type.

I think I agree with all that.

A long time ago in this thread, I remember saying that new URLs are an
opportunity to change request/response formats w/o worrying about
backward compatibility.

So is everyone happy with the way that errors are currently reported?
If not, now (or right after this is committed), is the time to change
that.  /solr/select/qt="myhandler"  should be backward compatible, but
/solr/myhandler doesn't need to be.  Same for the update stuff.

-Yonik

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Yonik Seeley <yo...@apache.org>.
On 1/21/07, Chris Hostetter <ho...@fucit.org> wrote:
> At the bottom of this email is a quick and dirty servlet i just tried to
> prove to myself that posting with params in the URL and the body worked
> fine ...

I tried that by simply posting to the Solr standard request handler
(it echoes params in the example config), and yes, it worked fine. The
problem is if the body should be the stream, and the content-type is
wrong (and we currently send it wrong with curl).

> The nut shell being: i'm totally on board with Ryan's simple URL scheme,
> having a single RequestParser/SolrRequestBuilder, going with an entirely
> "inspection" based approach for deciding where the streams come from, and
> leaving all mention of parsers or "stream.type" out of the URL.
>
> (because i have a good idea of how to support it in a backwards campatible
> way *later*)

Ahhhh.

-Yonik

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Chris Hostetter <ho...@fucit.org>.
: Great!  I just posted an update to SOLR-104 that I hope will make you happy.

Dude ... i can *not* keep up with you.

: If i'm following our discussion correctly, I *think* this takes care
: of all the major issues we have.

I don't think i'll have time to look at your new patch today, design wise
i think you are right, but there was still stuff that needed to be
refactored out of core.update and into the UpdateHandler wasn't there?

a couple of minor comments i had when i read the last patch (but didn't
mention since i was focusing on design issues) ...

 1) why rename the servlets "Legacy*" instead of just marking them deprecated?
 2) getSourceId and getSoure need to be left in the concrete Handlers so
they get illed in with the correct file version info on checkout.
 3) there's a comment in RequestHandlerBase.init about "indexOf" that
comes form the existing impl in DismaxRequestHandler -- but doesn't match
the new code ... i also wasn't certain that the change you made matches
the old semantics for dismax (i don't think we have a unit test for that
case)
 4) ContentStream.getFieldName() would proabably be more general as
ContentStream.getSourceInfo() ... it could stay as it is for files/urls,
but raw posts and multipart posts could have a usefull debuging
description as well.




-Hoss


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Ryan McKinley <ry...@gmail.com>.
>
> The nut shell being: i'm totally on board with Ryan's simple URL scheme,
> having a single RequestParser/SolrRequestBuilder, going with an entirely
> "inspection" based approach for deciding where the streams come from, and
> leaving all mention of parsers or "stream.type" out of the URL.
>
> (because i have a good idea of how to support it in a backwards campatible
> way *later*)
>

Great!  I just posted an update to SOLR-104 that I hope will make you happy.

It moved the various request parsing methods into distinct classes
that could easily be pluggable if that is necessary.  As written, It
supports stream.type="raw|multipart|simple|standard"  We can comment
that out and use 'standard' for everything as a first pass.

I added configuation to solrconfig.xml:
  <requestParsers enableRemoteStreaming="true"
multipartUploadLimitInKB="2048" />

I removed LegacySelectServlet and added an explicit check in the
DispatchFilter for paths starting with "/select"  This seems like a
better idea as the logic and expected results are identical.

If i'm following our discussion correctly, I *think* this takes care
of all the major issues we have.

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Chris Hostetter <ho...@fucit.org>.
: > The bugaboo is if the POST data is NOT in fact
: > application/x-www-form-urlencoded but the user agent says it is -- as
: > both of you have indicated can be the case when using curl.  Could that
: > be why Yonik thought POST params was broken?
:
: Correct.  That's the format that post.sh in the example sends
: (application/x-www-form-urlencoded) and we ignore it in the update
: handler and always treat the body as binary.
:
: Now if you wanted to add some query args to what we already have, you
: can't use getParameterMap().

I think i mentioned this before, but I think what we should do is make the
stream "guessing" code in the Dispatcher/RequestBuilder very strict, and
make it's decisison about how to treat the post body entirely based on the
Content-Type ... meanwhile the existing (eventually know as "old") way of
doing updates via "/update" to the UpdateServlet can be more lax, and
assume everything is a raw POST of XML.

we can change post.sh to spcify XML as the Content-Type by default,
modify the example schema to have other update handlers registered with
names like "/update/csv" and eventually add an "/update/xml" encouraging
people to use it if they want to send updates as xml dcouments, regardless
of wehter htey want to POST them raw, uplodae them, or identify them by
filename -- as long as they are explicit about their content type.



-Hoss


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Yonik Seeley <yo...@apache.org>.
On 1/21/07, J.J. Larrea <jj...@panix.com> wrote:
> The bugaboo is if the POST data is NOT in fact application/x-www-form-urlencoded but the user agent says it is -- as both of you have indicated can be the case when using curl.  Could that be why Yonik thought POST params was broken?

Correct.  That's the format that post.sh in the example sends
(application/x-www-form-urlencoded) and we ignore it in the update
handler and always treat the body as binary.

Now if you wanted to add some query args to what we already have, you
can't use getParameterMap().

-Yonik

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by "J.J. Larrea" <jj...@panix.com>.
At 1:20 AM -0800 1/21/07, Chris Hostetter wrote:
>: We need code to do that anyway since getParameterMap() doesn't support
>: getting params from the URL if it's a POST (I believe I tried this in
>: the past and it didn't work).
>
>Uh ... i'm pretty sure you are mistaken ... yep, i've just checked and you
>are *definitely* mistaken.
>
>getParameterMap will in fact pull out params from both the URL and the
>body if it's a POST -- but only if you have not allready accessed either
>getReader or getInputStream -- this was at the heart of my cumbersome
>preProcess/process API that we all agree now was way too complicated.

The rules are very explicitly laid out in the Servlet 2.4 specification:

-----
SRV.4.1.1 When Parameters Are Available
The following are the conditions that must be met before post form data will
be populated to the parameter set:
1. The request is an HTTP or HTTPS request.
2. The HTTP method is POST.
3. The content type is application/x-www-form-urlencoded.
4. The servlet has made an initial call of any of the getParameter family of methods on the request object.
If the conditions are not met and the post form data is not included in the
parameter set, the post data must still be available to the servlet via the request object's input stream. If the conditions are met, post form data will no longer be available for reading directly from the request object's input stream.
-----

As Hoss notes a POST request can still have GET-style parameters in the URL query string, and getParameterMap will return both sets intermixed for a POST meeting the above conditions.  And calling getParameterMap won't impede the ability to subsequently read the input stream if the conditions are not met: "the post data must still be available to the servlet".  So it's theoretically valid to simply call getParameterMap and then blindly call getInputStream (possibly catching an Exception), or else use the results of getParameterMap to decide whether and how to process the input stream.

The bugaboo is if the POST data is NOT in fact application/x-www-form-urlencoded but the user agent says it is -- as both of you have indicated can be the case when using curl.  Could that be why Yonik thought POST params was broken?

- J.J.

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Chris Hostetter <ho...@fucit.org>.
: > ...i was trying to avoid keeping the parser name out of the query string,
: > so we don't have to do any hack parsing of
: > HttpServletRequest.getQueryString() to get it.
:
: We need code to do that anyway since getParameterMap() doesn't support
: getting params from the URL if it's a POST (I believe I tried this in
: the past and it didn't work).

Uh ... i'm pretty sure you are mistaken ... yep, i've just checked and you
are *definitely* mistaken.

getParameterMap will in fact pull out params from both the URL and the
body if it's a POST -- but only if you have not allready accessed either
getReader or getInputStream -- this was at the heart of my cumbersome
preProcess/process API that we all agree now was way too complicated.

At the bottom of this email is a quick and dirty servlet i just tried to
prove to myself that posting with params in the URL and the body worked
fine ... i do rememebr reading up on this a few years back and verifying
that it's documented somewhere in the servlet spec, a quick google search
points this this article implying it was solidified in 2.2...

   http://java.sun.com/developer/technicalArticles/Servlets/servletapi/
   (grep for "Nit-picky on Parameters")


: Pluggable request parsers seems needlessly complex, and it gets harder
: to explain it all to someone new.
: Can't we start simple and defer anything like that until there is a real need?

Alas ... i appear to be getting worse at explaining myself in my old age.

What i was trying to say is that this idea i had for expressing
requestParsers as an optional prefix in fron of the requestHandler would
allow us to worry about the things i'm worried about *later* -- if/when
they become a problem (or when i have time to stop whinning, and actually
write the code)

The nut shell being: i'm totally on board with Ryan's simple URL scheme,
having a single RequestParser/SolrRequestBuilder, going with an entirely
"inspection" based approach for deciding where the streams come from, and
leaving all mention of parsers or "stream.type" out of the URL.

(because i have a good idea of how to support it in a backwards campatible
way *later*)



public class TestServlet extends HttpServlet {
  public void doPost(HttpServletRequest request, HttpServletResponse response)
    throws Exception {

    response.setContentType("text/plain");
    java.util.Map params = request.getParameterMap();
    for (Object k : params.keySet()) {
      Object v = params.get(k);
      if (v instanceof Object[]) {
        for (Object vv : (Object[])v) {
          response.getWriter().println(k.toString() + ":" + vv);
        }
      } else {
        response.getWriter().println(k.toString() + ":" + v);
      }
    }
  }
}

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Yonik Seeley <yo...@apache.org>.
On 1/20/07, Chris Hostetter <ho...@fucit.org> wrote:
> : I'm on board as long as the URL structure is:
> :   ${path/from/solr/config}?stream.type=raw
>
> actually the URL i was suggesting was...
>
>     ${parser/path/from/solr/config}${handler/path/from/solr/config}?param=val
>
> ...i was trying to avoid keeping the parser name out of the query string,
> so we don't have to do any hack parsing of
> HttpServletRequest.getQueryString() to get it.

We need code to do that anyway since getParameterMap() doesn't support
getting params from the URL if it's a POST (I believe I tried this in
the past and it didn't work).

Aesthetically, having an optional parser in the queryString seems
nicer than in the path.

> basically if you have this...
>
>   <requestParser name="/raw" class="solr.RawPostRequestParser" />
>   <requestParser name="/multi" class="solr.MultiPartRequestParser" />
>   <requestParser name="/nostream" class="solr.SimpleRequestParser" />

Pluggable request parsers seems needlessly complex, and it gets harder
to explain it all to someone new.
Can't we start simple and defer anything like that until there is a real need?

> if they really had a reason to want to force one type of parsing, they
> could register it with a differnet prefix.

That is a point.  I'm not sure of the usecases though... it's not safe
to let untrusted people update solr at all, so I don't understand
prohibiting certain types of streams.

>   * default URLs stay clean
>   * no need for an extra "stream.type" param
>   * urls only get ugly if people want them to get ugly because they don't
>     want to make their clients set the mime type correctly.

The first and last points are also true for a stream.type type of thing.
After all, we will need other parameters for specifying local files,
right?  Or is opening local files up to the RequestHandler again?

Anyway, I'm not too unhappy either way, as long as I can leave out any
explicit "parser" and just get the right thing to happen.

-Yonik

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Chris Hostetter <ho...@fucit.org>.
On Sat, 20 Jan 2007, Ryan McKinley wrote:

: Date: Sat, 20 Jan 2007 19:17:16 -0800
: From: Ryan McKinley <ry...@gmail.com>
: Reply-To: solr-dev@lucene.apache.org
: To: solr-dev@lucene.apache.org
: Subject: Re: Update Plugins (was Re: Handling disparate data sources in
:     Solr)
:
: >
: > ...what if we bring that idea back, and let people configure it in the
: > solrconfig.xml, using path like names...
: >
: >   <requestParser name="/raw" class="solr.RawPostRequestParser" />
: >   <requestParser name="/multi" class="solr.MultiPartRequestParser" />
: >   <requestParser name="/nostream" class="solr.SimpleRequestParser" />
: >   <requestParser name="/guess" class="solr.UseContentTypeRequestParser" />
: >
: > ...but don't make it a *public* interface ... make it package protected,
: > or maybe even a private static interface of the Dispatch Filter .. either
: > way, don't instantiate instances of it using the plugin-lib ClassLoader,
: > make sure it comes from the WAR to only uses the ones provided out of hte
: > box.


: I'm on board as long as the URL structure is:
:   ${path/from/solr/config}?stream.type=raw

actually the URL i was suggesting was...

    ${parser/path/from/solr/config}${handler/path/from/solr/config}?param=val

...i was trying to avoid keeping the parser name out of the query string,
so we don't have to do any hack parsing of
HttpServletRequest.getQueryString() to get it.

basically if you have this...

  <requestParser name="/raw" class="solr.RawPostRequestParser" />
  <requestParser name="/multi" class="solr.MultiPartRequestParser" />
  <requestParser name="/nostream" class="solr.SimpleRequestParser" />

  <requestHandler name="/update/commit" class="solr.CommitRequestHandler"/>
  <requestHandler name="/update" class="solr.UpdateRequestHandler" />
  <requestHandler name="/xml" class="solr.XmlQueryRequestHandler" />

...then these urls are all valid...

   http://localhost:9999/solr/raw/update?param=val
      ..uses raw post body for update
   http://localhost:9999/solr/multi/update?param=val
      ..uses multipart mime for update
   http://localhost:9999/solr/update?param=val
      ..no requestParser matched path prefix, so default is choosen and
        COntent-Type is used to decide where streams come from.

but if instead my config looks like this...

  <requestParser name="" class="solr.MultiPartRequestParser" />
  <requestParser name="/raw" class="solr.RawPostRequestParser" />

  <requestHandler name="/update/commit" class="solr.CommitRequestHandler"/>
  <requestHandler name="/update" class="solr.UpdateRequestHandler" />
  <requestHandler name="/xml" class="solr.XmlQueryRequestHandler" />

...then these URLs would fail...

   http://localhost:9999/solr/raw/update?param=val
   http://localhost:9999/solr/multi/update?param=val

...because the empty string would match as a parser, but "/raw/update"
and "/multi/update" wouldn't match as requestHandlers (the registration of
"/raw" as a parser would be useless)

this URL would work however...

   http://localhost:9999/solr/update?param=val
      ..treat all requetss as if they have multi-part mime streams

...i use this only as an example of what i'm describing ... not sa an
example of soemthing we shoudl recommend.

The key to all of this being that we'd check parser names against the URL
prefix in order from shortest to longest, then check the rest of the path
as a requestHandler ... if either of those fail, then the filter would
skip the request.

What we would probably recommended is that people map the "guess" request
parser to "/" so that they could put in all of hte options they want on
buffer sizes and such, then map their requestHandlers without a "/"
prefix, and use content types correctly.

if they really had a reason to want to force one type of parsing, they
could register it with a differnet prefix.

  * default URLs stay clean
  * no need for an extra "stream.type" param
  * urls only get ugly if people want them to get ugly because they don't
    want to make their clients set the mime type correctly.




-Hoss


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Ryan McKinley <ry...@gmail.com>.
>
> ...what if we bring that idea back, and let people configure it in the
> solrconfig.xml, using path like names...
>
>   <requestParser name="/raw" class="solr.RawPostRequestParser" />
>   <requestParser name="/multi" class="solr.MultiPartRequestParser" />
>   <requestParser name="/nostream" class="solr.SimpleRequestParser" />
>   <requestParser name="/guess" class="solr.UseContentTypeRequestParser" />
>
> ...but don't make it a *public* interface ... make it package protected,
> or maybe even a private static interface of the Dispatch Filter .. either
> way, don't instantiate instances of it using the plugin-lib ClassLoader,
> make sure it comes from the WAR to only uses the ones provided out of hte
> box.
>

I'm on board as long as the URL structure is:
  ${path/from/solr/config}?stream.type=raw

and if you are missing the parameter it chooses a good option.

(stream.type can change, just that the parser is configured in the
query string, not he path)

I like it!


Also, this would give us a natural place to configure the max size etc
for multi-part upload

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Chris Hostetter <ho...@fucit.org>.
(the three of us are online way to much ... for crying out loud it's a
saturday night folks!)

: In my opinion, I don't think we need to worry about it for the
: *default* handler.  That is not a very difficult constraint and, there
: is no one out there expecting to be able to post parameters in the URL
: and the body.  I'm not sure it is worth complicating anything if this
: is the only thing we are trying to avoid.

you'd be suprised the number of people i've run into who expect thta to
work.

: I think the *default* should handle all the cases mentioned without
: the client worrying about different URLs  for the various methods.
:
: The next question is which (if any) of the explicit parsers you think
: are worth including in web.xml?

holy crap, i think i have a solution that will make all of us really
happy...

remember that idea we all really detested of a public plugin interface,
configured in the solrconfig.xml that looked like this...

     public interface RequestParser(
        SolrRequest parse(HttpServletRequest req);
     }

...what if we bring that idea back, and let people configure it in the
solrconfig.xml, using path like names...

  <requestParser name="/raw" class="solr.RawPostRequestParser" />
  <requestParser name="/multi" class="solr.MultiPartRequestParser" />
  <requestParser name="/nostream" class="solr.SimpleRequestParser" />
  <requestParser name="/guess" class="solr.UseContentTypeRequestParser" />

...but don't make it a *public* interface ... make it package protected,
or maybe even a private static interface of the Dispatch Filter .. either
way, don't instantiate instances of it using the plugin-lib ClassLoader,
make sure it comes from the WAR to only uses the ones provided out of hte
box.

then make the dispatcher check each URL first by seeeing if it starts with
the name of any registered requestParser ... if it doesn't then use the
default "UseContentTypeRequestParser" .. *then* it does what the rest of
ryans current Dispatcher does, taking the rest of hte path to pick a
request handler.

the bueaty of this approach, is that if no <requestParser/> tags appear in
the solrconfig.xml, then the URLs look exactly like you guys want, and the
request parsing / stream building semantics are exactly the same as they
are today ... if/when we (or maybe just "i") write those other
RequestParsers people can choose to turn them on (and change their URLs)
if they want, but if they don't they can keep having the really simple
URLs ... OR they could register something like this...

  <requestParser name="" class="solr.RawPostRequestParser" />

...and have really simple URLs, but be garunteed that they allways got
their streams from raw POST bodies.

This would also solve Ryans concern about allowing people to turn off
fetching streams from remote URLs (or from local files, a small concern i
had but hadn't mentioend yet since we had bigger fish to fry)



Thoughts?


-Hoss


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Ryan McKinley <ry...@gmail.com>.
>
> > consider the example you've got on your test.html page: "POST - with query
> > string" ... that doesn't obey the typical semantics of a POST with a query
> > string ... if you used the methods on HttpServletRequest to get the params
> > it would give you all the params it found both in the query strings *and*
> > in the post body.
>
> Blech.  I was wondering about that.  Sounds like bad form, but perhaps could be
> supported via something like
> /solr/foo?postbody=args
>

In my opinion, I don't think we need to worry about it for the
*default* handler.  That is not a very difficult constraint and, there
is no one out there expecting to be able to post parameters in the URL
and the body.  I'm not sure it is worth complicating anything if this
is the only thing we are trying to avoid.

I think the *default* should handle all the cases mentioned without
the client worrying about different URLs  for the various methods.

The next question is which (if any) of the explicit parsers you think
are worth including in web.xml?

http://${host}/${context}/${path/from/config}  (default)
http://${host}/${context}/params/${path/from/config} (used
getParameterMap() to fill args)
http://${host}/${context}/multipart/${path/from/config} (force
multipart request)
http://${host}/${context}/stream/${path/from/config} (params from URL,
body as stream)

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Yonik Seeley <yo...@apache.org>.
On 1/19/07, Chris Hostetter <ho...@fucit.org> wrote:
> whoa ... hold on a minute, even if we use a ServletFilter do do all of the
> dispatching instead of a Servlet we still need a base path right?

I thought that's what the filter gave you... the ability to filter any
URL to the /solr webapp, and Ryan was doing a lookup on the next
element for a request handler.

-Yonik

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Yonik Seeley <yo...@apache.org>.
On 1/20/07, Chris Hostetter <ho...@fucit.org> wrote:
> Ryan: this patch truely does kick ass ... we can probably simplify a lot
> of the Legacy stuff by leveraging your new StandardRequestBuilder -- but
> that can be done later.

Much is already done by the looks of it.

> i'm stil really not liking the way there is a single SolrRequestBuilder
> with a big complicated build method that "guesses" what streams the user
> wants.

But I don't need a separate URL to do GET vs POST in HTTP.
It seems like having a different URL for where you put the args would
be hard to explain to people.

>   i really feel strongly that even if all the parsing logic is in
> the core, even if it's all in one class: a piece of the path should be
> used to determine where the streams come from.

If there's a ? in the URL, then it's args, so that could always
safetly  be parsed.  Perhaps a special arg, if present, could override
the default method of getting input streams?

> consider the example you've got on your test.html page: "POST - with query
> string" ... that doesn't obey the typical semantics of a POST with a query
> string ... if you used the methods on HttpServletRequest to get the params
> it would give you all the params it found both in the query strings *and*
> in the post body.

Blech.  I was wondering about that.  Sounds like bad form, but perhaps could be
supported via something like
/solr/foo?postbody=args

-Yonik

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Chris Hostetter <ho...@fucit.org>.
: I just posted a new patch on SOLR-104.  I think it addresses most of
: the issues we have discussed.  (Its a little difficult to know as it
: has been somewhat circular)   I was going to reply to your points one
: by one, but i think that would just make the discussion more confusing
: then it already is!

Ryan: this patch truely does kick ass ... we can probably simplify a lot
of the Legacy stuff by leveraging your new StandardRequestBuilder -- but
that can be done later.

i'm stil really not liking the way there is a single SolrRequestBuilder
with a big complicated build method that "guesses" what streams the user
wants.   i really feel strongly that even if all the parsing logic is in
the core, even if it's all in one class: a piece of the path should be
used to determine where the streams come from.

consider the example you've got on your test.html page: "POST - with query
string" ... that doesn't obey the typical semantics of a POST with a query
string ... if you used the methods on HttpServletRequest to get the params
it would give you all the params it found both in the query strings *and*
in the post body.

This is a great example of what i was talking about: if i have no
intention of sending a stream, it should be possible for me to send params
in both the URL and in the POST body -- but in other cases i should be
able to POST some raw XML and still have params in the URL.

arguable: we could look at the Content-Type of the request and make the
assumption based on that -- but as i mentioned before, people don't
allways set the Content-TYpe perfectly.  if we used a URL fragment to
determine where the streams should come from we could be a lot more
confident that we know where the stream should come from -- and let the
RequestHandler decide if it wants to trust the ContentType

the multipart/mixed example i gave previously is another example -- your
code here assumes that should be given to the RequsetHandler as multiple
streams -- which is a great assumption to make for fileuploads, but which
gives me no way to POST multipart/mixed mime data that i want given to the
RequestHandler as a single ContentStream (so it can have access to all of
hte mime headers for each part)



-Hoss


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Ryan McKinley <ry...@gmail.com>.
I just posted a new patch on SOLR-104.  I think it addresses most of
the issues we have discussed.  (Its a little difficult to know as it
has been somewhat circular)   I was going to reply to your points one
by one, but i think that would just make the discussion more confusing
then it already is!

>
> > (i don't trust HTTP Client code -- but for the sake
> > of argument let's assume all clients are perfect) what happens when a
> > person wants to send a mim multi-part message *AS* the raw post body -- so
> > the RequestHandler gets it as a single ContentStream (ie: single input
> > stream, mime type of multipart/mixed) ?
>
> Multi-part posts will have the content-type set correctly, or it won't work.
> The big use-case I see is browser file upload, and they will set it correctly.
>

I don't see it as a big problem because we don't have to deal with
legacy streams yet.  No one is expecting their existing stream code to
work.  The only header values the SOLR-104 code relies on is
'multipart'  I think that is a reasonable constraint since it has to
be implemented properly for commons-file-upload to work.

ryan

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Yonik Seeley <yo...@apache.org>.
On 1/20/07, Yonik Seeley <yo...@apache.org> wrote:
> > It would be:
> > http://${context}/${path}?stream.type=post
>
> Yes!
> Feels like a much more natural place to me than as part of the path of the URL.
> Just need to hash out meaningful param names/values?

Oh, and I'm more interested in the semantics of those param/values,
and not what request parser it happens to get mapped to.  I'd vote for
different request parsers being an implementation detail, and keeping
those details (plugability) out of solrconfig.xml for now.

We could always add it later, but it's a lot tougher to remove things.

-Yonik

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Yonik Seeley <yo...@apache.org>.
On 1/20/07, Ryan McKinley <ry...@gmail.com> wrote:
> > >- put everyone
> > > understands how to put something in a URL.  if nothing else, think of
> > > putting the "parsetype" in the URL as a checksum that the RequestParaser
> > > can use to validate it's assumptions -- if it's not there, then it can do
> > > all of the intellegent things you think it should do, but if it is there
> > > that dictates what it should do.
> >
> > If it's optional in the args, I could be on board with that.
> >
>
> If its optional in the req.getQueryString() I'm in.
>
> Ignore my previous post about
> ${context}/multipart/asdgadsga
>
> It would be:
> http://${context}/${path}?stream.type=post

Yes!
Feels like a much more natural place to me than as part of the path of the URL.
Just need to hash out meaningful param names/values?

-Yonik

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Ryan McKinley <ry...@gmail.com>.
>
> >- put everyone
> > understands how to put something in a URL.  if nothing else, think of
> > putting the "parsetype" in the URL as a checksum that the RequestParaser
> > can use to validate it's assumptions -- if it's not there, then it can do
> > all of the intellegent things you think it should do, but if it is there
> > that dictates what it should do.
>
> If it's optional in the args, I could be on board with that.
>

If its optional in the req.getQueryString() I'm in.

Ignore my previous post about
${context}/multipart/asdgadsga

It would be:
http://${context}/${path}?stream.type=post

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Ryan McKinley <ry...@gmail.com>.
> :
> : This would drop the ':' from my proposed URL and change the scheme to look like:
> : /parser/path/the/parser/knows/how/to/extract/?params
>
> i was totally okay with the ":" syntax (although we should double check if
> ":" is actaully a legal unescaped URL character) .. but i'm confused by
> this new suggestions ... is "parser" the name of the parser in that
> example and "path/the/parser/knows/how/to/extract" data that the parser
> may use to build to SolrRequest with? (ie: perhaps the RequestHandler)
>
> would parser names be required to not have slashes in them in that case?
>

(working with the assumption that most cases can be defined by a
single request parser)

I am/was suggesting that a dispatch servlet/fliter has a single
request parser.  The default request parser will choose the handler
based on names defined in solrconfig.xml.  If someone needs a custom
RequestParser, it would be linked to a new servlet/filter (possibly)
mapped to a distinct prefix.

If it is not possible to handle most standard stream cases with a
single request parser, i will go back to the /path:parser format.

I suggest it is configured in web.xml because that is a configurable
place that is not solrconfg.xml.  I don't think it is or should be a
highly configurable component.


> :
> : Thank goodness you didn't!  I'm confident you won't let me (or anyone)
> : talk you into something like that!  You guys made a lot of good
>
> the point i was trying to make is that if we make a RequestParser
> interface with a "parseRequest(HttpServletRequest req)" method, it amouts
> to just as much badness -- the key is we can make that interface as long
> as all the implimentations are in the SOlr code base where we can keep an
> eye on them, and people have to go way, WAY, *WAY* into solr to start
> shanging them.
>
>

Yes, implementing a RequestParser is more like writing a custom
Servlet then adding a Tokenizer.

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Yonik Seeley <yo...@apache.org>.
On 1/20/07, Chris Hostetter <ho...@fucit.org> wrote:
> but the HTTP Client libraries in vaious languages don't allways make it
> easy to set Content-type -- and even if they do that doesn't mean the
> person using that library knows how to use it properly -

I think we have to go with common usages.  We neither rely on, nor
discard content-type in all cases.
- When it has a charset, believe it.
- When it says form-encoded, only believe it if there aren't args on
the URL (because many clients like curl default to
"application/x-www-form-urlencoded" for a post.

>- put everyone
> understands how to put something in a URL.  if nothing else, think of
> putting the "parsetype" in the URL as a checksum that the RequestParaser
> can use to validate it's assumptions -- if it's not there, then it can do
> all of the intellegent things you think it should do, but if it is there
> that dictates what it should do.

If it's optional in the args, I could be on board with that.

> (aren't you the one that convinced me a few years back that it was better
> to trust a URL then to trust HTTP Headers? ... because people understand
> URLs and put things in them, but they don't allways know what headers to
> send .. curl being the great example, it allways sends a Content-TYpe even
> if the user doesn't ask it to right?)

Well, for the update server, we do ignore the form-data stuff, but we
don't ignore the charset.

> : Multi-part posts will have the content-type set correctly, or it won't work.
> : The big use-case I see is browser file upload, and they will set it correctly.
>
> right, but my point is what if i want the multi-part POST body left alone
> so my RequestHandler can deal with it as a single stream -- if i set
> every header correctly, the "smart" parsing code will parse it -- which is
> why sometihng in the URL telling it *not* to parse it is important.

That sounds like a pretty rare corner case.

> : We should not preclude wacky handlers from doing things for
> : themselves, calling our stuff as utility methods.
>
> how? ... if there is one and only one RequestParser which makes the
> SolrRequest before the RequestHandler ever sees it, and parses the post
> body because the content-type is multipart/mixed how can a  wacky
> handler ever get access to the raw post body?

I wasn't thinking *that* whacky :-)
There are always other options, such as using your own servlet though.
 I don't think we should try to solve every case (the whole 80/20
thing).

-Yonik

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Chris Hostetter <ho...@fucit.org>.
: > that scares me ... not only does it rely on the client code sending the
: > correct content-type
:
: Not really... that would perhaps be the default, but the parser (or a
: handler) can make intelligent decisions about that.
:
: If you put the parser in the URL, then there's *that* to be messed up
: by the client.

but the HTTP Client libraries in vaious languages don't allways make it
easy to set Content-type -- and even if they do that doesn't mean the
person using that library knows how to use it properly -- put everyone
understands how to put something in a URL.  if nothing else, think of
putting the "parsetype" in the URL as a checksum that the RequestParaser
can use to validate it's assumptions -- if it's not there, then it can do
all of the intellegent things you think it should do, but if it is there
that dictates what it should do.

(aren't you the one that convinced me a few years back that it was better
to trust a URL then to trust HTTP Headers? ... because people understand
URLs and put things in them, but they don't allways know what headers to
send .. curl being the great example, it allways sends a Content-TYpe even
if the user doesn't ask it to right?)

: Multi-part posts will have the content-type set correctly, or it won't work.
: The big use-case I see is browser file upload, and they will set it correctly.

right, but my point is what if i want the multi-part POST body left alone
so my RequestHandler can deal with it as a single stream -- if i set
every header correctly, the "smart" parsing code will parse it -- which is
why sometihng in the URL telling it *not* to parse it is important.

: We should not preclude wacky handlers from doing things for
: themselves, calling our stuff as utility methods.

how? ... if there is one and only one RequestParser which makes the
SolrRequest before the RequestHandler ever sees it, and parses the post
body because the content-type is multipart/mixed how can a  wacky
handler ever get access to the raw post body?



-Hoss


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Yonik Seeley <yo...@apache.org>.
On 1/20/07, Chris Hostetter <ho...@fucit.org> wrote:
> : I have imagined the single default parser handles *all* the cases you
> : just mentioned.
>
> Ahhhhhhhhhhhh ... a lot of confusing things make more sense now. .. but
> some things are more confusing: If there is only one parser, and it
> decides what to do based entirely on param names and HTTP headers, then
> what's the point of having the parser name be part of the path in your
> URL design?

I didn't think it would be part of the URL anymore.

> : POST: depending on headers/content type etc you parse the body as a
> : single stream, multi-part files or read the params.
> :
> : It will take some careful design, but I think all the standard cases
> : can be handled by a single parser.
>
> that scares me ... not only does it rely on the client code sending the
> correct content-type

Not really... that would perhaps be the default, but the parser (or a
handler) can make intelligent decisions about that.

If you put the parser in the URL, then there's *that* to be messed up
by the client.

> (i don't trust HTTP Client code -- but for the sake
> of argument let's assume all clients are perfect) what happens when a
> person wants to send a mim multi-part message *AS* the raw post body -- so
> the RequestHandler gets it as a single ContentStream (ie: single input
> stream, mime type of multipart/mixed) ?

Multi-part posts will have the content-type set correctly, or it won't work.
The big use-case I see is browser file upload, and they will set it correctly.

> This may sound like a completely ridiculous idea, but consider the
> situation where someone is indexing email ... they've written a
> RequestHandler that knows how to parser multipart mime emails and
> convert them to documents, they want to POST them directly to Solr and let
> their RequestHandler deal with them as a single entity.

We should not preclude wacky handlers from doing things for
themselves, calling our stuff as utility methods.

> ..i think life would be a lot simpler if we kept the RequestParser name as
> part of hte URL, completely determined by the client (since the client
> knows what it's trying to send) ... even if there are only 2 or 3 types of
> RequestParsing being done.

Having to do different types of posts to different URLs doesn't seem
optimal, esp if we can do it in one.

-Yonik

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Alan Burlison <Al...@sun.com>.
Chris Hostetter wrote:

> : 1) I think it should be a ServletFilter applied to all requests that
> : will only process requests with a registered handler.
> 
> I'm not sure what "it" is in the above sentence ... i believe from the
> context of the rest of hte message you are you refering to
> using a ServletFilter instead of a Servlet -- i honestly have no opinion
> about that either way.

I thought a filter required you to open up the WAR file and change 
web.xml, or am I misunderstanding?

-- 
Alan Burlison
--

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Chris Hostetter <ho...@fucit.org>.
: I have imagined the single default parser handles *all* the cases you
: just mentioned.

Ahhhhhhhhhhhh ... a lot of confusing things make more sense now. .. but
some things are more confusing: If there is only one parser, and it
decides what to do based entirely on param names and HTTP headers, then
what's the point of having the parser name be part of the path in your
URL design?

: POST: depending on headers/content type etc you parse the body as a
: single stream, multi-part files or read the params.
:
: It will take some careful design, but I think all the standard cases
: can be handled by a single parser.

that scares me ... not only does it rely on the client code sending the
correct content-type (i don't trust HTTP Client code -- but for the sake
of argument let's assume all clients are perfect) what happens when a
person wants to send a mim multi-part message *AS* the raw post body -- so
the RequestHandler gets it as a single ContentStream (ie: single input
stream, mime type of multipart/mixed) ?

This may sound like a completely ridiculous idea, but consider the
situation where someone is indexing email ... they've written a
RequestHandler that knows how to parser multipart mime emails and
convert them to documents, they want to POST them directly to Solr and let
their RequestHandler deal with them as a single entity.


..i think life would be a lot simpler if we kept the RequestParser name as
part of hte URL, completely determined by the client (since the client
knows what it's trying to send) ... even if there are only 2 or 3 types of
RequestParsing being done.


-Hoss


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Ryan McKinley <ry...@gmail.com>.
>
> what!? .. really????? ... you don't think the ones i mentioned before are
> things we should support out of the box?
>
>   - no stream parser (needed for simple GETs)
>   - single stream from raw post body (needed for current updates
>   - multiple streams from multipart mime in post body (needed for SOLR-85)
>   - multiple streams from files specified in params (needed for SOLR-66)
>   - multiple streams from remote URL specified in params
>

I have imagined the single default parser handles *all* the cases you
just mentioned.

GET: read params from paramMap().  Check thoes params for special
params that send you to one or many remote streams.

POST: depending on headers/content type etc you parse the body as a
single stream, multi-part files or read the params.

It will take some careful design, but I think all the standard cases
can be handled by a single parser.

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Chris Hostetter <ho...@fucit.org>.
: The RequestParser is not be part of the core API - It would be a
: helper function for Servlets and Filters that call the core API.  It
: could be configured in web.xml rather then solrconfig.xml.  A
: RequestDispatcher (Servlet or Filter) would be configured with a
: single RequestParser.
:
: The RequestParser would be in charge of taking HttpRequest and determining:
:   1) The RequestHandler
:   2) The SolrRequest (Params & Streams)

This sounds fine to me ... i was going to suggest that having a public API
for RequestParser that people could extend and register intsnces of in the
solrconfig would be better then no public API at all -- but if we do that
we've let the genie out of the bottle, better to be more restrictive about
the internal API, and if/when new usecase come up we can revisit the
decision then.

If the RequestParser is going to pick the RequestHandler, we might as
stick with the current model where the RequestHandler is determined by the
"qt" SolrParam (it just wouldn't neccessarily come from the "qt" param of
the URL, since the RequestParser can decide where everything comes form
it could be from a URL param or it could be from the path) to keep the API
simple right?

    interface RequestParser {
      public SolrRequest makeSolrRequest(HttpServletRequest req);
    }

I'm curious though why you think RequestParsers should be managed in the
web.xml ... do you mean they would each be a Servlet Filter? ... if we
assume there's going to be a fixed list and they aren't easily extended,
then why not just:
  - have a HashMap of them in a single ServletFilter dispatcher,
  - lookup the one to use pased on the appropriate part of the path
  - let that RequestParser make the SolrRequest
  - continue with common code for all requests regardless of format:
    - get RequestHandler from the core by name
    - execute RequestHandler
    - get output writer by name
    - write out response

: It would not be the most 'pluggable' of plugins, but I am still having
: trouble imagining anything beyond a single default RequestParser.

what!? .. really????? ... you don't think the ones i mentioned before are
things we should support out of the box?

  - no stream parser (needed for simple GETs)
  - single stream from raw post body (needed for current updates
  - multiple streams from multipart mime in post body (needed for SOLR-85)
  - multiple streams from files specified in params (needed for SOLR-66)
  - multiple streams from remote URL specified in params

: Assuming anything doing *really* complex ways of extracting
: ContentStreams will do it in the Handler not the request parser.  For
: reference see my argument for a seperate DocumentParser interface in:
: http://www.nabble.com/Re%3A-Update-Plugins-%28was-Re%3A-Handling-disparate-data-sources-in-Solr%29-p8386161.html

aggreed ... but that can easily be added later.

: In my view, the default one could be mapped to "/*" and a custom one
: could be mapped to "/mycustomparser/*"
:
: This would drop the ':' from my proposed URL and change the scheme to look like:
: /parser/path/the/parser/knows/how/to/extract/?params

i was totally okay with the ":" syntax (although we should double check if
":" is actaully a legal unescaped URL character) .. but i'm confused by
this new suggestions ... is "parser" the name of the parser in that
example and "path/the/parser/knows/how/to/extract" data that the parser
may use to build to SolrRequest with? (ie: perhaps the RequestHandler)

would parser names be required to not have slashes in them in that case?

: > Imagine if 3 years ago, when Yonik and I were first hammering out the API
: > for SolrRequestHandlers, we had picked this...
: >
: >    public interface SolrRequestHandlers extends SolrInfoMBean {
: >      public void init(NamedList args);
: >      public void handleRequest(HttpServletRequest req, SolrQueryResponse rsp);
: >    }
:
: Thank goodness you didn't!  I'm confident you won't let me (or anyone)
: talk you into something like that!  You guys made a lot of good

the point i was trying to make is that if we make a RequestParser
interface with a "parseRequest(HttpServletRequest req)" method, it amouts
to just as much badness -- the key is we can make that interface as long
as all the implimentations are in the SOlr code base where we can keep an
eye on them, and people have to go way, WAY, *WAY* into solr to start
shanging them.




-Hoss


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Ryan McKinley <ry...@gmail.com>.
(Note: this is different then what i have suggested before.  Treat it
as brainstorming on how to take what i have suggested and mesh it with
your concerns)

What if:

The RequestParser is not be part of the core API - It would be a
helper function for Servlets and Filters that call the core API.  It
could be configured in web.xml rather then solrconfig.xml.  A
RequestDispatcher (Servlet or Filter) would be configured with a
single RequestParser.

The RequestParser would be in charge of taking HttpRequest and determining:
  1) The RequestHandler
  2) The SolrRequest (Params & Streams)

It would not be the most 'pluggable' of plugins, but I am still having
trouble imagining anything beyond a single default RequestParser.
Assuming anything doing *really* complex ways of extracting
ContentStreams will do it in the Handler not the request parser.  For
reference see my argument for a seperate DocumentParser interface in:
http://www.nabble.com/Re%3A-Update-Plugins-%28was-Re%3A-Handling-disparate-data-sources-in-Solr%29-p8386161.html

In my view, the default one could be mapped to "/*" and a custom one
could be mapped to "/mycustomparser/*"

This would drop the ':' from my proposed URL and change the scheme to look like:
/parser/path/the/parser/knows/how/to/extract/?params

This would give people a relativly easy way to implement 'restful'
URLs if they need to.  (but they would have to edit web.xml)


> : Would that be configured in solrconfig.xml as <handler name="xml"?
> : name="update/xml"?  If it is "update/xml" would it only really work if
> : the 'update' servlet were configured properly?
>
> it would only make sense to map that as "xml" ... the SolrCore (and hte
> solrconfig.xml) shouldn't have any knowledge of the Servlet/ServletFilter
> base paths because it should be possible to use the SolrCore independent
> of any ServletContainer (if for no other reason in unit tests)
>

Correct, SolrCore shoudl not care what the request path is.  That is
why I want to deprecate the execute( ) function that assumes the
handler is defined by 'qt'

Unit tests should be handled by execute( handler, req, res )

If I had my druthers, It would be:
  res = handler.execute( req )
but that is too big of leap for now :)


> ...
>
> A third use case of doing queries with POST might be that you want to use
> standard CGI form encoding/multi-part file upload semantics of HTTP to
> send an XML file (or files) to the above mentioned XmlQPRequestHandler ...
> so then we have "MultiPartMimeRequestParser" ...

I agree with all your use cases.  It just seems like a LOT of complex
overhead to extract the general aspects of translating a
URL+Params+Streams => Handler+Request(Params+Streams)

Again, since the number of 'RequestParsers' is small, it seems overly
complex to have a separate plugin to extract URL, another to extract
the Handler, and another to extract the streams.  Particulary since
the decsiions on how you parse the URL can totally affect the other
aspects.


>
> ...i really, really, REALLY don't like the idea that the RequestParser
> Impls -- classes users should be free to write on their own and plugin to
> Solr using the solrconfig.xml -- are responsible for the URL parsing and
> parameter extraction.  Maybe calling them "RequestParser" in my suggested
> design is missleading, maybe a better name like "StreamExtractor" would be
> better ... but they shouldn't be in charge of doing anything with the URL.
>

What if it were configured in web.xml, would you feel more comfortable
letting it determine how the URL is parsed and streams are extracted?

> Imagine if 3 years ago, when Yonik and I were first hammering out the API
> for SolrRequestHandlers, we had picked this...
>
>    public interface SolrRequestHandlers extends SolrInfoMBean {
>      public void init(NamedList args);
>      public void handleRequest(HttpServletRequest req, SolrQueryResponse rsp);
>    }

Thank goodness you didn't!  I'm confident you won't let me (or anyone)
talk you into something like that!  You guys made a lot of good
choices and solr is an amazing platform for it.

That said, the task at issue is: How do we convert an arbitrary
HttpServletRequest into a SolrRequest.

I am proposing we have a single interface to do this:
  SolrRequest r = RequestParser.parse( HttpServletRequest  )

You are proposing this is broken down further.  Something like:
  Handler h = (the filter) getHandler( req.getPath() )
  SolrParams = (the filter) do stuff to extract the params (using
parser.preProcess())
  ContentStreams = parser.parse( request )

While it is not great to have plugins manipulate the HttpRequest -
someone needs to do it.  In my opinion, the RequestParser's job is to
isolate *everything* *else* from the HttpServletRequest.

Again, since the number of RequestParser is small, it seems ok (to me)

>
> keeping HttpServletRequest out of the API for RequestParsers helps us
> future-proof against breaking plugins down the road.
>

I agree.  This is why i suggest the RequestParsers is not a core part
of the API, just a helper class for Servlets and Filters.


ryan

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Chris Hostetter <ho...@fucit.org>.
: 1) I think it should be a ServletFilter applied to all requests that
: will only process requests with a registered handler.

I'm not sure what "it" is in the above sentence ... I believe from the
context of the rest of the message you are referring to
using a ServletFilter instead of a Servlet -- I honestly have no opinion
about that either way.

: 2) I think the RequestParser should take care of parsing
: ContentStreams *and* SolrParams - not just the streams.  The dispatch
: servlet/filter should never call req.getParameter().

If that's the case, then the RequestParser is in control of the URL
structure ... except that it's not in control of the path info since
that's how we pick the RequestParser in the first place ... what if
we decide later that we want to change the URL structure -- then every
RequestParser would have to be changed.

: 3) I think the dispatcher picks the Handler and either calls it
: directly or passes it to SolrCore.  It does not put "qt" in the
: SolrParams and have SolrCore extract it (again)

that's perfectly fine with me - I only had it that way because that's how
RequestHandler execution currently works; I wanted to leave anything not
directly related to what I was suggesting exactly the way it currently is
in my pseudo code.

: == Arguments for a ServletFilter: ==
: If we implement the dispatcher as a Filter:
: * the URL is totally customizable from solrconfig.xml

can you explain this more ... why does a ServletFilter make the URL more
customizable than an alternative (which I believe is just a Servlet)?

: If we implement the dispatcher as a Servlet
: * we have to define a 'base' path for each servlet - this would make
: the names longer than they need to be and add potential confusion in
: the configuration.

whoa ... hold on a minute, even if we use a ServletFilter to do all of the
dispatching instead of a Servlet we still need a base path right?
... even if we ignore the current admin pages and assume we're going to
replace them all with new RequestHandlers when we do this, what happens a
year from now when we decide we want to add some new piece of
functionality that needs a different Servlet/ServletFilter ... if we've
got a Filter matching on "/*" don't we burn every possible bridge we have
for adding something else later?

: Consider the servlet 'update' and another servlet 'select'.  With our
: proposed changes, these could both be the same servlet class
: configured to distinct paths.  Now let's say you want to call:
:   http://localhost/solr/update/xml?params
: Would that be configured in solrconfig.xml as <handler name="xml"> or
: <handler name="update/xml">?  If it is "update/xml" would it only really work if
: the 'update' servlet were configured properly?

it would only make sense to map that as "xml" ... the SolrCore (and the
solrconfig.xml) shouldn't have any knowledge of the Servlet/ServletFilter
base paths because it should be possible to use the SolrCore independent
of any ServletContainer (if for no other reason in unit tests)
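
e.g. something like this in solrconfig.xml (a sketch only, following the
<handler> shorthand from your example -- the class name is hypothetical):

  <handler name="xml" class="solr.XmlUpdateRequestHandler" />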

: == Why RequestParser should handle SolrParams AND Streams  ==
: * It is a MUCH easier interface to understand then pre/process

I won't argue with you there.

: * The dispatch filter does not need to touch req.getParameter()
: * It allows for pluggable behavior to extract parameters - not just streams.
:
: Consider the current discussion on "POST for Queries"
:  (http://www.nabble.com/Using-HTTP-Post-for-Queries-tf3039973.html)
: It seems like we may need a few ways to parse params out of the
: request.  The way one handles the parameters directly affects the
: streams.  This logic should be contained in a single place.

The intent there is to use a regular CGI form encoded POST body to express
more params than the client feels safe putting in a URL.  Under the API
I was suggesting that would be solved with a "No-Op" RequestParser that
has empty preProcess and process methods.  When the Servlet (or ServletFilter)
builds the SolrParams (in between calling parser.preProcess and
parser.process) it gets *all* of the form encoded params from the
HttpServletRequest (because no code has touched the input stream).

an alternative situation in which you might want to "Query using HTTP
POST" is if you had an XmlQPRequestHandler that understood the
xml-query-parser syntax from this contrib...
  http://svn.apache.org/viewvc/lucene/java/trunk/contrib/xml-query-parser/
...which expected to read the XML from the ContentStreams of the
SolrRequest, and you wanted to put the XML in the raw POST body of the
request (the same way our current update POSTs work) but there were other
options XmlQPRequestHandler wanted to get out of the
SolrRequest's SolrParams.

that would be handled by a "RawPostRequestParser" whose process
method would be a No-Op, but the preProcess method would make a
ContentStream out of the InputStream from the HttpServletRequest -- then
the Servlet/ServletFilter would parse the URL using the
HttpServletRequest.getParameter() methods (which are now safe to call
without damaging the InputStream).

(That RawPostRequestParser would be reused along with an XmlUpdateHandler,
into which we refactor the existing <add><doc> update logic from the core, to
support the legacy /update URLs.)
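
A minimal sketch of that RawPostRequestParser, against the
preProcess/process interface sketched elsewhere in this thread (the
makeContentStream helper is hypothetical):

  public class RawPostRequestParser implements RequestParser {
    // preProcess wraps the raw POST body as the one and only ContentStream
    public Iterable<ContentStream> preProcess(NamedList info,
                                              Pointer<InputStream> body) {
      return Collections.singletonList(makeContentStream(body.get()));
    }
    // process is a No-Op: the streams from preProcess are used as-is
    public Iterable<ContentStream> process(SolrRequest request,
                                           Iterable<ContentStream> streams) {
      return streams;
    }
  }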

A third use case of doing queries with POST might be that you want to use
standard CGI form encoding/multi-part file upload semantics of HTTP to
send an XML file (or files) to the above mentioned XmlQPRequestHandler ...
so then we have "MultiPartMimeRequestParser" that has a No-Op preProcess
method, and uses the Commons FileUpload code with a
org.apache.commons.fileupload.RequestContext it builds out of the header
info passed to preProcess by the Servlet.
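
The FileUpload part of that would be roughly (a sketch only --
buildRequestContext() stands in for the glue described above):

  // split a multi-part MIME body into one stream per uploaded file
  ServletFileUpload upload = new ServletFileUpload(new DiskFileItemFactory());
  for (Object o : upload.parseRequest(buildRequestContext(headers, body))) {
    FileItem item = (FileItem) o;
    if (!item.isFormField()) {
      InputStream stream = item.getInputStream(); // becomes a ContentStream
    }
  }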

: == The Dispatcher should pick the handler  ==

: There is no reason it would need to inject 'qt' into the solr params
: just so it can be pulled out by SolrCore (using the @deprecated
: function: solrReq.getQueryType()!)

I agree ... as I said, I just left it that way in my pseudo code because I
didn't have a strong opinion about it and wanted to leave things that
worked okay now as they were, in an attempt to not confuse the point I was
trying to make -- I guess it didn't work :)

: If the dispatcher is required to put a parameter in SolrParams, we
: could not make the RequestParser in charge of filling the SolrParams.
: This would require something like your pre/process system.

well ... even if the Servlet/ServletFilter dispatches directly to the
Handler without putting a "qt" in the SolrParams ... I still prefer that
the RequestParser not be the one parsing the URL.

: == Pseudo-Java ==

I'm happily on board with all of the code you posted except this line...

:         SolrQueryRequest solrReq = parser.parse( request );

...I really, really, REALLY don't like the idea that the RequestParser
Impls -- classes users should be free to write on their own and plug in to
Solr using the solrconfig.xml -- are responsible for the URL parsing and
parameter extraction.  Maybe calling them "RequestParser" in my suggested
design is misleading, maybe a better name like "StreamExtractor" would be
better ... but they shouldn't be in charge of doing anything with the URL.

Imagine if 3 years ago, when Yonik and I were first hammering out the API
for SolrRequestHandlers, we had picked this...

   public interface SolrRequestHandlers extends SolrInfoMBean {
     public void init(NamedList args);
     public void handleRequest(HttpServletRequest req, SolrQueryResponse rsp);
   }

...and the only thing the Servlet did was format the response?  Not only
would unit tests have been a lot harder to write, but we'd be
screwed today, because we wouldn't be able to change the Servlet and
change how RequestHandlers are used without breaking any existing
RequestHandlers written by clients that might be depending on having the
full HttpServletRequest and doing crazy things with the path.

keeping HttpServletRequest out of the API for RequestParsers helps us
future-proof against breaking plugins down the road.


-Hoss


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Ryan McKinley <ry...@gmail.com>.
Ok, now I think I get what you are suggesting.  The differences are that:

1) I think it should be a ServletFilter applied to all requests that
will only process requests with a registered handler.
2) I think the RequestParser should take care of parsing
ContentStreams *and* SolrParams - not just the streams.  The dispatch
servlet/filter should never call req.getParameter().
3) I think the dispatcher picks the Handler and either calls it
directly or passes it to SolrCore.  It does not put "qt" in the
SolrParams and have SolrCore extract it (again)


== Arguments for a ServletFilter: ==
If we implement the dispatcher as a Filter:
* the URL is totally customizable from solrconfig.xml
* we have a single Filter to handle all standard requests
* with this single Filter, we can easily handle the existing URL structures
* configured URLs can sit at the 'Top level' next to 'top level' servlets

If we implement the dispatcher as a Servlet
* we have to define a 'base' path for each servlet - this would make
the names longer than they need to be and add potential confusion in
the configuration.

Consider the servlet 'update' and another servlet 'select'.  With our
proposed changes, these could both be the same servlet class
configured to distinct paths.  Now let's say you want to call:
  http://localhost/solr/update/xml?params
Would that be configured in solrconfig.xml as <handler name="xml"> or
<handler name="update/xml">?  If it is "update/xml" would it only really work if
the 'update' servlet were configured properly?


== Why RequestParser should handle SolrParams AND Streams  ==
* It is a MUCH easier interface to understand than pre/process
* The dispatch filter does not need to touch req.getParameter()
* It allows for pluggable behavior to extract parameters - not just streams.

Consider the current discussion on "POST for Queries"
 (http://www.nabble.com/Using-HTTP-Post-for-Queries-tf3039973.html)
It seems like we may need a few ways to parse params out of the
request.  The way one handles the parameters directly affects the
streams.  This logic should be contained in a single place.


== The Dispatcher should pick the handler  ==

In the proposed URL scheme: /path/to/handler:parser, the dispatcher
has to decide what handler it is.  If we use a filter, it will look
for a registered handler - if it can't find one, it will not process
the request.

There is no reason it would need to inject 'qt' into the solr params
just so it can be pulled out by SolrCore (using the @deprecated
function: solrReq.getQueryType()!)

If the dispatcher is required to put a parameter in SolrParams, we
could not make the RequestParser in charge of filling the SolrParams.
This would require something like your pre/process system.


== Pseudo-Java ==

The real version will do error handling and will need some special
logic to make '/select' behave exactly as it does now.


class SolrFilter implements Filter {
 public void doFilter(ServletRequest request, ServletResponse response,
FilterChain chain) throws IOException, ServletException
 {
    HttpServletRequest req = (HttpServletRequest) request;
    String path = req.getServletPath();
    SolrRequestHandler handler = getHandlerFromPath( path );
    if( handler != null ) {
        // the parser name comes from the same path (e.g. "/handler:parser")
        SolrRequestParser parser = getParserFromPath( path );
        SolrQueryResponse solrRes = new SolrQueryResponse();
        SolrQueryRequest solrReq = parser.parse( req );

        // core obtained in init(); dispatches directly to the chosen handler
        core.execute( handler, solrReq, solrRes );
        return;
    }
    chain.doFilter(request, response);
 }
}

Modify core to directly accept the 'handler':

class SolrCore {

  public void execute(SolrRequestHandler handler, SolrQueryRequest
req, SolrQueryResponse rsp) {

    // setup response header and handle request
    final NamedList responseHeader = new NamedList();
    rsp.add("responseHeader", responseHeader);
    handler.handleRequest(req,rsp);
    setResponseHeaderValues(responseHeader,req,rsp);

    log.info(req.getParamString()+ " 0 "+
	     (int)(rsp.getEndTime() - req.getStartTime()));
  }

  @Deprecated
  public void execute(SolrQueryRequest req, SolrQueryResponse rsp) {
    SolrRequestHandler handler = getRequestHandler(req.getQueryType());
    if (handler==null) {
      log.warning("Unknown Request Handler '" + req.getQueryType() + "'");
    }
    this.execute( handler, req, rsp );
  }
}


ryan

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Chris Hostetter <ho...@fucit.org>.
: > Ah ... this is the one problem with high volume on an involved thread ...
: > i'm sending replies to messages you write after you've already read other
: > replies to other messages you sent and changed your mind :)

: Should we start a new thread?

I don't think it would make a difference ... we just need to slow down :)

: Ok, now (I think) I see the difference between our ideas.
:
: From your code, it looks like you want the RequestParser to extract
: 'qt' that defines the RequestHandler.  In my proposal, the
: RequestHandler is selected independent of the RequestParser.

no, no, no ... I'm sorry if I gave that impression ... the RequestParser
*only* worries about getting streams; it shouldn't have any way of even
*guessing* what RequestHandler is going to be used.

for reference: http://www.nabble.com/Re%3A-p8438292.html

note that I never mention "qt" ... instead I refer to
"core.execute(solrReq, solrRsp);" doing exactly what it does today ...
core.execute will call getRequestHandler(solrReq.getQueryType()) to pick
the RequestHandler to use.

the Servlet is what creates the SolrRequest object, and puts whatever
SolrParams it wants (including "qt") in that SolrRequest before asking the
SolrCore to take care of it.

: What do you imagine happens in:
: >
: >     String p = pickRequestParser(req);

let's use the URL syntax you've been talking about that people seem to
have agreed looks good (assuming I understand correctly) ...

   /servlet/${requesthandler}:${requestparser}?param1=val1&param2=val2

what I was suggesting was that the servlet which uses that URL
structure might have a utility method called pickRequestParser that would look like...

  private String pickRequestParser(HttpServletRequest req) {
    // path looks like "/handler:parser"; the part after the colon names the parser
    String[] pathParts = req.getPathInfo().split(":");
    if (pathParts.length < 2 || "".equals(pathParts[1]))
      return "default"; // or "standard", or null -- whatever
    return pathParts[1];
  }


: If the RequestHandler is defined by the RequestParser,  I would
: suggest something like:

again, I can't emphasize enough that that's not what I was proposing ... I
am in no way, shape, or form trying to talk you out of the idea that it
should be possible to specify the RequestParser, the RequestHandler, and
the OutputWriter all as part of the URL, and completely independent of
each other.

the RequestHandler and the OutputWriter could be specified as regular
SolrParams that come from any part of the HTTP request, but the
RequestParser needs to come from some part of the URL that can be
inspected without any risk of affecting the raw POST stream (ie: no
HttpServletRequest.getParameter() calls)

: I still don't see why:
:
: >
: >     // let the parser preprocess the streams if it wants...
: >     Iterable<ContentStream> s = solrParser.preProcess
: >       (getStreamInfo(req), new Pointer<InputStream>() {
: >         public InputStream get() {
: >           return req.getInputStream();
: >         }});
: >
: >     SolrParams params = makeSolrRequestParams(req);
: >
: >     // ServletSolrRequest is a basic impl of SolrRequest
: >     SolrRequest solrReq = new ServletSolrRequest(params, s);
: >
: >     // let the parser decide what to do with the existing streams,
: >     // or provide new ones
: >     solrReq.setContentStreams(solrParser.process(solrReq, s));
: >
:
: can not be contained entirely in:
:
:   SolrRequest solrReq = parser.parse( req );

because then the RequestParser would be defining how the URL is getting
parsed -- the makeSolrRequestParams utility placeholder I described had the
wrong name, I should have called it makeSolrParams ... it would look
something like this in the URL syntax I described above...

  private SolrParams makeSolrParams(HttpServletRequest req) {
    // ServletSolrParams is already in our code base, used as is
    SolrParams p = new ServletSolrParams(req);
    // the path part before the colon (if any) names the request handler
    String[] pathParts = req.getPathInfo().split(":");
    if ("".equals(pathParts[0]))
      return p;
    Map tmp = new HashMap();
    tmp.put("qt", pathParts[0]);
    return new DefaultSolrParams(new MapSolrParams(tmp), p);
  }



the nutshell version of everything I'm trying to say is...

 SolrRequest
   - models all info about a request to Solr to do something:
     - the key=val params associated with that request
     - any streams of data associated with that request
 RequestParser(s)
   - different instances for different sources of streams
   - is given two chances to generate ContentStreams:
     - once using the raw stream from the HTTP request
     - once using the params for the SolrRequest
 SolrServlet
   - the only thing with direct access to the HttpServletRequest; shields
     the other interface APIs from the mechanics of HTTP
   - dictates the URL structure
     - determines the name of the RequestParser to use
     - lets parser have the raw input stream
     - determines where SolrParams for request come from
     - lets parser have params to make more streams if it wants to.
 SolrCore
   - does all of the name lookups for processing a SolrRequest:
     - picks a RequestHandler to use based on params, and runs it
     - determines what RequestParser to use when asked for one by name
     - determines what OutputWriter to use when asked for one by name
 RequestHandler(s)
   - different instances for different logic to be executed
   - has access to the full SolrRequest (params and streams)
   - may run whatever code it wants, and put results in a SolrResponse
 SolrResponse
   - container for deep structures of data, with optional error info
 OutputWriter(s)
   - different instances for different output formats
   - has access to the full SolrRequest
   - is expected to render all data in SolrResponse

-Hoss


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Ryan McKinley <ry...@gmail.com>.
>
> : I was...  then you talked me out of it!  You are correct, the client
> : should determine the RequestParser independent of the RequestHandler.
>
> Ah ... this is the one problem with high volume on an involved thread ...
> i'm sending replies to messages you write after you've already read other
> replies to other messages you sent and changed your mind :)
>

Should we start a new thread?


>
> Here's a more fleshed out version of the pseudo-Java I posted earlier,
> with all of my addendums inlined and a few simple method calls changed to
> try and make the purpose more clear...
>

Ok, now (I think) I see the difference between our ideas.

From your code, it looks like you want the RequestParser to extract
'qt' that defines the RequestHandler.  In my proposal, the
RequestHandler is selected independent of the RequestParser.

What do you imagine happens in:
>
>     String p = pickRequestParser(req);
>

This looks like you would have to have a standard way (per servlet) of
getting the RequestParser.  How do you envision that?  What would be
the standard way to choose your request parser?


If the RequestHandler is defined by the RequestParser,  I would
suggest something like:

interface SolrRequest
{
  RequestHandler getHandler();
  Iterable<ContentStream> getContentStreams();
  SolrParams getParams();
}

interface RequestParser
{
  SolrRequest getRequest( HttpServletRequest req );

  // perhaps remove getHandler() from SolrRequest and add:
  RequestHandler getHandler();
}

And then configure a servlet or filter with the RequestParser

 <filter>
    <filter-name>SolrRequestFilter</filter-name>
    <filter-class>...</filter-class>
    <init-param>
      <param-name>RequestParser</param-name>
      <param-value>org.apache.solr.parser.StandardRequestParser</param-value>
    </init-param>
</filter>

Given that the number of RequestParsers is realistically small (as
Yonik mentioned), I think this could be a good solution.

To update my current proposal:
1. Servlet/Filter defines the RequestParser
2. RequestParser parses handler & request from HttpServletRequest
3. handled essentially as before

To update the example URLs, defined by the "StandardRequestParser"
  /path/to/handler/?param
where /path/to/handler is the "name" defined in solrconfig.xml

To use a different RequestParser, it would need to be configured in web.xml
  /customparser/whatever/path/i/like
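
So a handler registered like this (a sketch; the class name is
hypothetical):

  <handler name="/update/csv" class="solr.CSVRequestHandler" />

would answer at /update/csv?params, with the parser fixed by whatever
the Filter was configured with in web.xml.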


- - - - - - - - - - - - - -

I still don't see why:

>
>     // let the parser preprocess the streams if it wants...
>     Iterable<ContentStream> s = solrParser.preProcess
>       (getStreamInfo(req), new Pointer<InputStream>() {
>         public InputStream get() {
>           return req.getInputStream();
>         }});
>
>     SolrParams params = makeSolrRequestParams(req);
>
>     // ServletSolrRequest is a basic impl of SolrRequest
>     SolrRequest solrReq = new ServletSolrRequest(params, s);
>
>     // let the parser decide what to do with the existing streams,
>     // or provide new ones
>     solrReq.setContentStreams(solrParser.process(solrReq, s));
>

can not be contained entirely in:

  SolrRequest solrReq = parser.parse( req );

assuming the SolrRequest interface includes

  Iterable<ContentStream> getContentStreams();

the parser can use req.getInputStream() however it likes - either
to make params and/or to build ContentStreams

- - - - - - - -

good good
ryan

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Chris Hostetter <ho...@fucit.org>.
: I was...  then you talked me out of it!  You are correct, the client
: should determine the RequestParser independent of the RequestHandler.

Ah ... this is the one problem with high volume on an involved thread ...
I'm sending replies to messages you write after you've already read other
replies to other messages you sent and changed your mind :)

: Are you suggesting there would be multiple servlets each with a
: different methods to get the SolrParams from the url?  How does the
: servlet know if it can touch req.getParameter()?

I'm suggesting that there *could* be multiple Servlets with multiple URL
structures ... my worry is not that we need multiple options now, it's
that I don't want to come up with an API for writing plugins that then
has to be thrown out down the road if we want/need to change the URL.
: How would the default servlet fill up SolrParams?

prior to calling RequestParser.preProcess, it would only access very
limited parts of the HttpServletRequest -- the bare minimum it needs to
pick a RequestParser ... probably just the path, maybe the HTTP headers --
but if we had a URL structure where we really wanted to specify the
RequestParser in a URL param it could do it using getQueryString

*after* calling RequestParser.preProcess the Servlet can access any part
of the HttpServletRequest (because if the RequestParser wanted to use the
raw POST InputStream it would have, and if it didn't then it's fair game
to let HttpServletRequest pull data out of it when the Servlet calls
HttpServletRequest.getParameterMap() -- or any of the other
HttpServletRequest methods) to build up the SolrParams however it wants
based on the URL structure it wants to use ... then RequestParser.process
can use those SolrParams to get any other streams it may want and add them
to the SolrRequest.

Here's a more fleshed out version of the pseudo-Java I posted earlier,
with all of my addendums inlined and a few simple method calls changed to
try and make the purpose more clear...



// Simple interface for having a lazy reference to something
interface Pointer<T> {
  T get();
}

interface RequestParser {
  public void init(NamedList nl); // the usual

  /** will be passed the raw input stream from the
   * HttpServletRequest, ... as well as whatever other HttpServletRequest
   * header info we decide it's important for the RequestParser to know
   * about the stream, and is safe for Servlets to access and make
   * available to the RequestParser (ie: HTTP method, content-type,
   * content-length, etc...)
   *
   * I'm using a NamedList instance instead of passing the
   * HttpServletRequest to maintain a good abstraction -- only the Servlet
   * knows about HTTP, so if we ever want to write an RMI interface to Solr,
   * the same RequestParser plugins will still work ... in practice it
   * might be better to explicitly spell out every piece of info about
   * the stream we want to pass
   *
   * This is the method where a RequestParser which is going to use the
   * raw POST body to build up either a single stream, or several streams
   * from a multi-part request has the info it needs to do so.
   */
  public Iterable<ContentStream> preProcess(NamedList streamInfo,
                                            Pointer<InputStream> s);

  /** guaranteed that the second arg will be the result from
   * a previous call to preProcess, and that that Iterable from
   * preProcess will not have been inspected or touched in anyway, nor
   * will any references to it be maintained after this call.
   *
   * this is the method where a RequestParser which is going to use
   * request params to open streams from local files, or remote URLs
   * can do so -- a particularly ambitious RequestParser could use
   * both the raw POST data *and* remote files specified in params
   * because it has the choice of what to do with the
   * Iterable<ContentStream> it returned from the earlier preProcess call.
   */
  public Iterable<ContentStream> process(SolrRequest request,
                                         Iterable<ContentStream> i);

}


class SolrUberServlet extends HttpServlet {

  // servlet specific method which does minimal inspection of
  // req to determine the parser name based on the URL
  private String pickRequestParser(HttpServletRequest req) { ... }

  // extracts just the most crucial info about the HTTP Stream from the
  // HttpServletRequest, so it can be passed to RequestParser.preProcess
  // must be careful not to use anything that might access the stream.
  private NamedList getStreamInfo(HttpServletRequest req) { ... }

  // builds the SolrParams for the request using servlet specific URL rules,
  // this method is free to use anything in the HttpServletRequest
  // because it won't be called until after preProcess
  private SolrParams makeSolrRequestParams(HttpServletRequest req) { ... }

  public void service(HttpServletRequest req, HttpServletResponse response) {
    SolrCore core = getCore();
    Solr(Query)Response solrRsp = new Solr(Query)Response();

    String p = pickRequestParser(req);

    // looks up a registered instance (from solrconfig.xml)
    // matching that name, similar to core.getQueryResponseWriter
    RequestParser solrParser = core.getParserByName(p);

    // let the parser preprocess the streams if it wants...
    Iterable<ContentStream> s = solrParser.preProcess
      (getStreamInfo(req), new Pointer<InputStream>() {
        public InputStream get() {
          return req.getInputStream();
        }});

    SolrParams params = makeSolrRequestParams(req);

    // ServletSolrRequest is a basic impl of SolrRequest
    SolrRequest solrReq = new ServletSolrRequest(params, s);

    // let the parser decide what to do with the existing streams,
    // or provide new ones
    solrReq.setContentStreams(solrParser.process(solrReq, s));

    // does exactly what it does now: picks the RequestHandler to
    // use based on the params in the solrReq, and calls its
    // handleRequest method
    core.execute(solrReq, solrRsp);

    // the rest of this is cut/paste from the current SolrServlet.
    // use SolrParams to pick OutputWriter name, ask core for instance,
    // have that writer write the results.
    QueryResponseWriter responseWriter = core.getQueryResponseWriter(solrReq);
    response.setContentType(responseWriter.getContentType(solrReq, solrRsp));
    PrintWriter out = response.getWriter();
    responseWriter.write(out, solrReq, solrRsp);

  }
}






-Hoss


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Yonik Seeley <yo...@apache.org>.
On 1/18/07, Ryan McKinley <ry...@gmail.com> wrote:
> Yes, this proposal would fix the URL structure to be
> /path/defined/in/solrconfig:parser?params
> /${handler}:${parser}
>
> I *think* this handles most cases cleanly and simply.  The
> only exception is where you want to extract variables from the URL
> path.

But that's not a hypothetical case, extracting variables from the URL
path is something I need now (to add metadata about the data in the
raw post body, like the CSV separator).

POST to http://localhost:8983/solr/csv?separator=,&fields=foo,bar,baz
with a body of "10,20,30"
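
(For illustration, that request in plain Java -- just a sketch, nothing
Solr-specific; imports from java.net are assumed:)

  URL url = new URL(
      "http://localhost:8983/solr/csv?separator=,&fields=foo,bar,baz");
  HttpURLConnection con = (HttpURLConnection) url.openConnection();
  con.setRequestMethod("POST");
  con.setDoOutput(true);
  con.getOutputStream().write("10,20,30".getBytes("UTF-8"));
  System.out.println(con.getResponseCode()); // sends the request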

> There are plenty of ways to rewrite RESTful URLs into a
> path+params structure.  If someone absolutely needs RESTful URLs, it
> can easily be implemented with a new Filter/Servlet that picks the
> 'handler' and directly creates a SolrRequest from the URL path.

While being able to customize something is good, having really good
defaults is better IMO :-)  We should also be focused on exactly what
we want our standard update URLs to look like in parallel with the
design of how to support them.

As a side note, with a change of URLs, we get a "free" chance to
change whatever we want about the parameters or response format...
backward compatibility only applies to the original URLs IMO.

-Yonik

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Ryan McKinley <ry...@gmail.com>.
>
> I'm confused by your sentence "A RequestParser converts a
> HttpServletRequest to a SolrRequest." .. i thought you were advocating
> that the servlet parse the URL to pick a RequestHandler, and then the
> RequestHandler dictates the RequestParser?
>

I was...  then you talked me out of it!  You are correct, the client
should determine the RequestParser independent of the RequestHandler.


> : /path/registered/in/solr/config:requestparser?params
> :
> : If no ':' is in the URL, use 'standard' parser
> :
> : 1. The URL path determines the RequestHandler
> : 2. The URL path determines the RequestParser
> : 3. SolrRequest = RequestParser.parse( HttpServletRequest )
> : 4. handler.handleRequest( req, res );
> : 5. write the response
>
> do you mean the path before the colon determines the RequestHandler and the
> path after the colon determines the RequestParser?

yes, that is my proposal.

> fine too ... I was specifically trying to avoid making any design
> decisions that required a particular URL structure; in what you propose
> we are dictating more than just the "/handler/path:parser" piece of the
> URL, we are also dictating that the Parser decides how the rest of the path
> and all URL query string data will be interpreted ...

Yes, this proposal would fix the URL structure to be
/path/defined/in/solrconfig:parser?params
/${handler}:${parser}

I *think* this handles most cases cleanly and simply.  The
only exception is where you want to extract variables from the URL
path.  There are plenty of ways to rewrite RESTful URLs into a
path+params structure.  If someone absolutely needs RESTful URLs, it
can easily be implemented with a new Filter/Servlet that picks the
'handler' and directly creates a SolrRequest from the URL path.  In my
opinion, for this level of customization it is reasonable that people
edit web.xml and put in their own servlets and filters.

>
> what i'm proposing is that the Servlet decide how to get the SolrParams
> out of an HttpServletRequest, using whatever URL that servlet wants;

I guess I'm not understanding this yet:

Are you suggesting there would be multiple servlets, each with a
different method to get the SolrParams from the URL?  How does the
servlet know if it can touch req.getParameter()?

How would the default servlet fill up SolrParams?


>
> I think I'm getting confused ... I thought you were advocating that
> RequestParsers be implemented as ServletFilters (or Servlets) ...

Originally I was... but again, you talked me out of it.  (this time
not totally)  I think the /path:parser format is clear and allows for
most everything off the shelf.  If you want to do something different,
that can easily be a custom filter (or servlet)

Essentially, I think it is reasonable for people to skip
'RequestParsers' in a custom servlet and be able to build the
SolrRequest directly.  This level of customization is reasonable to
handle directly with web.xml

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Chris Hostetter <ho...@fucit.org>.
: However, I'm not yet convinced the benefits are worth the costs.  If
: the number of RequestParsers remain small, and within the scope of
: being included in the core, that functionality could just be included
: in a single non-pluggable RequestParser.
:
: I'm not convinced is a bad idea either, but I'd like to hear about
: usecases for new RequestParsers (new ways of generically getting an
: input stream)?

I don't really see it being a very high cost ... and even if we can't
imagine any other potential user-written RequestParser, we already know of
at least 4 use cases we want to support out of the box for getting
streams:

 1) raw post body (as a single stream)
 2) multi-part post body (file upload, potentially several streams)
 3) local file(s) specified by path (1 or more streams)
 4) remote resource(s) specified by URL(s) (1 or more streams)

...we could put all that logic in a single class that looks at a
SolrParam to pick what method to use, or we could extract each one into
its own class using a common interface ... either way we can hardcode the
list of viable options if we want to avoid the issue of letting the client
configure them ... but I still think it's worth the effort to talk about
what that common interface might be.

I think my idea of having both a preProcess and a process method in
RequestParser so it can do things before and after the Servlet has
extracted SolrParams from the URL would work in all of the cases we've
thought of.



-Hoss


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Yonik Seeley <yo...@apache.org>.
OK, trying to catch up on this huge thread... I think I see why it's
become more complicated than I originally envisioned.

What I originally thought:
1) add a way to get a Reader or InputStream from SolrQueryRequest, and
then reuse it for updates too
2) use the plugin name in the URL
3) write code that could handle multi-part post, or could grab args
from the URL.
4) profit!

I think the main additional complexity is the idea that RequestParser
(#3) be both pluggable and able to be specified in the actual request.
 I hadn't considered that, and it's an interesting idea.

Without pluggable RequestParser:
 - something like the CSV loader would have to check the params for a
"file" param and, if so, open the local file itself

With a pluggable RequestParser:
 - the LocalFileRequestParser would be specified in the URL (like
/update/csv:local) and it would handle looking for the "file" param and
opening the file (see the sketch after this list).  The CSV plugin can
be a little simpler by just getting a Reader.
 - a new way of getting a stream could be developed (a new
RequestParser) and most stream oriented plugins could just use it.
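
A sketch of that LocalFileRequestParser, assuming the preProcess/process
interface discussed elsewhere in this thread (FileContentStream is a
hypothetical wrapper):

  public class LocalFileRequestParser implements RequestParser {
    // No-Op: this parser ignores the raw POST body entirely
    public Iterable<ContentStream> preProcess(NamedList info,
                                              Pointer<InputStream> body) {
      return null;
    }
    // open the local file(s) named by the "file" param as ContentStreams
    public Iterable<ContentStream> process(SolrRequest request,
                                           Iterable<ContentStream> prev) {
      List<ContentStream> streams = new ArrayList<ContentStream>();
      for (String name : request.getParams().getParams("file")) {
        streams.add(new FileContentStream(new File(name)));
      }
      return streams;
    }
  }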

However, I'm not yet convinced the benefits are worth the costs.  If
the number of RequestParsers remain small, and within the scope of
being included in the core, that functionality could just be included
in a single non-pluggable RequestParser.

I'm not convinced it's a bad idea either, but I'd like to hear about
use cases for new RequestParsers (new ways of generically getting an
input stream).

-Yonik

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Chris Hostetter <ho...@fucit.org>.
: I think the confusion is that (in my view) the RequestParser is the
: *only* object able to touch the stream.  I don't think anything should
: happen between preProcess() and process();  A RequestParser converts a
: HttpServletRequest to a SolrRequest.  Nothing else will touch the
: servlet request.

that makes it the RequestParser's responsibility to dictate the URL format
(if it's the only one that can touch the HttpServletRequest).  I was
proposing a method by which the Servlet could determine the URL format --
there could in fact be multiple servlets supporting different URL formats
if we had some need for it -- and the RequestParser could generate streams
based on the raw POST data and/or any streams it wants to find based on
the SolrParams generated from the URL (ie: local files, remote resources,
etc).

I'm confused by your sentence "A RequestParser converts a
HttpServletRequest to a SolrRequest." ... I thought you were advocating
that the servlet parse the URL to pick a RequestHandler, and then the
RequestHandler dictates the RequestParser?

: /path/registered/in/solr/config:requestparser?params
:
: If no ':' is in the URL, use 'standard' parser
:
: 1. The URL path determines the RequestHandler
: 2. The URL path determines the RequestParser
: 3. SolrRequest = RequestParser.parse( HttpServletRequest )
: 4. handler.handleRequest( req, res );
: 5. write the response

do you mean the path before the colon determines the RequestHandler and the
path after the colon determines the RequestParser? ... that would work
fine too ... I was specifically trying to avoid making any design
decisions that required a particular URL structure.  In what you propose
we are dictating more than just the "/handler/path:parser" piece of the
URL, we are also dictating that the Parser decides how the rest of the path
and all URL query string data will be interpreted -- which means if we
have a PostBodyRequestParser and a LocalFileRequestParser and a
RemoteUrlRequestParser which all use the query string params to get
the SolrParams for the request (and in the case of the last two: to know
what file/URL to parse) and then we decide that we want to support a URL
structure that is more REST-like and uses the path for including
information, now we have to write a new version of all of those
RequestParsers (a subclass of each probably) that knows what our new URL
structure looks like ... even if that never comes up, every RequestParser
(even custom ones written by users to use some crazy proprietary binary
protocols we've never heard of to fetch streams of data) has to worry about
extracting the SolrParams out of the URL.

what I'm proposing is that the Servlet decide how to get the SolrParams
out of an HttpServletRequest, using whatever URL that servlet wants; the
RequestParser decides how to get the ContentStreams needed for that
request -- in a way that can work regardless of whether the stream is
actually part of the HttpServletRequest, or just referenced by a param in
the request; the RequestHandler decides what to do with those params
and streams; and the ResponseWriter decides how to format the results
produced by the RequestHandler back to the client.

: > : If anyone needs to customize this chain of events, they could easily
: > : write their own Servlet/Filter

: I don't *think* this would happen often, and the people would only do
: it if they are unhappy with the default URL structure -> behavior
: mapping.  I am not suggesting this would be the normal way to
: configure solr.

I think I'm getting confused ... I thought you were advocating that
RequestParsers be implemented as ServletFilters (or Servlets) ... but if
that were the case it wouldn't just be about changing the URL structure, it
would be about picking new ways to get streams ... but that doesn't seem to
be what you are suggesting, so I'm not sure what I was misunderstanding.



-Hoss


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Ryan McKinley <ry...@gmail.com>.
On 1/17/07, Chris Hostetter <ho...@fucit.org> wrote:
>
> : I'm not sure I understand preProcess( ) and what it gets us.
>
> it gets us the ability for a RequestParser to be able to pull out the raw
> InputStream from the HTTP POST body, and make it available to the
> RequestHandler as a ContentStream, and/or it can wait until the servlet
> has parsed the URL to get the params and *then* it can generate
> ContentStreams based on those param values.
>
>  - preProcess is necessary to write a RequestParser that can handle the
>    current POST raw XML model,
>  - process is necessary to write RequestParsers that can get file names
>    or URLs out of escaped query params and fetch them as streams
>

I think the confusion is that (in my view) the RequestParser is the
*only* object able to touch the stream.  I don't think anything should
happen between preProcess() and process();  A RequestParser converts a
HttpServletRequest to a SolrRequest.  Nothing else will touch the
servlet request.


> : 1. The URL path selects the RequestHandler
> : 2. RequestParser = RequestHandler.getRequestParser()  (typically from
> : its default params)
> : 3. SolrRequest = RequestParser.parse( HttpServletRequest )
> : 4. handler.handleRequest( req, res );
> : 5. write the response
>
> the problem I see with that is that the RequestHandler shouldn't have any
> say in what RequestParser is used -- ...
>

Got it.  Then I vote we use a syntax like:

/path/registered/in/solr/config:requestparser?params

If no ':' is in the URL, use 'standard' parser

1. The URL path determines the RequestHandler
2. The URL path determines the RequestParser
3. SolrRequest = RequestParser.parse( HttpServletRequest )
4. handler.handleRequest( req, res );
5. write the response


> : If anyone needs to customize this chain of events, they could easily
> : write their own Servlet/Filter
>
> this is why I was confused about your Filter comment earlier: if the only
> way a user can customize behavior is by writing a Servlet, they can't
> specify that servlet in a solr config file -- they'd have to unpack the
> war and manually edit the web.xml ... which makes upgrading a pain.
>

I don't *think* this would happen often, and people would only do
it if they are unhappy with the default URL structure -> behavior
mapping.  I am not suggesting this would be the normal way to
configure Solr.

The main case where I imagine someone would need to write their own
servlet/filter is if they insist the parameters need to be in the URL.
 For example:

  /delete/id/

The URL structure I am proposing could not support this (unless you
had a handler mapped to each id :)

ryan

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Chris Hostetter <ho...@fucit.org>.
: I'm not sure I understand preProcess( ) and what it gets us.

it gets us the ability for a RequestParser to be able to pull out the raw
InputStream from the HTTP POST body, and make it available to the
RequestHandler as a ContentStream, and/or it can wait until the servlet
has parsed the URL to get the params and *then* it can generate
ContentStreams based on those param values.

 - preProcess is necessary to write a RequestParser that can handle the
   current POST raw XML model,
 - process is necessary to write RequestParsers that can get file names
   or URLs out of escaped query params and fetch them as streams

: 1. The URL path selects the RequestHandler
: 2. RequestParser = RequestHandler.getRequestParser()  (typically from
: its default params)
: 3. SolrRequest = RequestParser.parse( HttpServletRequest )
: 4. handler.handleRequest( req, res );
: 5. write the response

the problem I see with that is that the RequestHandler shouldn't have any
say in what RequestParser is used -- the client is the only one that knows
what type of data they are sending to Solr, so they should put information in
the URL that directly picks the RequestParser.

If you think about it in terms of the current POSTing XML model, an
XmlUpdateRequestHandler that reads in our "<add><doc>..." style info
shouldn't know anywhere in its configuration where that stream of XML
bytes came from -- when it gets asked to handle the request, all it should
know is that it has some optional params, and an InputStream to work with
... the RequestParser's job is to decide where that input stream came from.
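
In sketch form (getContentStreams() is assumed from the SolrRequest
interface floated in this thread; processAddDoc is a hypothetical helper):

  public class XmlUpdateRequestHandler implements SolrRequestHandler {
    public void handleRequest(SolrQueryRequest req, SolrQueryResponse rsp) {
      // the handler sees params and streams only -- it never knows whether
      // the XML came from a POST body, a local file, or a remote URL
      for (ContentStream stream : req.getContentStreams()) {
        processAddDoc(stream, rsp);
      }
    }
  }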

: If anyone needs to customize this chain of events, they could easily
: write their own Servlet/Filter

this is why I was confused about your Filter comment earlier: if the only
way a user can customize behavior is by writing a Servlet, they can't
specify that servlet in a solr config file -- they'd have to unpack the
war and manually edit the web.xml ... which makes upgrading a pain.



-Hoss


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Ryan McKinley <ry...@gmail.com>.
I'm not sure I understand preProcess( ) and what it gets us.

I like the model where:

1. The URL path selects the RequestHandler
2. RequestParser = RequestHandler.getRequestParser()  (typically from
its default params)
3. SolrRequest = RequestParser.parse( HttpServletRequest )
4. handler.handleRequest( req, res );
5. write the response

If anyone needs to customize this chain of events, they could easily
write their own Servlet/Filter


On 1/17/07, Chris Hostetter <ho...@fucit.org> wrote:
>
> Actually, I have to amend that ... it occurred to me in my sleep last night
> that calling HttpServletRequest.getInputStream() wasn't safe unless we
> *know* the RequestParser wants it, and will close it if it's non-null, so
> the API for preProcess would need to look more like this...
>
>      interface Pointer<T> {
>        T get();
>      }
>      interface RequestParser {
>        ...
>        /** this will be passed a "Pointer" to the raw input stream from the
>         * HttpServletRequest, ... if this method accesses the InputStream
>         * from the pointer, it is required to close it if it is non-null.
>         */
>        public Iterable<ContentStream> preProcess(SolrParams headers,
>                                                  Pointer<InputStream> s);
>        ...
>      }
>
>
>
> -Hoss
>
>

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Chris Hostetter <ho...@fucit.org>.
Actually, I have to amend that ... it occurred to me in my sleep last night
that calling HttpServletRequest.getInputStream() wasn't safe unless we
*know* the RequestParser wants it, and will close it if it's non-null, so
the API for preProcess would need to look more like this...

     interface Pointer<T> {
       T get();
     }
     interface RequestParser {
       ...
       /** this will be passed a "Pointer" to the raw input stream from the
        * HttpServletRequest, ... if this method accesses the InputStream
        * from the pointer, it is required to close it if it is non-null.
        */
       public Iterable<ContentStream> preProcess(SolrParams headers,
                                                 Pointer<InputStream> s);
       ...
     }



-Hoss


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Chris Hostetter <ho...@fucit.org>.
talking about the URL structure made me realize that the Servlet should
dictate the URL structure and the param parsing, but it should do it after
giving the RequestParser a crack at any streams it wants (actually I think
that may be a direct quote from JJ ... can't remember now) ... *BUT* the
RequestParser may not want to provide a list of streams until the params
have been parsed (if, for example, one of the params is the name of a file)

so what if the interface for RequestParser looked like this...

  interface RequestParser {
    public void init(NamedList nl); // the usual
    /** will be passed the raw input stream from the
     * HttpServletRequest, ... may need other HttpServletRequest info as
     * SolrParams (ie: method, content-type/content-length, ...) but we use
     * a SolrParams instance instead of the HttpServletRequest to
     * maintain an abstraction.
     */
    public Iterable<ContentStream> preProcess(SolrParams headers,
                                              InputStream s);
    /** guaranteed that the second arg will be the result from
     * a previous call to preProcess, and that that Iterable from
     * preProcess will not have been inspected or touched in any way, nor
     * will any references to it be maintained after this call.
     * this method is responsible for calling
     * request.setContentStreams(Iterable<ContentStream>) as it sees fit
     */
    public void process(SolrRequest request, Iterable<ContentStream> i);

  }

...the idea being that many RequestParsers will choose to implement one or
both of those methods as a no-op that just returns null, but if they want
to implement both, they have the choice of obliterating the Iterable
returned by preProcess and completely replacing it once they see the
SolrParams in the request....

: specifically what i had in mind was something like this...
:
:   class SolrUberServlet extends HttpServlet {
:     public void service(HttpServletRequest req, HttpServletResponse response) {
:       SolrCore core = getCore();
:       Solr(Query)Response solrRsp = new Solr(Query)Response();
:
:       // servlet specific method which does minimal inspection of
:       // req to determine the parser name
:       String p = pickRequestParser(req);
:
:       // looks up a registered instance (from solrconfig.xml)
:       // matching that name
:       RequestParser solrParser = core.getParserByName(p);
:

        // let the parser preprocess the streams if it wants...
        Iterable<ContentStream> s = solrParser.preProcess(req.getInputStream());

        // build the request using servlet specific URL rules
        Solr(Query)Request solrReq = makeSolrRequest(req);

        // let the parser decide what to do with the existing streams,
        // or provide new ones
        solrParser.process(solrReq, s);

:       // does exactly what it does now: picks the RequestHandler to
:       // use based on the params, calls it's handleRequest method
:       core.execute(solrReq, solrRsp)
:
:       // the rest of this is cut/paste from the current SolrServlet.
:       // use SolrParams to pick OutputWriter name, ask core for instance,
:       // have that writer write the results.
:       QueryResponseWriter responseWriter = core.getQueryResponseWriter(solrReq);
:       response.setContentType(responseWriter.getContentType(solrReq, solrRsp));
:       PrintWriter out = response.getWriter();
:       responseWriter.write(out, solrReq, solrRsp);
:
:     }
:   }
:
:
: -Hoss
:



-Hoss


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Chris Hostetter <ho...@fucit.org>.
: > So to understand better:
: >
: > user request -> micro-plugin -> RequestHandler -> ResponseHandler

: or:
:
: HttpServletRequest -> SolrRequestParser -> SolrRequestProcessor ->
: SolrResponse -> SolrResponseWriter


specifically what I had in mind was something like this...

  class SolrUberServlet extends HttpServlet {
    public void service(HttpServletRequest req, HttpServletResponse response) {
      SolrCore core = getCore();
      Solr(Query)Response solrRsp = new Solr(Query)Response();

      // servlet specific method which does minimal inspection of
      // req to determine the parser name
      String p = pickRequestParser(req);

      // looks up a registered instance (from solrconfig.xml)
      // matching that name
      RequestParser solrParser = core.getParserByName(p);

      // RequestParser is the only plugin class that knows about
      // HttpServletRequest, it builds up the SolrRequest (aka
      // SolrQueryRequest) which contains the SolrParams and streams
      SolrRequest solrReq = solrParser.parse(req);

      // does exactly what it does now: picks the RequestHandler to
      // use based on the params, calls its handleRequest method
      core.execute(solrReq, solrRsp)

      // the rest of this is cut/paste from the current SolrServlet.
      // use SolrParams to pick OutputWriter name, ask core for instance,
      // have that writer write the results.
      QueryResponseWriter responseWriter = core.getQueryResponseWriter(solrReq);
      response.setContentType(responseWriter.getContentType(solrReq, solrRsp));
      PrintWriter out = response.getWriter();
      responseWriter.write(out, solrReq, solrRsp);

    }
  }


-Hoss


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Ryan McKinley <ry...@gmail.com>.
>
> So to understand better:
>
> user request -> micro-plugin -> RequestHandler -> ResponseHandler
>
> Right?
>

or:

HttpServletRequest -> SolrRequestParser -> SolrRequestProcessor ->
SolrResponse -> SolrResponseWriter

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Thorsten Scherler <th...@juntadeandalucia.es>.
On Mon, 2007-01-15 at 12:23 -0800, Chris Hostetter wrote:
> : > Right, you're getting at issues of why I haven't committed my CSV handler yet.
> : > It currently handles reading a local file (this is more like an SQL
> : > update handler... only a reference to the data is passed).  But I also
> : > wanted to be able to handle a POST of the data, or even a file
> : > upload from a browser.  Then I realized that this should be generic...
> : > the same should also apply to XML updates, and potential future update
> : > formats like JSON.
> :
> : I do not see the problem here. One just needs to add a couple of lines in
> : the upload servlet and change the CSV plugin to use an input stream (not a local
> : file).
> 
> what Yonik and I are worried about is that we don't want the list of all
> possible ways for an Update Plugin to get a Stream to be hardcoded in the
> UpdateServlet or Solr Core or in the Plugins themselves ... we'd like the
> notion of indexing docs expressed as CSV records or XML records or JSON
> records to be independent of where the CSV, XML, or JSON data stream came
> from ... in the same way that the current RequestHandlers can execute
> specific search logic, without needing to worry about what format the
> results are going to be returned in.
> 
> 
> It's not writing code to get the stream from one of N known ways
> that's hard -- it's designing an API so we can get the stream from one of
> any number of *unknown* ways that can be specified at run time that's
> tricky :)
> 

Ok, I am still trying to understand your concept of micro-plugin, but I
understand the above and your comments later in this thread that you are
looking for a generic stream resolver/producer (or solrSource). 


On Mon, 2007-01-15 at 12:42 -0800, Chris Hostetter wrote:
> i disagree ... it should be possible to create "micro-plugins" (I
> think i
> called them "UpdateSource" instances in my orriginal suggestion) that
> know
> about getting streams in various ways, but don't care what format of
> data
> is found on those streams -- that would be left for the
> (Update)RequestHandler (which wouldn't need to know where the data
> came
> from)
> 
> a JDBC/SQL updater would probably be a very special case -- where the
> format and the stream are inherently related -- in which case a No-Op
> UpdateSource could be used that didn't provide any stream, and the
> JdbcUpdateRequestHandler would manage it's JDBC streams directly. 

So to understand better:

user request -> micro-plugin -> RequestHandler -> ResponseHandler

Right?

salu2


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Chris Hostetter <ho...@fucit.org>.
: > Right, you're getting at issues of why I haven't committed my CSV handler yet.
: > It currently handles reading a local file (this is more like an SQL
: > update handler... only a reference to the data is passed).  But I also
: > wanted to be able to handle a POST of the data, or even a file
: > upload from a browser.  Then I realized that this should be generic...
: > the same should also apply to XML updates, and potential future update
: > formats like JSON.
:
: I do not see the problem here. One just needs to add a couple of lines in
: the upload servlet and change the CSV plugin to use an input stream (not a local
: file).

what Yonik and I are worried about is that we don't want the list of all
possible ways for an Update Plugin to get a Stream to be hardcoded in the
UpdateServlet or Solr Core or in the Plugins themselves ... we'd like the
notion of indexing docs expressed as CSV records or XML records or JSON
records to be independent of where the CSV, XML, or JSON data stream came
from ... in the same way that the current RequestHandlers can execute
specific search logic, without needing to worry about what format the
results are going to be returned in.


It's not writing code to get the stream from one of N known ways
that's hard -- it's designing an API so we can get the stream from one of
any number of *unknown* ways that can be specified at run time that's
tricky :)



-Hoss


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Thorsten Scherler <th...@juntadeandalucia.es>.
On Fri, 2007-01-12 at 15:41 -0500, Yonik Seeley wrote:
> On 1/10/07, Chris Hostetter <ho...@fucit.org> wrote:
> > The one hitch i think to the notion that updates and queries map
> > cleanly with something like this...
> >
> >   SolrRequestHandler => SolrUpdateHandler
> >   SolrQueryRequest => SolrUpdateRequest
> >   SolrQueryResponse => SolrUpdateResponse (possibly the same class)
> >   QueryResponseWriter => UpdateResponseWriter (possible the same class)
> >
> > ...is that with queries, the "input" tends to be fairly simple.  very
> > generic code can be run by the query Servlet to get all of the input
> > params and build the SolrQueryRequest ... but with updates this isn't
> > quite as simple.  there are the two issues i spoke of in my earlier mail
> > which should be independently configurable:
> >   1) where does the "stream" of update data come from?  is it in the raw
> >      POST body? is it in a POSTed multi-part MIME part? is it a remote
> >      resource referenced by URL?
> >   2) how should the raw binary stream of update data be parsed?  is it
> >      XML? (in the current update format)  is it a CSV file?  is it a PDF?
> >
> > ...#2 can be what the SolrUpdateHandler interface is all about -- when
> > hitting the update url you specify a "ut" (update type) that determines
> > that logic ... but it should be independent of #1
> 
> Right, you're getting at issues of why I haven't committed my CSV handler yet.
> It currently handles reading a local file (this is more like an SQL
> update handler... only a reference to the data is passed).  But I also
> wanted to be able to handle a POST of the data, or even a file
> upload from a browser.  Then I realized that this should be generic...
> the same should also apply to XML updates, and potential future update
> formats like JSON.

I do not see the problem here. One just needs to add a couple of lines
in the upload servlet and change the CSV plugin to read from an input
stream (not a local file).

See
https://issues.apache.org/jira/secure/attachment/12347425/solar-85.with.file.upload.diff
...
+        boolean isMultipart = ServletFileUpload
+                .isMultipartContent(new ServletRequestContext(request));
...
+        if (isMultipart) {
+            // Create a new file upload handler
...
+                    commandReader = new BufferedReader(new InputStreamReader(stream));

Now instead of 
+                    core.update(commandReader, responseWriter);
one would use the update handler for the format defined in the request (format=json):

UpdateHandler handler = core.lookupUpdateHandler(format);
handler.update(commandReader, responseWriter);
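
A slightly fuller sketch of that dispatch (all names hypothetical --
SolrCore has no lookupUpdateHandler() today, and the interface below is
deliberately called FormatUpdateHandler to avoid a clash with Solr's
existing UpdateHandler, which controls index update logic rather than
parsing):

  // Hypothetical format-specific handler interface.
  public interface FormatUpdateHandler {
    void update(java.io.Reader commands, java.io.Writer response)
        throws java.io.IOException;
  }

  // In the upload servlet, once the stream has been opened:
  String format = request.getParameter("format");   // e.g. "xml", "csv", "json"
  if (format == null) format = "xml";               // keep today's default
  FormatUpdateHandler handler = core.lookupUpdateHandler(format);
  handler.update(commandReader, responseWriter);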

Or do I miss something?

salu2


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Ryan McKinley <ry...@gmail.com>.
I got a little carried away and have posted:
http://issues.apache.org/jira/browse/SOLR-104

I'm sorry for not waiting for more discussion on the issue....  but i
had the itch!

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Ryan McKinley <ry...@gmail.com>.
>
> - for errors, use HTTP error codes instead of putting it in the XML as now.
>

yes!

>
> - more REST like?
>

I would like the URL form to look like this:
  /solr/${verb}/?param=value&param=value...
or:
  /solr/${verb}/${handler}/?param=value&param=value...

Commands should work with or without the trailing '/'

/solr/query should continue to support qt=dismax, but I don't think we
should add params for 'at' (add type), 'ct' (commit type), etc...

Examples:

/solr/query/?q=...  (use standard)
/solr/query/dismax/?q=...
/solr/add/
 POST: <docs>...</docs>
/solr/add/sql/?q=SELECT *
/solr/add/csv/?param=xxx...
  POST (fileupload): file.csv
/solr/commit/?waitSearcher=true
/solr/delete/?id=AAA&id=BBB&q=id:[* TO CCC]
/solr/optimize/?waitSearcher=true


RequestHandlers would be registered with a verb (command?) in solrconfig.xml
  <requestHandler command="query" name="dismax" ...>
  <requestHandler command="add" name="xml" class="..>
  <requestHandler command="add" name="sql" class="..>
 <requestHandler command="commit" ..>

RequestHandlers would register verb/name, not just name.  It would also
be nice to specify the default handler in solrconfig.xml (rather than
hardcoding it to 'standard'):
  <requestHandler command="query" name="dismax" isDefault="true">


>
> DEL /solr/document?id=1003
>   OR
> POST /solr/document/delete?id=1003
>

I think the request method should be added to the base SolrRequest, and
the handler left to decide if it will do something different for GET vs.
POST vs. DEL.

Conceptually /commit should be a POST, but it may be nice to use your
browser (GET) to run commands like:
  /solr/commit?waitSearcher=true

If the method is passed to the RequestHandler, this will let anyone
who is unhappy with the standard behavior change it easily.  Someone
may want to require you send DEL to delete and POST to change anything
-- or the opposite if you care.
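
For example (a sketch; it assumes the servlet copies the HTTP method into
a hypothetical "_method" param, which Solr does not do today):

  // Sketch: a delete handler that refuses plain GET requests.
  public void handleRequest(SolrQueryRequest req, SolrQueryResponse rsp) {
    String method = req.getParams().get("_method");  // hypothetical param
    if (!"POST".equals(method) && !"DELETE".equals(method)) {
      rsp.setException(new RuntimeException(
          "deletes must be sent via POST or DELETE, not " + method));
      return;
    }
    // ... perform the delete ...
  }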

>
> - administrative commands, setting certain limits
>
> POST /solr/command/set?mergeFactor=100&maxBufferedDocs=1000
> POST /solr/command/set?logLevel=3
>

In my proposal, this could be something like:
/solr/setconfig?logLevel=3

If someone wrote a handler that set the variables, then saved them in
solrconfig.xml, it could be:
/solr/setconfig/save/?mergeFactor=100

good good
ryan

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Yonik Seeley <yo...@apache.org>.
On 1/10/07, Chris Hostetter <ho...@fucit.org> wrote:
> The one hitch i think to the notion that updates and queries map
> cleanly with something like this...
>
>   SolrRequestHandler => SolrUpdateHandler
>   SolrQueryRequest => SolrUpdateRequest
>   SolrQueryResponse => SolrUpdateResponse (possibly the same class)
>   QueryResponseWriter => UpdateResponseWriter (possible the same class)
>
> ...is that with queries, the "input" tends to be fairly simple.  very
> generic code can be run by the query Servlet to get all of the input
> params and build the SolrQueryRequest ... but with updates this isn't
> quite as simple.  there are the two issues i spoke of in my earlier mail
> which should be independently configurable:
>   1) where does the "stream" of update data come from?  is it in the raw
>      POST body? is it in a POSTed multi-part MIME part? is it a remote
>      resource referenced by URL?
>   2) how should the raw binary stream of update data be parsed?  is it
>      XML? (in the current update format)  is it a CSV file?  is it a PDF?
>
> ...#2 can be what the SolrUpdateHandler interface is all about -- when
> hitting the update url you specify a "ut" (update type) that determines
> that logic ... but it should be independent of #1

Right, you're getting at issues of why I haven't committed my CSV handler yet.
It currently handles reading a local file (this is more like an SQL
update handler... only a reference to the data is passed).  But I also
wanted to be able to handle a POST of the data, or even a file
upload from a browser.  Then I realized that this should be generic...
the same should also apply to XML updates, and potential future update
formats like JSON.

The most important issue is to nail down the external HTTP interface.
If the URL structure changes, it's also an opportunity to change
whatever we don't like about the current XML format.  The old update
URL can still implement the original syntax.
It's also an opportunity to make the interface a little more REST-like
if we so choose.

Brainstorming:
- for errors, use HTTP error codes instead of putting it in the XML as now.

- perhaps get rid of the enclosing <add>... that could be a verb in
the URL, or for multiple documents, change it to <docs>.

- add information about the data in the URL:

POST /solr/add?format=json&overwrite=true
[
  {"field1":"value1", "field2":[false,true,false,true,true]}
]

POST /solr/add?format=csv&separator=,&...
field1,field2
val1,val2

This is more flexible as it allows one to add more metadata about the
data w/o having to change the data format.  For example, if one wanted
to be able to specify which index the add should go to, or other info
about the handling of the data, it's simple to add an additional param
in the URL.
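
From the client side that might look like the following (a sketch against
the proposed /solr/add URL, which doesn't exist yet; host and port are
just the example-server defaults):

  import java.io.IOException;
  import java.io.OutputStream;
  import java.net.HttpURLConnection;
  import java.net.URL;

  // Sketch: POST a CSV payload, with all handling options in the URL
  // (separator=%2C is a URL-encoded comma).
  public static void postCsv(String csv) throws IOException {
    URL url = new URL("http://localhost:8983/solr/add"
        + "?format=csv&separator=%2C&overwrite=true");
    HttpURLConnection con = (HttpURLConnection) url.openConnection();
    con.setRequestMethod("POST");
    con.setDoOutput(true);
    con.setRequestProperty("Content-Type", "text/csv; charset=UTF-8");
    OutputStream out = con.getOutputStream();
    out.write(csv.getBytes("UTF-8"));
    out.close();
    // errors come back as HTTP status codes, per the brainstorm above
    if (con.getResponseCode() >= 400) {
      throw new IOException("update failed: HTTP " + con.getResponseCode());
    }
  }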

- For browser friendliness, we could support a standard mechanism for
putting the body in the URL (not for general use since the URL can be
size limited, but good for testing).

POST /solr/add?format=json&overwrite=true&body=[{"field1":"value1"}]

- more REST like?
PUT /solr/document/1003?title=howdy&author=snafoo&cat=misc&cat=book
#not sure I like that format, and we would still want the multi-doc
format anyway

- more REST like?
DEL /solr/document/1003
  OR
DEL /solr/document?id=1003
  OR
POST /solr/document/delete?id=1003

#how to do delete-by-query, optimize, etc?
DEL/POST /solr/document/delete?q=id:[10 TO 20]
  OR
POST /solr/command/delete?id=1002&id=1003&q=id:[1000 TO 1010]
  OR
POST /solr/command/deletebyquery?q=id:[10 TO 20]

POST /solr/command/optimize?wait=true

- administrative commands, setting certain limits

POST /solr/command/set?mergeFactor=100&maxBufferedDocs=1000
POST /solr/command/set?logLevel=3

You get the idea of some of the options available.
Ideas?  Thoughts?

-Yonik

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Ryan McKinley <ry...@gmail.com>.
On 1/15/07, Chris Hostetter <ho...@fucit.org> wrote:
>
> : >   SolrRequestHandler => SolrUpdateHandler
> : >   SolrQueryRequest => SolrUpdateRequest
> : >   SolrQueryResponse => SolrUpdateResponse (possibly the same class)
> : >   QueryResponseWriter => UpdateResponseWriter (possible the same class)
> : >
> :
> : Is there any reason the plugin system needs a different RequestObject
> : for Query vs Update?
>
> as i said: only to the extent that Updates tend to have streams of data
> that queries don't need (as far as i can imagine)
>

I get it now.  I was confusing UpdateHandler with RequestHandler.
When looking at how to make an 'Update Plugin' I was looking at
UpdateHandler.


> : SolrRequest would be the current SolrQueryRequest augmented with the
> : HTTP method type and a way to get the raw post stream.
>
> the raw POST stream may not be where the data is though -- consider the
> file upload case, or the reading from a local file case, or the reading
> from a list of remote URLs specified in params.
>
> : I'm not sure the nitty gritty, but it should be as close to
> : HttpServletRequest as possible.  If possible, I think handlers should
> : choose how to handle the stream.
> :
> : If it is a remote resource, I think it's the handler's job to open the stream.
>
> i disagree ... it should be possible to create "micro-plugins" (I think i
> called them "UpdateSource" instances in my original suggestion) that know
> about getting streams in various ways, but don't care what format of data
> is found on those streams -- that would be left for the
> (Update)RequestHandler (which wouldn't need to know where the data came
> from)
>

I'm convinced.


>
> : While we are at it... is there any reason (for or against) exposing
> : other parts of the HttpServletRequest to SolrRequestHandlers?
>
> the biggest one is Unit testing -- giving plugins very simple APIs that
> don't require a lot of knowledge about external APIs makes it much easier
> to test them.  it also helps make it possible for us to "future proof"
> plugins.  other messages in this thread have discussed the possibility of
> changing the URL structure, supporting more restful URLs and things like
> that ... if we currently exposed lots of info from the HttpServletRequest
> in the SolrQueryRequest, then making changes like that in a backwards
> compatible way would be nearly impossible.  As it stands, we can write a
> new Servlet that deals with input *completely* differently from the
> current URL structure, and be 99% certain that existing plugins will
> continue to work.
>
> : While it is not the focus of solr, someone (including me!) may want to
> : implement some more complex authentication scheme - Perhaps setting a
> : field on each document saying who added it and from what IP.
> :
> : stuff to consider: cookies, headers, remoteUser, remoteHost...
>
> all of that could conceivably be done by changing the servlet to add
> that info into the SolrParams.
>

damn, you win again!

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Chris Hostetter <ho...@fucit.org>.
: >   SolrRequestHandler => SolrUpdateHandler
: >   SolrQueryRequest => SolrUpdateRequest
: >   SolrQueryResponse => SolrUpdateResponse (possibly the same class)
: >   QueryResponseWriter => UpdateResponseWriter (possible the same class)
: >
:
: Is there any reason the plugin system needs a different RequestObject
: for Query vs Update?

as i said: only to the extent that Updates tend to have streams of data
that queries don't need (as far as i can imagine)

: SolrRequest would be the current SolrQueryRequest augmented with the
: HTTP method type and a way to get the raw post stream.

the raw POST stream may not be where the data is though -- consider the
file upload case, or the reading from a local file case, or the reading
from a list of remote URLs specified in params.

: I'm not sure the nitty gritty, but it should be as close to
: HttpServletRequest as possible.  If possible, I think handlers should
: choose how to handle the stream.
:
: If it is a remote resource, I think it's the handler's job to open the stream.

i disagree ... it should be possible to create "micro-plugins" (I think i
called them "UpdateSource" instances in my original suggestion) that know
about getting streams in various ways, but don't care what format of data
is found on those streams -- that would be left for the
(Update)RequestHandler (which wouldn't need to know where the data came
from)

a JDBC/SQL updater would probably be a very special case -- where the
format and the stream are inherently related -- in which case a No-Op
UpdateSource could be used that didn't provide any stream, and the
JdbcUpdateRequestHandler would manage its JDBC streams directly.
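
A minimal sketch of that special case, using the UpdateSource /
SolrUpdateRequest interfaces proposed earlier in the thread (all of these
names, including ServletSolrParams, are still hypothetical):

  import javax.servlet.http.HttpServletRequest;

  // Sketch: an UpdateSource that supplies no streams at all, for handlers
  // like a JDBC updater that acquire their own data.
  public class NoOpUpdateSource implements UpdateSource {
    public SolrUpdateRequest makeRequest(final HttpServletRequest req) {
      return new SolrUpdateRequest() {
        public SolrParams getParams() {
          return new ServletSolrParams(req);  // wraps the URL query args
        }
        public Iterable<java.io.Reader> getRawUpdates() {
          // no streams: the JdbcUpdateRequestHandler fetches its own data
          return java.util.Collections.<java.io.Reader>emptyList();
        }
      };
    }
  }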

: Likewise I don't see anything in QueryResponseWriter that should tie
: it to 'Query.'  Could it just be ResponseWriter?

probably -- as i said, both it and SolrQueryResponse could probably be
reused, the only hitch is that their names might be confusing (we could
always refactor all of their guts into super classes, and deprecate the
existing classes)

: While we are at it... is there any reason (for or against) exposing
: other parts of the HttpServletRequest to SolrRequestHandlers?

the biggest one is Unit testing -- giving plugins very simple APIs that
don't require a lot of knowledge about external APIs makes it much easier
to test them.  it also helps make it possible for us to "future proof"
plugins.  other messages in this thread have discussed the possibility of
changing the URL structure, supporting more restful URLs and things like
that ... if we currently exposed lots of info from the HttpServletRequest
in the SolrQueryRequest, then making changes like that in a backwards
compatible way would be nearly impossible.  As it stands, we can write a
new Servlet that deals with input *completely* differently from the
current URL structure, and be 99% certain that existing plugins will
continue to work.

: While it is not the focus of solr, someone (including me!) may want to
: implement some more complex authentication scheme - Perhaps setting a
: field on each document saying who added it and from what IP.
:
: stuff to consider: cookies, headers, remoteUser, remoteHost...

all of that could conceivably be done by changing the servlet to add
that info into the SolrParams.



-Hoss


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Ryan McKinley <ry...@gmail.com>.
>
>   SolrRequestHandler => SolrUpdateHandler
>   SolrQueryRequest => SolrUpdateRequest
>   SolrQueryResponse => SolrUpdateResponse (possibly the same class)
>   QueryResponseWriter => UpdateResponseWriter (possible the same class)
>

Is there any reason the plugin system needs a different RequestObject
for Query vs Update?

I think the most flexible system would have the plugin manager take
any HttpServletRequest, convert it to a 'SolrRequest' and pass it to a
RequestHandler.

SolrRequest would be the current SolrQueryRequest augmented with the
HTTP method type and a way to get the raw post stream.

Likewise I don't see anything in QueryResponseWriter that should tie
it to 'Query.'  Could it just be ResponseWriter?

This way the plugin system would not need to care if its a
query/update/ or some other command someone wants to add.


>   1) where does the "stream" of update data come from?  is it in the raw
>      POST body? is it in a POSTed multi-part MIME part? is it a remote
>      resource referenced by URL?

I'm not sure about the nitty-gritty, but it should be as close to
HttpServletRequest as possible.  If possible, I think handlers should
choose how to handle the stream.

If it is a remote resource, I think it's the handler's job to open the stream.

>   2) how should the raw binary stream of update data be parsed?  is it
>      XML? (in the current update format)  is it a CSV file?  is it a PDF?
>
> ...#2 can be what the SolrUpdateHandler interface is all about -- when
> hitting the update url you specify a "ut" (update type) that determines
> that logic ... but it should be independent of #1
>
> maybe the full list of stream sources for #1 is finite and the code for
> all of them can live in the UpdateServlet ... but it still needs to be an
> option configured as a param, and it seems like it might as well be a
> plugin so it's easy for people to write new ones in the future.
>
> -Hoss
>

While we are at it... is there any reason (for or against) exposing
other parts of the HttpServletRequest to SolrRequestHandlers?

While it is not the focus of solr, someone (including me!) may want to
implement some more complex authentication scheme - Perhaps setting a
field on each document saying who added it and from what IP.

stuff to consider: cookies, headers, remoteUser, remoteHost...


ryan

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by Chris Hostetter <ho...@fucit.org>.
: indexing app I wrote into SOLR.  It occurred to me that it would almost
: be simpler to use the plugin-friendly QueryRequest mechanism rather than
: the UpdateRequest mechanism; coupled with what you wrote below, Hoss, it
: makes me think that a little refactoring of request handling might go a
: long way:

I think you are definitely right ... refactoring some of the
SolrRequestHandler/SolrQueryRequest/SolrQueryResponse interfaces/abstract
base classes to have some bases extendable by some other
SolrUpdateHandler/SolrUpdateRequest/SolrUpdateResponse interfaces/abstract
base classes would go a long way.

Your post also made me realize that i'd totally discounted the issue of
returning information about the *results* of the update back to the client
... currently it's done with XML which is "ok" because in order to send an
update the client has to understand XML -- but if we start supporting
arbitrary formats for updates, we need to be able to respond in kind.
Your comment about reusing SolrQueryResponse and QueryResponseWriters for
this sounds perfect.

The one hitch i think to the notion that updates and queries map
cleanly with something like this...

  SolrRequestHandler => SolrUpdateHandler
  SolrQueryRequest => SolrUpdateRequest
  SolrQueryResponse => SolrUpdateResponse (possibly the same class)
  QueryResponseWriter => UpdateResponseWriter (possible the same class)

...is that with queries, the "input" tends to be fairly simple.  very
generic code can be run by the query Servlet to get all of the input
params and build the SolrQueryRequest ... but with updates this isn't
quite as simple.  there are the two issues i spoke of in my earlier mail
which should be independently configurable:
  1) where does the "stream" of update data come from?  is it in the raw
     POST body? is it in a POSTed multi-part MIME part? is it a remote
     resource referenced by URL?
  2) how should the raw binary stream of update data be parsed?  is it
     XML? (in the current update format)  is it a CSV file?  is it a PDF?

...#2 can be what the SolrUpdateHandler interface is all about -- when
hitting the update url you specify a "ut" (update type) that determines
that logic ... but it should be independent of #1

maybe the full list of stream sources for #1 is finite and the code for
all of them can live in the UpdateServlet ... but it still needs to be an
option configured as a param, and it seems like it might as well be a
plugin so it's easy for people to write new ones in the future.

-Hoss


Update Plugins (was Re: Handling disparate data sources in Solr)

Posted by "J.J. Larrea" <jj...@panix.com>.
Along similar lines I have been thinking of how to splice a Lucene indexing app I wrote into SOLR.  It occurred to me that it would almost be simpler to use the plugin-friendly QueryRequest mechanism rather than the UpdateRequest mechanism; coupled with what you wrote below, Hoss, it makes me think that a little refactoring of request handling might go a long way:

SolrRequestHandler now defines

  public void handleRequest(SolrQueryRequest req, SolrQueryResponse rsp)

Interface SolrQueryRequest and abstract implementation SolrQueryRequestBase are mainly involved with parsing request parameters; the only method signatures which are query-specific are getSearcher() and the @deprecated getQueryString() and getQueryType().

SolrQueryResponse is mainly concerned with building a generic response message including execution time, though it also supports a default set of returned field names.

So SolrRequestHandler.handleRequest could be changed to

  public void handleRequest(SolrRequest req, SolrResponse rsp)

with SolrRequest and SolrResponse interfaces having the generic functionality described above.

Then SolrQueryRequest and SolrQueryResponse could be crafted as sub-interfaces and/or abstract implementations segregating the few Query-specific methods.  One would also create SolrUpdateRequest and SolrUpdateResponse interfaces and/or base implementations much the same way.
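
In code, that split might look something like this (a sketch; the
update-specific getRawUpdates() follows Hoss's earlier proposal):

  // Sketch: generic request/response at the top of the hierarchy...
  public interface SolrRequest {
    SolrParams getParams();
    void close();
  }

  public interface SolrResponse {
    void add(String name, Object value);  // build the response message
    void setException(Exception e);
  }

  // ...with the query- and update-specific members pushed down.
  public interface SolrQueryRequest extends SolrRequest {
    SolrIndexSearcher getSearcher();           // query-specific
  }

  public interface SolrUpdateRequest extends SolrRequest {
    Iterable<java.io.Reader> getRawUpdates();  // update-specific
  }

  public interface SolrRequestHandler {
    void handleRequest(SolrRequest req, SolrResponse rsp);
  }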

Then in SolrCore, the RequestHandler registry and execute() method would without modification handle both Query and Update requests; the code in SolrCore.update and SolrCore.readDoc should be moved into an implementation of SolrRequestHandler, e.g. DefaultUpdateRequestHandler, which would be registered under the request name "update" and could then be subclassed by users. It could then use SolrResponse to formulate the response, and would get the request timing information put in by SolrCore.execute() for free, as well as the pluggable response format mechanism.

Note the UpdateRequestHandler which formulates update requests would be separate from the UpdateHandler, which controls the update logic (index acrobatics).

Finally, the SolrUpdateServlet could be cast as a trivial subclass of SolrServlet; perhaps all it needs to do is to set the default value for the request type to "update" rather than "standard", for reverse compatibility, and perhaps to let a parameter other than 'qt' be used to specify the request type for updates.

I am pretty sure something along these lines would accomplish all the benefits you suggest below and more, with a minimal amount of coding and fairly good reverse-compatibility.  It of course still leaves the hard work of writing the actual update handler plugins.  But it's a lot simpler to subclass an UpdateRequestHandler than SolrCore!

What do you folks think?

- J.J.

PS: If I weren't up to my ears in other deadline-driven deliverables, I'd just jump in and try it.

At 4:21 PM -0800 1/7/07, Chris Hostetter wrote:
>It seems like [Handling disparate data sources in Solr] could be addressed by modifying the SolrUpdateServlet to support two low-level query params, similar to the way the SolrServlet looks at "qt" and "wt".  The first param would be used to pick an UpdateSource plugin that would have an API like...
>  public interface UpdateSource {
>     SolrUpdateRequest makeRequest(HttpServletRequest req);
>  }
>
>with the SolrUpdateRequest interface looking something like...
>  public interface SolrUpdateRequest {
>     SolrParams getParams();
>     Iterable<java.io.Reader> getRawUpdates();
>  }
>
>different out of the box versions of UpdateSource would support building
>SolrUpdateRequest objects from HttpServletRequests using...
>  1) URL query args and the raw POST body
>  2) query args from multipart form input and Readers from file uploads
>  3) query args and local filenames specified in query args
>  4) query args and remote URLs specified in query args
>
>The SolrUpdateServlet would then use SolrUpdateRequest.getParams() to
>look up its second core param for picking an UpdateParser plugin, which
>would be responsible for parsing all of those Readers in sequence,
>converting them to UpdateCommands, and calling the appropriate methods on
>the UpdateHandler.
>
>Out of the box versions of UpdateParser could do the XML parsing currently
>done, or JSON parsing, or CSV parsing.  Custom plugins written by users
>could do more exotic schema specific parsing: ie, reading raw PDFs and
>extracting specific field values.
>
>
>what do you guys think?
>
>
>-Hoss


Re: Handling disparate data sources in Solr

Posted by Chris Hostetter <ho...@fucit.org>.
: > There has been some discussion about adding plugin support for the
: > "update" side of things as well -- at a very simple level this could allow
: > for messages to be sent via JSON, or CSV instead of just XML -- but

: I'm interested in discussing this further.  I've moved the discussion
: onto solr-dev, as suggested.

Currently, the "modularity" of updates is configurable only the
upateHandler -- which decides how instances of "UpdateCommand" will be
handled by the SOlrCore (directly, via a temp index, etc...)

The relevant discussion so far seems to have focused on two
different aspects of the issue related to how SolrCore gets those commands...
  1) parsing different String representations (ie: XML vs JSON vs CSV) of
     the same basic command structure (ie: "add" containing "doc"s,
     containing "field"s)
  2) different means of feeding those String commands to Solr (raw POST,
     CGI file upload, local file)

with this thread, a third aspect has been brought up:
  3) sending Solr more "raw" data and letting a plugin extract the
     individual fields based on rules (IE: parsing a PDF and determining the
     "title" and "body" on the server side)

It seems like these issues could be addressed by modifying the
SolrUpdateServlet to support two low-level query params, similar to the
way the SolrServlet looks at "qt" and "wt".  The first param would be used
to pick an UpdateSource plugin that would have an API like...
  public interface UpdateSource {
     SolrUpdateRequest makeRequest(HttpServletRequest req);
  }

with the SolrUpdateRequest interface looking something like...
  public interface SolrUpdateRequest {
     SolrParams getParams();
     Iterable<java.io.Reader> getRawUpdates();
  }

different out of the box versions of UpdateSource would support building
SolrUpdateRequest objects from HttpServletRequests using...
  1) URL query args and the raw POST body (see the sketch below)
  2) query args from multipart form input and Readers from file uploads
  3) query args and local filenames specified in query args
  4) query args and remote URLs specified in query args
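
For illustration, here is a sketch of option 1 against those interfaces
(the implementation class and the ServletSolrParams wrapper are
hypothetical):

  import javax.servlet.http.HttpServletRequest;

  // Sketch: an UpdateSource that exposes the raw POST body as the single
  // update stream, with params taken from the URL query string.
  public class RawPostUpdateSource implements UpdateSource {
    public SolrUpdateRequest makeRequest(final HttpServletRequest req) {
      return new SolrUpdateRequest() {
        public SolrParams getParams() {
          return new ServletSolrParams(req);  // wraps the URL query args
        }
        public Iterable<java.io.Reader> getRawUpdates() {
          try {
            // the raw POST body is the one and only update stream
            return java.util.Collections.singletonList(
                (java.io.Reader) req.getReader());
          } catch (java.io.IOException e) {
            throw new RuntimeException(e);
          }
        }
      };
    }
  }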

The SolrUpdateServlet would then use SolrUpdateRequest.getParams() to
look up its second core param for picking an UpdateParser plugin, which
would be responsible for parsing all of those Readers in sequence,
converting them to UpdateCommands, and calling the appropriate methods on
the UpdateHandler.

Out of the box versions of UpdateParser could do the XML parsing currently
done, or JSON parsing, or CSV parsing.  Custom plugins written by users
could do more exotic schema specific parsing: ie, reading raw PDFs and
extracting specific field values.
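
The parser side might look like this (again a sketch: UpdateParser is
hypothetical, while UpdateHandler and AddUpdateCommand are Solr's
existing update classes):

  // Sketch: an UpdateParser turns raw Readers into UpdateCommands and
  // feeds them to the UpdateHandler; all format knowledge lives here.
  public interface UpdateParser {
    void parse(SolrUpdateRequest req, UpdateHandler updater)
        throws java.io.IOException;
  }

  public class CSVUpdateParser implements UpdateParser {
    public void parse(SolrUpdateRequest req, UpdateHandler updater)
        throws java.io.IOException {
      for (java.io.Reader r : req.getRawUpdates()) {
        java.io.BufferedReader in = new java.io.BufferedReader(r);
        String header = in.readLine();
        if (header == null) continue;             // empty stream
        String[] fieldNames = header.split(",");  // header row
        String line;
        while ((line = in.readLine()) != null) {
          String[] vals = line.split(",");        // naive: no quoting support
          AddUpdateCommand cmd = new AddUpdateCommand();
          // ... build cmd.doc from fieldNames/vals via the IndexSchema ...
          updater.addDoc(cmd);
        }
      }
    }
  }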


what do you guys think?


-Hoss


Re: Handling disparate data sources in Solr

Posted by Alan Burlison <Al...@sun.com>.
Erik Hatcher wrote:

> The idea of having Solr handle various document types is a good one, for 
> sure.  I'm not sure what specifics would need to be implemented, but I 
> at least wanted to reply and say it's a good idea!
> 
> Care has to be taken when passing a URL to Solr for it to go fetch, 
> though.  There are a lot of complexities in fetching resources via HTTP, 
> especially when handing something off to Solr which should be behind a 
> firewall and may not be able to see the web as you would with your browser.

In that case the client should encode the content and send it as part of 
the index insert/update request - the aim is merely to prevent the bloat 
caused by encoding the document (e.g. as base64) when the indexer can 
access the source document directly.

-- 
Alan Burlison
--

Re: Handling disparate data sources in Solr

Posted by Alan Burlison <Al...@sun.com>.
Walter Underwood wrote:

> Cracking documents and spidering URLs are both big, big problems.
> PDF is a horrid mess, as are old versions of MS Office. Proxies,
> logins, cookies, all sorts of issues show up with fetching URLs,
> along with a fun variety of misbehaving servers.
> 
> I remember crashing one server with 25 GET requests before we
> implemented session cookies in our spider. That used up all the
> DB connections and killed the server.
> 
> If you need to do a lot of spidering and parse lots of kinds of
> documents, I don't know of an open source solution for that.
> Products like Ultraseek and the Googlebox are about your only
> choice.

I'm not suggesting that Solr be extended to become a spider, I'm just 
suggesting we provide a mechanism for direct access to source documents 
if they are accessible.  For example if the document being indexed was 
on the same machine as Solr, the href would usually start "file://", not 
"http://"

BTW, this discussion is also occurring on solr-dev, it might be better 
to move all of it over there ;-)

-- 
Alan Burlison
--

Re: Handling disparate data sources in Solr

Posted by Walter Underwood <wu...@netflix.com>.
On 1/7/07 7:24 AM, "Erik Hatcher" <er...@ehatchersolutions.com> wrote:

> Care has to be taken when passing a URL to Solr for it to go fetch,
> though.  There are a lot of complexities in fetching resources via
> HTTP, especially when handing something off to Solr which should be
> behind a firewall and may not be able to see the web as you would
> with your browser.

Cracking documents and spidering URLs are both big, big problems.
PDF is a horrid mess, as are old versions of MS Office. Proxies,
logins, cookies, all sorts of issues show up with fetching URLs,
along with a fun variety of misbehaving servers.

I remember crashing one server with 25 GET requests before we
implemented session cookies in our spider. That used up all the
DB connections and killed the server.

If you need to do a lot of spidering and parse lots of kinds of
documents, I don't know of an open source solution for that.
Products like Ultraseek and the Googlebox are about your only
choice.

wunder
-- 
Walter Underwood
Search Guru, Netflix
Former Architect for Ultraseek


Re: Handling disparate data sources in Solr

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
The idea of having Solr handle various document types is a good one,  
for sure.  I'm not sure what specifics would need to be implemented,  
but I at least wanted to reply and say it's a good idea!

Care has to be taken when passing a URL to Solr for it to go fetch,  
though.  There are a lot of complexities in fetching resources via  
HTTP, especially when handing something off to Solr which should be  
behind a firewall and may not be able to see the web as you would  
with your browser.

	Erik


On Jan 4, 2007, at 4:53 PM, Alan Burlison wrote:

> Original problem statement:
>
> ----------
> I'm considering using Solr to replace an existing bare-metal Lucene  
> deployment - the current Lucene setup is embedded inside an  
> existing monolithic webapp, and I want to factor out the search  
> functionality into a separate webapp so it can be reused more easily.
>
> At present the content of the Lucene index comes from many  
> different sources (web pages, documents, blog posts etc) and can be  
> different formats (plaintext, HTML, PDF etc).  All the various  
> content types are rendered to plaintext before being inserted into  
> the Lucene index.
>
> The net result is that the data in one field in the index (say  
> "content") may have come from one of a number of source document  
> types.  I'm having difficulty understanding how I might map this  
> functionality onto Solr.  I understand how (for example) I could  
> use HTMLStripStandardTokenizer to insert the contents of a HTML  
> document into a field called "content", but (assuming I'd written a  
> PDF analyser) how would I insert the content of a PDF document into  
> the same "content" field?
>
> I know I could do this by preprocessing the various document types  
> to plaintext in the various Solr clients before inserting the data  
> into the index, but that means that each client would need to know  
> how to do the document transformation.  As well as centralising the  
> index, I also want to centralise the handling of the different  
> document types.
> ----------
>
> My initial suggestion, to get the discussion started, is to extend  
> the <doc> and <field> element with the following attributes:
>
> mime-type
> Mime type of the document, e.g. application/pdf, text/html and so on.
>
> encoding
> Encoding of the document, with base64 being the standard  
> implementation.
>
> href
> The URL of any documents that can be accessed over HTTP, instead of  
> embedding them in the indexing request.  The indexer would fetch  
> the document using the specified URL.
>
> There would then be entries in the configuration file that map each  
> MIME type to a handler that is capable of dealing with that  
> document type.
>
> Thoughts?
>
> -- 
> Alan Burlison
> --


Re: Handling disparate data sources in Solr

Posted by Alan Burlison <Al...@sun.com>.
Original problem statement:

----------
I'm considering using Solr to replace an existing bare-metal Lucene 
deployment - the current Lucene setup is embedded inside an existing 
monolithic webapp, and I want to factor out the search functionality 
into a separate webapp so it can be reused more easily.

At present the content of the Lucene index comes from many different 
sources (web pages, documents, blog posts etc) and can be different 
formats (plaintext, HTML, PDF etc).  All the various content types are 
rendered to plaintext before being inserted into the Lucene index.

The net result is that the data in one field in the index (say 
"content") may have come from one of a number of source document types. 
  I'm having difficulty understanding how I might map this functionality 
onto Solr.  I understand how (for example) I could use 
HTMLStripStandardTokenizer to insert the contents of a HTML document 
into a field called "content", but (assuming I'd written a PDF analyser) 
how would I insert the content of a PDF document into the same "content" 
field?

I know I could do this by preprocessing the various document types to 
plaintext in the various Solr clients before inserting the data into the 
index, but that means that each client would need to know how to do the 
document transformation.  As well as centralising the index, I also want 
to centralise the handling of the different document types.
----------

My initial suggestion, to get the discussion started, is to extend the 
<doc> and <field> element with the following attributes:

mime-type
Mime type of the document, e.g. application/pdf, text/html and so on.

encoding
Encoding of the document, with base64 being the standard implementation.

href
The URL of any documents that can be accessed over HTTP, instead of 
embedding them in the indexing request.  The indexer would fetch the 
document using the specified URL.

There would then be entries in the configuration file that map each MIME 
type to a handler that is capable of dealing with that document type.
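
For example, an add message using those attributes might look something 
like this (purely illustrative):

  <add>
    <doc>
      <field name="id">report-42</field>
      <field name="content" mime-type="application/pdf"
             encoding="base64">JVBERi0xLjQK...</field>
    </doc>
    <doc>
      <field name="id">intro</field>
      <field name="content" mime-type="text/html"
             href="file:///docs/intro.html"/>
    </doc>
  </add>

The configured handler for application/pdf would decode and render the 
first field to plaintext; the handler for text/html would fetch the 
second document itself and strip the markup.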

Thoughts?

-- 
Alan Burlison
--

Re: Handling disparate data sources in Solr

Posted by Alan Burlison <Al...@sun.com>.
Chris Hostetter wrote:

> For your purposes, if you've got a system that works and does the Document
> conversion for you, then you are probably right: Solr may not be a useful
> addition to your architecture.  Solr doesn't really attempt to solve the
> problem of parsing different kinds of data streams into a unified Document
> model -- it just tries to expose all of the Lucene goodness through an
> easy to use, easy to configure, HTTP interface.  Besides the
> configuration, Solr's other means of being a value add is in its
> IndexReader management, its caching, and its plugin support for mixing
> and matching request handlers, output writers, and field types as easily
> as you can mix and match Analyzers.
> 
> There has been some discussion about adding plugin support for the
> "update" side of things as well -- at a very simple level this could allow
> for messages to be sent via JSON, or CSV instead of just XML -- but
> there's no reason a more complex update plugin couldn't read in a binary PDF
> file and parse it into its appropriate fields ... but we aren't
> quite there yet.  Feel free to bring this up on solr-dev if you'd be
> interested in working on it.

I'm interested in discussing this further.  I've moved the discussion 
onto solr-dev, as suggested.

-- 
Alan Burlison
--

Re: Handling disparate data sources in Solr

Posted by Chris Hostetter <ho...@fucit.org>.
: > You could do it in Solr.  The difficulty is that arbitrary binary data
: > is not easily transferred via XML.  So you must specify that the input
: > is in base64 or some other encoding.  Then you could decode it on the
: > fly using a custom Analyzer before passing it along.
:
: Why won't CDATA work?

because your binary data might contain the byte sequence 0x5D 0x5D 0x3E
("]]>") -- indicating the end of the CDATA section. CDATA is short for
"Character DATA" -- you can't put arbitrary binary data (or even arbitrary
text) in it and be sure that it will work.

: > It might be easier to do this outside of solr, but still in a
: > centralized manner.  Write another webapp which accepts files.   It
: > will decode them appropriately and pass them along to the solr
: > instance in the same container.  Then your clients don't even need to
: > know how to talk to solr.
:
: In that case there's little point in using Solr at all - the main
: benefit it gives me is that I don't have to write all the HTTP protocol
: bits.  If I have to do that myself I might as well use raw Lucene - and
: in fact that's how the existing system works.

For your purposes, if you've got a system that works and does the Document
conversion for you, then you are probably right: Solr may not be a useful
addition to your architecture.  Solr doesn't really attempt to solve the
problem of parsing different kinds of data streams into a unified Document
model -- it just tries to expose all of the Lucene goodness through an
easy to use, easy to configure, HTTP interface.  Besides the
configuration, Solr's other means of being a value add is in its
IndexReader management, its caching, and its plugin support for mixing
and matching request handlers, output writers, and field types as easily
as you can mix and match Analyzers.

There has been some discussion about adding plugin support for the
"update" side of things as well -- at a very simple level this could allow
for messages to be sent via JSON, or CSV instead of just XML -- but
there's no reason a more complex update plugin couldn't read in a binary PDF
file and parse it into its appropriate fields ... but we aren't
quite there yet.  Feel free to bring this up on solr-dev if you'd be
interested in working on it.


-Hoss


Re: Handling disparate data sources in Solr

Posted by Walter Underwood <wu...@netflix.com>.
On 12/23/06 5:28 AM, "Alan Burlison" <Al...@sun.com> wrote:

>> You could do it in Solr.  The difficulty is that arbitrary binary data
>> is not easily transferred via XML.  So you must specify that the input
>> is in base64 or some other encoding.  Then you could decode it on the
>> fly using a custom Analyzer before passing it along.
> 
> Why won't CDATA work?

Some octet (byte) values are illegal in XML. Most of the ASCII control
characters are not allowed. If one of those appears in an XML document,
it is a fatal error, and any conforming XML parser must stop parsing.
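
The legal range is small enough to check directly; a quick Java predicate
for XML 1.0 (section 2.2 of the spec):

  // True iff the code point may appear in an XML 1.0 document at all --
  // even inside a CDATA section.
  public static boolean isLegalXmlChar(int c) {
    return c == 0x9 || c == 0xA || c == 0xD
        || (c >= 0x20    && c <= 0xD7FF)
        || (c >= 0xE000  && c <= 0xFFFD)
        || (c >= 0x10000 && c <= 0x10FFFF);
  }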

wunder
-- 
Walter Underwood
Search Guru, Netflix




Re: Handling disparate data sources in Solr

Posted by Alan Burlison <Al...@sun.com>.
Mike Klaas wrote:

> You could do it in Solr.  The difficulty is that arbitrary binary data
> is not easily transferred via XML.  So you must specify that the input
> is in base64 or some other encoding.  Then you could decode it on the
> fly using a custom Analyzer before passing it along.

Why won't CDATA work?

> It might be easier to do this outside of solr, but still in a
> centralized manner.  Write another webapp which accepts files.   It
> will decode them appropriately and pass them along to the solr
> instance in the same container.  Then your clients don't even need to
> know how to talk to solr.

In that case there's little point in using Solr at all - the main 
benefit it gives me is that I don't have to write all the HTTP protocol 
bits.  If I have to do that myself I might as well use raw Lucene - and 
in fact that's how the existing system works.

-- 
Alan Burlison
--

Re: Handling disparate data sources in Solr

Posted by Mike Klaas <mi...@gmail.com>.
On 12/22/06, Alan Burlison <Al...@sun.com> wrote:

> At present the content of the Lucene index comes from many different
> sources (web pages, documents, blog posts etc) and can be different
> formats (plaintext, HTML, PDF etc).  All the various content types are
> rendered to plaintext before being inserted into the Lucene index.
>
> The net result is that the data in one field in the index (say
> "content") may have come from one of a number of source document types.
>   I'm having difficulty understanding how I might map this functionality
> onto Solr.  I understand how (for example) I could use
> HTMLStripStandardTokenizer to insert the contents of a HTML document
> into a field called "content", but (assuming I'd written a PDF analyser)
> how would I insert the content of a PDF document into the same "content"
> field?

You could do it in Solr.  The difficulty is that arbitrary binary data
is not easily transferred via XML.  So you must specify that the input
is in base64 or some other encoding.  Then you could decode it on the
fly using a custom Analyzer before passing it along.
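
One way to do the encoding on the client side (a sketch using Apache
commons-codec; the field and handler wiring is left out):

  import java.io.ByteArrayOutputStream;
  import java.io.File;
  import java.io.FileInputStream;
  import java.io.IOException;
  import java.io.InputStream;
  import org.apache.commons.codec.binary.Base64;

  // Base64-encode a binary document so it can travel safely inside a
  // Solr XML <add> message; the output uses only [A-Za-z0-9+/=], all of
  // which are legal XML characters.
  public static String encodeForXml(File f) throws IOException {
    InputStream in = new FileInputStream(f);
    try {
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      byte[] chunk = new byte[4096];
      int n;
      while ((n = in.read(chunk)) != -1) {
        buf.write(chunk, 0, n);
      }
      return new String(Base64.encodeBase64(buf.toByteArray()), "US-ASCII");
    } finally {
      in.close();
    }
  }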

It might be easier to do this outside of Solr, but still in a
centralized manner.  Write another webapp which accepts files.  It
will decode them appropriately and pass them along to the Solr
instance in the same container.  Then your clients don't even need to
know how to talk to Solr.

-Mike