Posted to dev@lucene.apache.org by Grant Ingersoll <gs...@apache.org> on 2010/12/01 14:01:50 UTC

Document aware analyzers was Re: deprecating Versions

On Nov 29, 2010, at 5:34 AM, Robert Muir wrote:

> On Mon, Nov 29, 2010 at 2:50 AM, Earwin Burrfoot <ea...@gmail.com> wrote:
>> And for indexes:
>> * Index compatibility is guaranteed across two adjacent major
>> releases, e.g. 2.x -> 3.x, 3.x -> 4.x.
>>  That includes both binary compat - codecs, and semantic compat -
>> analyzers (if the appropriate Version is used).
>> * Older releases are most probably unsupported.
>>  e.g. 4.x still supports shared docstores for reading, though never
>> writes them. 5.x won't read them either, so you'll have to at least
>> fully optimize your 3.x indexes when going through 4.x to 5.x.
>> 
> 
> Is it somehow possible I could convince everyone that all the
> analyzers we provide are simply examples?
> This way we could really make this a bit more reasonable and clean up
> a lot of stuff.
> 
> Seems like we really want to move towards a more declarative model
> where these are just config files... so only then will it be OK for us
> to change them because they suddenly aren't suffixed with .java?!

While we are at it, how about we make the analysis process document-aware instead of field-aware?  The PerFieldAnalyzerWrapper, while doing exactly what it says it does, is just silly.  If you had an analysis process that was aware, if it chose to be, of the document as a whole, then you would open up a whole lot more opportunity for doing interesting analysis while losing nothing in the individual treatment of fields.  The TeeSink stuff is an attempt at this, but it is not sufficient.
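
To make the PerFieldAnalyzerWrapper point concrete, this is roughly what the field-centric setup looks like today (a minimal sketch against the 3.x-era API; the field names and analyzer choices here are made up, and the exact constructor and method names shift a bit between versions):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.KeywordAnalyzer;
    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.util.Version;

    public class PerFieldExample {
      public static Analyzer build() {
        // One default analyzer plus per-field overrides.  Every field is
        // configured and analyzed in isolation; nothing in this chain ever
        // sees the document the fields belong to.
        PerFieldAnalyzerWrapper wrapper =
            new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_30));
        wrapper.addAnalyzer("id", new KeywordAnalyzer());
        wrapper.addAnalyzer("body_en", new StandardAnalyzer(Version.LUCENE_30));
        return wrapper;
      }
    }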

Just a thought,
Grant


Re: Document aware analyzers was Re: deprecating Versions

Posted by Robert Muir <rc...@gmail.com>.
On Wed, Dec 1, 2010 at 3:44 PM, Grant Ingersoll <gs...@apache.org> wrote:

>> Well I have trouble with a few of your examples: "want to use
>> Tee/Sink" doesn't work for me... it's a description of an XY problem to
>> me... I've never needed to use it, and it's rarely discussed on the
>> user list...
>
> Shrugs.  In my experiments, it can really speed things up when analyzing the same content, but with different outcomes, or at least it did back before the new API.

<snip>

> For instance, consider the typical copy field scenario where one has two fields containing the same content analyzed in slightly different ways.  In many cases, most of the work is exactly the same (tokenize, lowercase, remove stop words, stem or not), and yet we have to pass the string around twice and do almost all of the same work twice, all so that we can change one little thing on the token.
>

But didn't you just answer your own question? Sounds like you just need
to implement copyField in Solr with Tee/Sink.
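
Roughly like this, I mean (just a sketch against the 3.x-era classes, untested; package locations and names move around between versions, and the tee'd field has to be consumed before the sink):

    import java.io.StringReader;

    import org.apache.lucene.analysis.TeeSinkTokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.util.Version;

    public class CopyFieldViaTeeSink {
      public static Document makeDoc(String text) {
        // Tokenize the content once...
        TokenStream source =
            new StandardTokenizer(Version.LUCENE_30, new StringReader(text));
        TeeSinkTokenFilter tee = new TeeSinkTokenFilter(source);
        // ...and feed the "copy" field from the sink instead of re-tokenizing.
        TokenStream sink = tee.newSinkTokenStream();

        // Different filters (lowercasing, stemming, ...) could be stacked on
        // top of tee and sink independently before the fields are added.
        Document doc = new Document();
        doc.add(new Field("body", tee));       // consumed first by the indexer
        doc.add(new Field("body_copy", sink)); // replays the cached tokens
        return doc;
      }
    }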



Re: Document aware analyzers was Re: deprecating Versions

Posted by Earwin Burrfoot <ea...@gmail.com>.
I agree with Robert that minimizing the analysis <-> indexer interface is
the way to go.
For me, one of Lucene's problems is that it wants to do too much stuff
out of the box and is tightly coupled, so you can't drop many of the
things you never need.

Having a minimal interface for the indexer allows us to experiment with
various analysis approaches without touching core functionality at all.
For instance, I'd like to make the analysis chain non-streaming. This
would greatly simplify many of my filters and, for my use case, would
likely yield better performance.
At the same time, I understand that many people can't afford to keep
their docs completely in memory while indexing.

My ideal API is unusable for them, and wrapping my token buffers to look
like Document+Fieldables+TokenStreams is ugly.

Having the lowest possible common denominator as the indexer interface
is best for both parties.
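
Purely to illustrate what I mean by "lowest common denominator" (this is made up for discussion, not an existing or proposed Lucene API), the indexer could pull something as small as this, and it would never know whether a streaming chain or a fully buffered pass produced it:

    // Hypothetical sketch only; nothing like this exists in Lucene today.
    // The indexer walks fields and terms; whether they come from a streaming
    // TokenStream chain or from an in-memory, non-streaming analysis pass is
    // invisible to it.
    public interface InvertedFieldSource {
      /** Advance to the next field of the current document; false when done. */
      boolean nextField();
      String fieldName();

      /** Advance to the next term in the current field; false when done. */
      boolean nextTerm();
      CharSequence term();
      int positionIncrement();
      int startOffset();
      int endOffset();
    }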

On Wed, Dec 1, 2010 at 23:44, Grant Ingersoll <gs...@apache.org> wrote:
>
> On Dec 1, 2010, at 2:40 PM, Robert Muir wrote:
>
>> On Wed, Dec 1, 2010 at 2:25 PM, Grant Ingersoll <gs...@apache.org> wrote:
>>>
>>> Nah, I just meant analysis would often benefit from having knowledge of the document as a whole instead of just the individual field.
>>>
>>
>> and analysis would suffer from this too, because right now these
>> things are independent and we have a fast, simple, reusable model.
>> I'd prefer to keep the TokenStream analysis API... but as we have
>> discussed on the list, it would be nice to minimize the interface
>> between analysis components and indexer/queryparser so you can use an
>> *alternative* API... we are working in this direction already.
>
> I think the existing TokenStream API still works, at least in my mind.
>
>>
>>>>
>>>> Maybe if you give a concrete example then I would have a better
>>>> understanding of the problem you think this might solve.
>>>
>>> Let me see if I can put some flesh on the bones.  I'm assuming the raw document has already been parsed and that we are still basically dealing with strings and that we have a document which contains one or more fields.
>>>
>>> If we step back and look at our analysis process, there are some things that are easy and some things that are hard that maybe shouldn't be, because even though we talk like we are indexing and searching documents, we are really indexing and searching fields and everything is field-centric.  That works fine for the easy analysis things like tokenization, stemming, lowercasing, etc. when all the content is in one language.  It doesn't work well when you have multiple languages in a single document or if you want to do things like Tee/Sink or even something as simple as Solr's copy field semantics.
>>
>> Well I have trouble with a few of your examples: "want to use
>> Tee/Sink" doesn't work for me... it's a description of an XY problem to
>> me... I've never needed to use it, and it's rarely discussed on the
>> user list...
>
> Shrugs.  In my experiments, it can really speed things up when analyzing the same content, but with different outcomes, or at least it did back before the new API.  My bigger point is that things like that and the PerFieldAnalyzerWrapper are symptoms of treating documents as second-class citizens.
>
>>
>> As far as working with a lot of languages, I understand this issue
>> much more... but I've never much had a desire for this, especially
>> given the fact that "Query is a document too"... I'm personally not a
>> fan of language detection, and I don't think it belongs in our analysis
>> API: like encoding detection and other similar heuristics, it's part of
>> document parsing to me!
>
> I didn't say it did; I just said it is an example of the types of things where we pretend we are document-centric, but we are actually field-centric.
>
>>
>> As I said before, I think our TokenStream analysis API is already
>> quite complicated, and I don't think we should make it more complicated
>> for these reasons (especially since these examples are quite vague and
>> I'm still not sure they can't be solved more easily another way).
>
> I never said you couldn't solve them in other ways, but I always find they are kludgy.  For instance, how many times, in a complex environment, must one tokenize the same text over and over again just to get it in the index?
>
>>
>> If you want to use a more complicated analysis API that doesn't work
>> like TokenStreams but instead incorporates things that are document
>> parsing or whatever, I guess you should be able to do that. I'm not
>> sure Lucene should provide such an API, but we shouldn't force you to
>> use the TokenStreams API either.
>
> You keep going back to document parsing, even though I have never mentioned it.  All I am proposing/_wanting to discuss_ is the notion that analysis might benefit from a more document-centric view.  You're presupposing I want to change TokenStreams, etc., when all I want to do is take a step back and discuss the bigger picture of how a user actually does analysis in the real world and whether we can make it easier for them.  I don't even have an implementation in mind yet.
>
> For instance, consider the typical copy field scenario where one has two fields containing the same content analyzed in slightly different ways.  In many cases, most of the work is exactly the same (tokenize, lowercase, remove stop words, stem or not), and yet we have to pass the string around twice and do almost all of the same work twice, all so that we can change one little thing on the token.
>
> -Grant



-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Phone: +7 (495) 683-567-4
ICQ: 104465785



Re: Document aware analyzers was Re: deprecating Versions

Posted by Grant Ingersoll <gs...@apache.org>.
On Dec 1, 2010, at 2:40 PM, Robert Muir wrote:

> On Wed, Dec 1, 2010 at 2:25 PM, Grant Ingersoll <gs...@apache.org> wrote:
>> 
>> Nah, I just meant analysis would often benefit from having knowledge of the document as a whole instead of just the individual field.
>> 
> 
> and analysis would suffer from this too, because right now these
> things are independent and we have a fast, simple, reusable model.
> I'd prefer to keep the TokenStream analysis API... but as we have
> discussed on the list, it would be nice to minimize the interface
> between analysis components and indexer/queryparser so you can use an
> *alternative* API... we are working in this direction already.

I think the existing TokenStream API still works, at least in my mind.  

> 
>>> 
>>> Maybe if you give a concrete example then I would have a better
>>> understanding of the problem you think this might solve.
>> 
>> Let me see if I can put some flesh on the bones.  I'm assuming the raw document has already been parsed and that we are still basically dealing with strings and that we have a document which contains one or more fields.
>> 
>> If we step back and look at our analysis process, there are some things that are easy and some things that are hard that maybe shouldn't be, because even though we talk like we are indexing and searching documents, we are really indexing and searching fields and everything is field-centric.  That works fine for the easy analysis things like tokenization, stemming, lowercasing, etc. when all the content is in one language.  It doesn't work well when you have multiple languages in a single document or if you want to do things like Tee/Sink or even something as simple as Solr's copy field semantics.
> 
> Well I have trouble with a few of your examples: "want to use
> Tee/Sink" doesn't work for me... it's a description of an XY problem to
> me... I've never needed to use it, and it's rarely discussed on the
> user list...

Shrugs.  In my experiments, it can really speed things up when analyzing the same content, but with different outcomes, or at least it did back before the new API.  My bigger point is that things like that and the PerFieldAnalyzerWrapper are symptoms of treating documents as second-class citizens.

> 
> As far as working with a lot of languages, I understand this issue
> much more... but I've never much had a desire for this, especially
> given the fact that "Query is a document too"... I'm personally not a
> fan of language detection, and I don't think it belongs in our analysis
> API: like encoding detection and other similar heuristics, it's part of
> document parsing to me!

I didn't say it did; I just said it is an example of the types of things where we pretend we are document-centric, but we are actually field-centric.

> 
> As I said before, I think our TokenStream analysis API is already
> quite complicated, and I don't think we should make it more complicated
> for these reasons (especially since these examples are quite vague and
> I'm still not sure they can't be solved more easily another way).

I never said you couldn't solve them in other ways, but I always find they are kludgy.  For instance, how many times, in a complex environment, must one tokenize the same text over and over again just to get it in the index?

> 
> If you want to use a more complicated analysis API that doesn't work
> like TokenStreams but instead incorporates things that are document
> parsing or whatever, I guess you should be able to do that. I'm not
> sure Lucene should provide such an API, but we shouldn't force you to
> use the TokenStreams API either.

You keep going back to document parsing, even though I have never mentioned it.  All I am proposing/_wanting to discuss_ is the notion that analysis might benefit from a more document-centric view.  You're presupposing I want to change TokenStreams, etc., when all I want to do is take a step back and discuss the bigger picture of how a user actually does analysis in the real world and whether we can make it easier for them.  I don't even have an implementation in mind yet.

For instance, consider the typical copy field scenario where one has two fields containing the same content analyzed in slightly different ways.  In many cases, most of the work is exactly the same (tokenize, lowercase, remove stop words, stem or not), and yet we have to pass the string around twice and do almost all of the same work twice, all so that we can change one little thing on the token.

-Grant


Re: Document aware analyzers was Re: deprecating Versions

Posted by Robert Muir <rc...@gmail.com>.
On Wed, Dec 1, 2010 at 2:25 PM, Grant Ingersoll <gs...@apache.org> wrote:
>
> Nah, I just meant analysis would often benefit from having knowledge of the document as a whole instead of just the individual field.
>

and analysis would suffer from this too, because right now these
things are independent and we have a fast, simple, reusable model.
I'd prefer to keep the TokenStream analysis API... but as we have
discussed on the list, it would be nice to minimize the interface
between analysis components and indexer/queryparser so you can use an
*alternative* API... we are working in this direction already.

>>
>> Maybe if you give a concrete example then I would have a better
>> understanding of the problem you think this might solve.
>
> Let me see if I can put some flesh on the bones.  I'm assuming the raw document has already been parsed and that we are still basically dealing with strings and that we have a document which contains one or more fields.
>
> If we step back and look at our analysis process, there are some things that are easy and some things that are hard that maybe shouldn't be, because even though we talk like we are indexing and searching documents, we are really indexing and searching fields and everything is field-centric.  That works fine for the easy analysis things like tokenization, stemming, lowercasing, etc. when all the content is in one language.  It doesn't work well when you have multiple languages in a single document or if you want to do things like Tee/Sink or even something as simple as Solr's copy field semantics.

Well I have trouble with a few of your examples: "want to use
Tee/Sink" doesn't work for me... it's a description of an XY problem to
me... I've never needed to use it, and it's rarely discussed on the
user list...

As far as working with a lot of languages, I understand this issue
much more... but I've never much had a desire for this, especially
given the fact that "Query is a document too"... I'm personally not a
fan of language detection, and I don't think it belongs in our analysis
API: like encoding detection and other similar heuristics, it's part of
document parsing to me!

As I said before, I think our TokenStream analysis API is already
quite complicated, and I don't think we should make it more complicated
for these reasons (especially since these examples are quite vague and
I'm still not sure they can't be solved more easily another way).

If you want to use a more complicated analysis API that doesn't work
like TokenStreams but instead incorporates things that are document
parsing or whatever, I guess you should be able to do that. I'm not
sure Lucene should provide such an API, but we shouldn't force you to
use the TokenStreams API either.
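
For context, the consumer side of the TokenStream contract we would be keeping is already tiny; roughly the loop below (a sketch only; the term attribute is CharTermAttribute on trunk/3.1 and TermAttribute on older 3.0):

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class ConsumeTokens {
      // This is essentially everything a consumer (indexer, query parser, ...)
      // needs from an analysis chain, which is why the interface between the
      // two can stay so small.
      public static void dump(Analyzer analyzer, String field, String text)
          throws IOException {
        TokenStream ts = analyzer.tokenStream(field, new StringReader(text));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
          System.out.println(term.toString());
        }
        ts.end();
        ts.close();
      }
    }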



Re: Document aware analyzers was Re: deprecating Versions

Posted by Grant Ingersoll <gs...@apache.org>.
On Dec 1, 2010, at 8:07 AM, Robert Muir wrote:

> On Wed, Dec 1, 2010 at 8:01 AM, Grant Ingersoll <gs...@apache.org> wrote:
> 
>> While we are at it, how about we make the analysis process document-aware instead of field-aware?  The PerFieldAnalyzerWrapper, while doing exactly what it says it does, is just silly.  If you had an analysis process that was aware, if it chose to be, of the document as a whole, then you would open up a whole lot more opportunity for doing interesting analysis while losing nothing in the individual treatment of fields.  The TeeSink stuff is an attempt at this, but it is not sufficient.
>> 
> 
> I'm not sure I like this: traditionally we let the user application
> deal with "document parsing" (how do you take your content and define
> it as documents/fields).

Nah, I just meant analysis would often benefit from having knowledge of the document as a whole instead of just the individual field.  

> 
> If we want to change Lucene to start dealing with this "document
> parsing" aspect, that's pretty scary in itself, but in my opinion the
> very last choice of where we would want to add something like that is
> analysis! So personally I really like analysis being separate from
> document parsing: our analysis API is already way too complicated.

Yes, I agree.


> 
> Maybe if you give a concrete example then I would have a better
> understanding of the problem you think this might solve.

Let me see if I can put some flesh on the bones.  I'm assuming the raw document has already been parsed and that we are still basically dealing with strings and that we have a document which contains one or more fields.

If we step back and look at our analysis process, there are some things that are easy and some things that are hard that maybe shouldn't be, because even though we talk like we are indexing and searching documents, we are really indexing and searching fields and everything is field-centric.  That works fine for the easy analysis things like tokenization, stemming, lowercasing, etc. when all the content is in one language.  It doesn't work well when you have multiple languages in a single document or if you want to do things like Tee/Sink or even something as simple as Solr's copy field semantics.  The fact that we have PerFieldAnalyzerWrapper is a symptom of this.  The clunkiness of the TeeSinkTokenFilter is another.  Handling automatic language identification is another.  The end result of all of these things is that you often have to do analysis work twice (or more) for the same piece of content.  I believe an analysis process that knew a document has multiple fields (which seems like a given) could be more efficient, because repeated analysis work could be shared, and because work that inherently crosses multiple fields on the same document, or selects a particular field out of several, could be handled more cleanly.

So, you as the developer would still need to define what your fields are and what analysis you want done for each of those fields, but we, as Lucene developers, might be able to make things more efficient if we can recognize commonalities, etc., as well as offer users tools that make it easy to work across fields.
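
Just to make the shape of the idea visible (completely hypothetical, not a proposal; I made these names up), the hook might be no more than letting the analysis side see all the fields of one document at once, so it could detect the language once, tokenize shared content once, and hand back per-field streams:

    // Hypothetical sketch for discussion only; not an existing or proposed API.
    import java.util.Map;

    import org.apache.lucene.analysis.TokenStream;

    public interface DocumentAwareAnalyzer {
      /**
       * Takes every field of one document (field name -> raw text) and returns
       * a token stream per field.  Shared work (tokenization, language
       * detection, copy-field style duplication) can happen once, inside.
       */
      Map<String, TokenStream> analyzeDocument(Map<String, String> fields);
    }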

At any rate, this is all just food for thought.  I don't have any proposed API changes at this point.


Re: Document aware analyzers was Re: deprecating Versions

Posted by Robert Muir <rc...@gmail.com>.
On Wed, Dec 1, 2010 at 8:01 AM, Grant Ingersoll <gs...@apache.org> wrote:

> While we are at it, how about we make the analysis process document-aware instead of field-aware?  The PerFieldAnalyzerWrapper, while doing exactly what it says it does, is just silly.  If you had an analysis process that was aware, if it chose to be, of the document as a whole, then you would open up a whole lot more opportunity for doing interesting analysis while losing nothing in the individual treatment of fields.  The TeeSink stuff is an attempt at this, but it is not sufficient.
>

I'm not sure I like this: traditionally we let the user application
deal with "document parsing" (how do you take your content and define
it as documents/fields).

If we want to change Lucene to start dealing with this "document
parsing" aspect, that's pretty scary in itself, but in my opinion the
very last choice of where we would want to add something like that is
analysis! So personally I really like analysis being separate from
document parsing: our analysis API is already way too complicated.

Maybe if you give a concrete example then I would have a better
understanding of the problem you think this might solve.
