Posted to solr-user@lucene.apache.org by Dmitry Kan <so...@gmail.com> on 2013/03/05 15:16:59 UTC

Re: access matched token ids in the FacetComponent?

Hello,

I spent some more time on this and used Mikhail's suggestions of which
classes would need to be implemented.

1. Since we use the SpanQuery family, we would need to modify the SpanScorer to
collect some stats over the matched spans.
2. DelegatingCollector receives the Scorer via its setScorer() method, so the
collector will have access to the statistics collected in the SpanScorer
class.
3. This DelegatingCollector class should then be referenced in the
SolrIndexSearcher class. Some getter methods will be needed for accessing
the above statistics.
4. Make use of this modified SolrIndexSearcher in the SimpleFacets class.
5. Access the statistics that are visible in the SimpleFacets class from the
FacetComponent, in the process() method.
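
To make the data flow in steps 1-3 concrete, here is a minimal sketch in plain Java. Every type in it (SpanStatsScorer, DocCollector, SpanStatsCollector, SpanStatsDemo) is a hypothetical stand-in invented for illustration. None of them is the real Lucene/Solr Scorer, Collector or DelegatingCollector, whose signatures differ between versions. The sketch only shows the idea: the collector is handed the scorer once, reads a per-document statistic from it on each collect() call, and exposes the result through a getter.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// All types here are hypothetical stand-ins that only mirror the shape of the
// plan; they are NOT the Lucene/Solr Scorer, Collector or DelegatingCollector.

/** Step 1: a scorer that exposes a statistic for the document it is positioned on. */
interface SpanStatsScorer {
    int matchedSpanCount();
}

/** Minimal collector contract: receive the scorer, then one call per matching doc. */
interface DocCollector {
    void setScorer(SpanStatsScorer scorer);
    void collect(int doc);
}

/** Steps 2-3: wraps another collector, records the per-doc statistic, then delegates. */
final class SpanStatsCollector implements DocCollector {
    private final DocCollector delegate;
    private final Map<Integer, Integer> spanCountByDoc = new LinkedHashMap<>();
    private SpanStatsScorer scorer;

    SpanStatsCollector(DocCollector delegate) {
        this.delegate = delegate;
    }

    @Override
    public void setScorer(SpanStatsScorer scorer) {
        this.scorer = scorer;
        delegate.setScorer(scorer);
    }

    @Override
    public void collect(int doc) {
        // Read the statistic while the scorer is still positioned on this doc.
        spanCountByDoc.put(doc, scorer.matchedSpanCount());
        delegate.collect(doc);
    }

    /** Getter that a facet-side caller (steps 4-5) could read after the search. */
    Map<Integer, Integer> spanCountByDoc() {
        return spanCountByDoc;
    }
}

final class SpanStatsDemo {
    /** Simulates a search that matches docs 3 and 8 with 2 and 5 matched spans. */
    static Map<Integer, Integer> run() {
        int[] docs = {3, 8};
        int[] spans = {2, 5};
        int[] cursor = {0};
        // A do-nothing downstream collector, standing in for the rest of the chain.
        DocCollector sink = new DocCollector() {
            @Override public void setScorer(SpanStatsScorer s) { }
            @Override public void collect(int doc) { }
        };
        SpanStatsCollector collector = new SpanStatsCollector(sink);
        collector.setScorer(() -> spans[cursor[0]]);
        for (int i = 0; i < docs.length; i++) {
            cursor[0] = i;
            collector.collect(docs[i]);
        }
        return collector.spanCountByDoc();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In a real implementation the statistic would have to be read inside collect(), as above, because the scorer moves on to the next document afterwards.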

Does this sound like an accurate list of classes to modify? Am I missing
anything, or are there any roadblocks?
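
As for which stats to collect in step 1: the goal from the start of this thread is the number of sentences per document that contain hits. Below is a rough sketch of that "simple math" in plain Java. It assumes the token positions at which sentences begin are already known for the document. How they would be obtained (payloads, position gaps, a stored field) is left open, and SentenceHitCounter is a hypothetical helper, not a Lucene or Solr class.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper for illustration; not part of Lucene or Solr. */
final class SentenceHitCounter {

    /**
     * @param sentenceStarts sorted token positions at which each sentence begins
     * @param matchPositions token positions of the matched spans in the same document
     * @return the number of distinct sentences that contain at least one match
     */
    static int countSentencesWithHits(int[] sentenceStarts, int[] matchPositions) {
        Set<Integer> hitSentences = new HashSet<>();
        for (int pos : matchPositions) {
            int idx = Arrays.binarySearch(sentenceStarts, pos);
            // When pos is not itself a sentence start, binarySearch returns
            // -(insertionPoint) - 1; the containing sentence is the one just
            // before the insertion point.
            int sentence = idx >= 0 ? idx : -idx - 2;
            if (sentence >= 0) {
                // Positions before the first sentence start are ignored.
                hitSentences.add(sentence);
            }
        }
        return hitSentences.size();
    }

    public static void main(String[] args) {
        int[] sentenceStarts = {0, 7, 15}; // three sentences
        int[] matches = {2, 5, 16};        // two hits in the first, one in the third
        System.out.println(countSentencesWithHits(sentenceStarts, matches));
    }
}
```

Two hits in the same sentence are counted once, which matches the requirement of counting sentences rather than occurrences.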

Dmitry

On Wed, Jan 23, 2013 at 12:47 PM, Dmitry Kan <so...@gmail.com> wrote:

> Thanks Alexandre for correcting the link and Mikhail for sharing the ideas!
>
> Mikhail,
>
> I will need to look closer at your customization of SpansFacetComponent in
> the blog post.
> Is it the case that this component accesses and counts the
> matched spans?
>
> Thanks,
>
> Dmitry
>
>
> On Tue, Jan 22, 2013 at 9:17 PM, Mikhail Khludnev <
> mkhludnev@griddynamics.com> wrote:
>
>> Dmitry,
>>
>> Solr faceting is really fast because it works in memory (with a few
>> notable exceptions), hence spans should be slower. Reading term
>> positions/payloads always adds a noticeable cost. You can estimate it by
>> comparing the time for a phrase query "foo bar" with that of a plain
>> conjunction +foo +bar.
>> It is worth mentioning that our SpansFacetComponent performed well enough,
>> even for a public site. You can find my comment about the performance numbers:
>> "64K docs with 5-20 span positions each. Search result length 100-2000
>> docs with 3-5 facet fields. It shows 100 q/sec on an average datacenter
>> box."
>>
>>
>> On Mon, Jan 21, 2013 at 5:23 PM, Dmitry Kan <so...@gmail.com> wrote:
>>
>> > Mikhail,
>> >
>> > Thanks for the guidance! This indeed sounds challenging, esp. given the
>> > bonus of fighting with Solr 3.x over disjunction queries. Although,
>> > moving to Solr 4.0 should be OK if it makes life easier.
>> >
>> > But even before getting one's hands dirty, it would be good to know
>> > whether this is going to fly performance-wise. Has your span-based
>> > implementation been fast enough? Did it come close to Solr's native
>> > faceting in terms of performance?
>> >
>> > On Mon, Jan 21, 2013 at 2:33 PM, Mikhail Khludnev <
>> > mkhludnev@griddynamics.com> wrote:
>> >
>> > > Dmitry,
>> > >
>> > > First of all, FacetComponent is Solr's out-of-the-box functionality. It
>> > > runs after the search is done and accesses the bitSet of the found
>> > > documents, i.e. there are no spans (matched term positions) there at all.
>> > >
>> > > StandardFacetsAccumulator sounds like the "brand new" Lucene faceting
>> > > library; see http://shaierera.blogspot.com/. I don't think spans are
>> > > accessible there either, but I don't know for sure.
>> > >
>> > > Some time ago my team successfully prototyped a facet component backed by
>> > > spans:
>> > > blog.griddynamics.com/2011/10/solr-experience-search-parent-child.html
>> > > but I don't suggest you go this way.
>> > > I can suggest you start from the following:
>> > > - supply a PostFilter/DelegatingCollector
>> > > http://yonik.com/posts/advanced-filter-caching-in-solr/
>> > > - the DelegatingCollector will accept the scorer instance
>> > > - if this scorer is BooleanScorer2 (but not BooleanScorer!), you can access
>> > > the SpanQueryScorer in one of the legs and try to access the matched spans
>> > > - if you are on 3.x you'll have a problem with disjunction queries.
>> > >
>> > > It seems challenging, doesn't it?
>> > >
>> > > On 18.01.2013 at 17:40, "Dmitry Kan" <so...@gmail.com> wrote:
>> > >
>> > > > Mikhail,
>> > > >
>> > > > Are you saying that it is not possible to access the matched term
>> > > > positions in the FacetComponent? If that were possible (somewhere in the
>> > > > StandardFacetsAccumulator class, where docids are available), then by
>> > > > knowing the matched term positions I could do some simple school math to
>> > > > calculate the sentence counts per doc id.
>> > > >
>> > > > Dmitry
>> > > >
>> > > > On Fri, Jan 18, 2013 at 2:45 PM, Mikhail Khludnev <
>> > > > mkhludnev@griddynamics.com> wrote:
>> > > >
>> > > > > Dmitry,
>> > > > >
>> > > > > It definitely seems like postprocessing the highlighter's output.
>> > > > > Another approach is:
>> > > > > - limit the number of occurrences of a word in a sentence to 1
>> > > > > - play with the facet-by-function patch
>> > > > > https://issues.apache.org/jira/browse/SOLR-1581 accompanied by the tf()
>> > > > > function.
>> > > > >
>> > > > > It doesn't seem like much help.
>> > > > >
>> > > > > On Fri, Jan 18, 2013 at 12:42 PM, Dmitry Kan <solrexpert@gmail.com> wrote:
>> > > > >
>> > > > > > that we actually require the count of the sentences inside
>> > > > > > each document where the hits were found.
>> > > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > > --
>> > > > > Sincerely yours
>> > > > > Mikhail Khludnev
>> > > > > Principal Engineer,
>> > > > > Grid Dynamics
>> > > > >
>> > > > > <http://www.griddynamics.com>
>> > > > >  <mk...@griddynamics.com>
>> > > > >
>> > > >
>> > >
>> >
>>
>
>

Re: access matched token ids in the FacetComponent?

Posted by Dmitry Kan <so...@gmail.com>.

Thanks Mikhail.

On Tue, Mar 5, 2013 at 8:23 PM, Mikhail Khludnev <mkhludnev@griddynamics.com
> wrote:

> Something like this.

Re: access matched token ids in the FacetComponent?

Posted by Mikhail Khludnev <mk...@griddynamics.com>.

Something like this.


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>