Posted to solr-user@lucene.apache.org by davidphilip cherian <da...@gmail.com> on 2016/04/05 17:35:34 UTC

Contrib module for Document Clustering

Hi,

Is there any contribution (open source contrib module) that routes documents
to shards based on a document similarity technique? Or any suggestions for
integrating Mahout with Solr for this use case?

From what I know, there are currently two document routing strategies, as
explained here:
https://lucidworks.com/blog/2013/06/13/solr-cloud-document-routing/
Is there anything else that I'm missing?




Thanks.

Re: Contrib module for Document Clustering

Posted by Joel Bernstein <jo...@gmail.com>.
My gut instinct is that it's a hard path you're considering. There are the
logistics of sharding by document similarity on both the indexing side and
the query side. Even if you pull that off, it would be extremely difficult
to know whether you're getting good results, and really hard to fix if
you're not.

I would check the search performance you're getting on each shard. It may
very well be that you just need to speed up the searches on the shards
themselves, rather than trying to limit the search to a subset of shards.
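
As a rough sketch of how such a per-shard check could be set up (an
assumption, not something spelled out in this thread): each shard's core can
be queried directly with distrib=false, and the QTime values in the response
headers compared. The host names, core names, and helper names below are
hypothetical placeholders.

```python
from urllib.parse import urlencode

# Hypothetical core URLs, one replica per shard; replace with real ones.
SHARD_CORES = {
    "shard1": "http://host1:8983/solr/web_shard1_replica1",
    "shard2": "http://host2:8983/solr/web_shard2_replica1",
}


def per_shard_urls(query, rows=10):
    """Build one non-distributed select URL per shard core."""
    params = urlencode(
        {"q": query, "rows": rows, "wt": "json", "distrib": "false"}
    )
    return {
        shard: f"{core}/select?{params}"
        for shard, core in SHARD_CORES.items()
    }


def qtime_ms(response):
    """Return Solr's reported query time (ms) from a parsed JSON response."""
    return response["responseHeader"]["QTime"]
```

Fetching each URL and comparing qtime_ms across shards would show whether
one shard is noticeably slower than the others.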



Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Apr 7, 2016 at 12:10 AM, davidphilip cherian <
davidphilipcherian@gmail.com> wrote:

> Hi Joel,
>
> Right now, we are (web) crawling almost 85 million documents, and this
> could double. The collection is simply divided into shards, so every
> search runs across all shards.
> If a system could distribute documents into shards based on document
> similarity and, at search time, analyze the query and search only the
> matching shards, it could improve search performance and reduce
> resource utilization as well. Let me know your thoughts. Use case: since
> this is web-search-style data, both false positives and false negatives
> should be acceptable to an extent.

Re: Contrib module for Document Clustering

Posted by davidphilip cherian <da...@gmail.com>.
Hi Joel,

Right now, we are (web) crawling almost 85 million documents, and this
could double. The collection is simply divided into shards, so every
search runs across all shards.
If a system could distribute documents into shards based on document
similarity and, at search time, analyze the query and search only the
matching shards, it could improve search performance and reduce
resource utilization as well. Let me know your thoughts. Use case: since
this is web-search-style data, both false positives and false negatives
should be acceptable to an extent.
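
The idea above can be illustrated with a minimal sketch. This is not an
existing Solr contrib or API: the shard names, the centroid texts, and all
function names are hypothetical, and a plain term-frequency cosine similarity
stands in for whatever clustering (Mahout or otherwise) would actually be
used. The only Solr feature assumed is the `shards` request parameter, which
restricts a distributed query to the listed shards.

```python
import math
from collections import Counter


def tf_vector(text):
    """Lowercased term-frequency vector for a piece of text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical shard centroids; in practice these would come from an
# offline clustering step over a sample of the crawl.
CENTROIDS = {
    "shard1": tf_vector("football match goal league team score"),
    "shard2": tf_vector("stock market shares price investor bank"),
    "shard3": tf_vector("python java code compiler software bug"),
}


def rank_shards(text):
    """Shard names ordered from most to least similar to the text."""
    vec = tf_vector(text)
    return sorted(
        CENTROIDS, key=lambda s: cosine(vec, CENTROIDS[s]), reverse=True
    )


def route_document(text):
    """Index time: pick the single most similar shard."""
    return rank_shards(text)[0]


def shards_param(query, top_k=2):
    """Query time: value for the `shards` parameter, limited to top_k shards."""
    return ",".join(rank_shards(query)[:top_k])
```

A document about a league match would be routed to shard1 here, and a query
such as "java compiler bug" would be sent to shard3 first. Documents that fit
no centroid well still land somewhere, which is where the false negatives
mentioned above would come from.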



On Wed, Apr 6, 2016 at 11:18 PM, Joel Bernstein <jo...@gmail.com> wrote:

> I don't know of any contrib or module that does this. Can you describe why
> you'd want to route documents to shards based on similarity? What
> advantages would you get by using this approach?
>
> Joel Bernstein
> http://joelsolr.blogspot.com/

Re: Contrib module for Document Clustering

Posted by Joel Bernstein <jo...@gmail.com>.
I don't know of any contrib or module that does this. Can you describe why
you'd want to route documents to shards based on similarity? What
advantages would you get by using this approach?

Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Apr 6, 2016 at 1:36 PM, davidphilip cherian <
davidphilipcherian@gmail.com> wrote:

> Any thoughts?

Re: Contrib module for Document Clustering

Posted by davidphilip cherian <da...@gmail.com>.
Any thoughts?


On Tue, Apr 5, 2016 at 9:05 PM, davidphilip cherian <
davidphilipcherian@gmail.com> wrote:

> Hi,
>
> Is there any contribution (open source contrib module) that routes
> documents to shards based on a document similarity technique? Or any
> suggestions for integrating Mahout with Solr for this use case?
>
> From what I know, there are currently two document routing strategies, as
> explained here:
> https://lucidworks.com/blog/2013/06/13/solr-cloud-document-routing/
> Is there anything else that I'm missing?
>
>
>
>
> Thanks.
>
>
>