Posted to solr-user@lucene.apache.org by Zheng Lin Edwin Yeo <ed...@gmail.com> on 2015/06/04 16:47:14 UTC

List all Collections together with number of records

Hi,

I would like to check: can we use the Collections API, or any other
method, to list all the collections in the cluster together with the number
of records in each collection, in a single output?

Currently, I only know of the List Collections call,
/admin/collections?action=LIST. However, this only lists the names of the
collections in the cluster, not the number of records.
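
For reference, the LIST call mentioned above could be wrapped roughly as follows. This is only a sketch: the function names are made up for illustration, and the base URL passed in (for example http://localhost:8983/solr) and the presence of a "collections" array in the JSON response are assumptions about a typical SolrCloud setup.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def list_collections_url(solr_base_url):
    """Build the Collections API LIST URL, asking for a JSON response."""
    params = urlencode({"action": "LIST", "wt": "json"})
    return f"{solr_base_url}/admin/collections?{params}"


def parse_collection_names(body):
    """Pull the collection names out of an already-parsed LIST response."""
    return list(body.get("collections", []))


def list_collections(solr_base_url):
    """Fetch the collection names from a running cluster (needs network access)."""
    with urlopen(list_collections_url(solr_base_url), timeout=10) as resp:
        return parse_collection_names(json.load(resp))
```

As noted, this returns names only; the record counts still have to come from a query against each collection.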

Is there a way to show the number of records in each of the collections as
well?

Regards,
Edwin

Re: List all Collections together with number of records

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
We're thinking of writing a custom request handler to do that, although the
handler would still query every collection at the back end.

Would this give the user a faster response?

Regards,
Edwin



Re: List all Collections together with number of records

Posted by Erick Erickson <er...@gmail.com>.
bq: we still need that information to be stored in a separate collection
for security reasons.

Not necessarily. I've seen lots of installations where "auth tokens" are
embedded in the document (say groups that can see this doc). Then
the front-end simply attaches &fq=auth_field:(groups each user belongs to)
to every query to restrict access.
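
A rough sketch of how a front end might build that filter is shown here. The helper names are hypothetical, "auth_field" is just the placeholder field name used above, and the quoting rules are a simplifying assumption rather than a complete escaping scheme.

```python
from urllib.parse import urlencode


def build_auth_filter(auth_field, groups):
    """Return an fq clause such as auth_field:("sales" OR "hr")."""
    if not groups:
        # An empty clause would not be a valid filter; the caller has to
        # decide what a user with no groups is allowed to see.
        raise ValueError("user must belong to at least one group")
    quoted = []
    for group in groups:
        # Escape backslashes and double quotes inside the phrase.
        escaped = group.replace("\\", "\\\\").replace('"', '\\"')
        quoted.append(f'"{escaped}"')
    return f"{auth_field}:({' OR '.join(quoted)})"


def build_select_params(user_query, auth_field, groups):
    """Combine the user's query with the access filter."""
    return [
        ("q", user_query),
        ("fq", build_auth_filter(auth_field, groups)),
        ("wt", "json"),
    ]


# Example: the query string a front end would append to /select?
print(urlencode(build_select_params("title:solr", "auth_field", ["sales", "hr"])))
```

The important part is that the filter is attached by the server-side front end, never supplied by the end user.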

That said, some organizations aren't comfortable with this and demand
separate collections, in which case you're stuck.

You've defined an architecture though, and one of the consequences
of that is if you have many collections, you'll have to fire off many
queries (perhaps in parallel, but still). There's no magic to get around
that. And it really doesn't matter, because in what you've described
what has to happen is one query has to be fired to each collection.
It doesn't matter whether Solr does that for you or you spawn a bunch
of threads on the client, the same work has to happen somewhere.

You also have to figure out how to present the results to the user.
If it's a simple count you're OK, but scores will _not_ be comparable
across the various collections, so the presentation will be challenging.

Best,
Erick


Re: List all Collections together with number of records

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
The reason we want to have different collections is that each of the
collections has different fields, and some collections will contain
information that is more sensitive than others.

As such, we may need to restrict access to certain collections for some
users. Although the restriction will be done on the front-end client side,
we still need that information to be stored in a separate collection
for security reasons.

Regards,
Edwin



Re: List all Collections together with number of records

Posted by Erick Erickson <er...@gmail.com>.
bq: Yup, this information will need to be collected each time the user searches
for a query, as we want to show the number of records that match the
search query in each of the collections.

You're looking at something akin to "federated search". About all you can
do is send out parallel queries to each collection.
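
To make the "parallel queries" idea concrete, here is a minimal client-side sketch using a thread pool. Everything in it is illustrative: count_in_parallel and count_one are hypothetical names, and count_one stands for whatever function sends the per-collection query and returns its count.

```python
from concurrent.futures import ThreadPoolExecutor


def count_in_parallel(collections, count_one, max_workers=8):
    """Call count_one(name) for every collection concurrently.

    Returns a dict mapping collection name to the value count_one returned.
    An exception raised by any single call propagates to the caller.
    """
    names = list(collections)
    if not names:
        return {}
    workers = min(max_workers, len(names))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input names.
        counts = list(pool.map(count_one, names))
    return dict(zip(names, counts))


# Example with a stand-in counter instead of a real Solr call.
fake_counts = {"books": 120, "articles": 45}
print(count_in_parallel(fake_counts, fake_counts.__getitem__))
```

The total work is unchanged; the wall-clock time is roughly bounded by the slowest collection (given enough workers) rather than the sum of all of them.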

This is an "interesting" requirement, and I really question whether it's a
wise thing to insist on. I'd really think about going back to the design.
For instance, could you consolidate all these collections into a single one,
with perhaps a collection_id? Then the problem is relatively simple, use
field collapsing (aka "grouping").
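
As a sketch of what the consolidated design could look like from the client: one grouped query against the single collection, then a count read off each group. The helper names are hypothetical, "collection_id" is the field suggested above, and the response layout used here (grouped -> field -> groups -> doclist.numFound) is an assumption about the default grouped JSON format.

```python
def grouping_params(user_query, group_field="collection_id", max_groups=1000):
    """Request parameters for a query grouped by the given field."""
    return {
        "q": user_query,
        "group": "true",
        "group.field": group_field,
        # With grouping on, rows limits the number of groups returned.
        "rows": str(max_groups),
        "wt": "json",
    }


def counts_per_group(body, group_field="collection_id"):
    """Map each group value to the number of matching documents in it."""
    groups = body["grouped"][group_field]["groups"]
    return {g["groupValue"]: g["doclist"]["numFound"] for g in groups}


# Example with a hand-written response in the assumed layout.
sample = {
    "grouped": {
        "collection_id": {
            "matches": 7,
            "groups": [
                {"groupValue": "books", "doclist": {"numFound": 5, "start": 0, "docs": []}},
                {"groupValue": "articles", "doclist": {"numFound": 2, "start": 0, "docs": []}},
            ],
        }
    }
}
print(counts_per_group(sample))
```

Groups with no matching documents do not appear in the result, so a front end would need to fill in zero for those.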

Best,
Erick


Re: List all Collections together with number of records

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Yup, this information will need to be collected each time the user searches
for a query, as we want to show the number of records that match the
search query in each of the collections.

Currently I only have 6 collections, but it could increase to hundreds of
collections in the future. So I'm worried that it could slow down the
system a lot if we have to send hundreds of queries for each search request.

Regards,
Edwin


On 5 June 2015 at 21:00, Upayavira <uv...@odoko.co.uk> wrote:

> I'm not so sure this is as bad as it sounds. When your collection is
> sharded, no single node knows about the documents in other shards/nodes,
> so to find the total number, a query will need to go to every node.
>
> Trying to work out something to do a single request to every node,
> combine their collection statistics and aggregate them into a single
> result sounds very complicated, and likely overkill.
>
> Are you needing to collect this information often? Do you have a lot of
> collections?
>
> Upayavira
>
>
> On Fri, Jun 5, 2015, at 06:29 AM, Zheng Lin Edwin Yeo wrote:
> > I'm trying to write a SolrJ program in Java to read and consolidate all
> > the
> > information into a JSON file, The client will just need to call this
> > SolrJ
> > program and read this JSON file to get the details. But the problem is we
> > are still querying the Solr once for each collection, just that this time
> > it is done in the SolrJ program in a for-loop, while previously it's done
> > on the client side. Not sure will this lead to performance improvement?
> >
> > For your suggestion on spawning a bunch of threads, does it mean the same
> > thing as I did?
> >
> > Regards,
> > Edwin
> >
> >
> > On 5 June 2015 at 12:03, Erick Erickson <er...@gmail.com> wrote:
> >
> > > Have you considered spawning a bunch of threads, one per collection
> > > and having them all run in parallel?
> > >
> > > Best,
> > > Erick
> > >
> > > On Thu, Jun 4, 2015 at 4:52 PM, Zheng Lin Edwin Yeo
> > > <ed...@gmail.com> wrote:
> > > > The reason we wanted to do a single call is to improve on the
> > > > performance, as our application requires to list the total number of
> > > > records in each of the collections, and the number of records that
> > > > matches the query each of the collections.
> > > >
> > > > Currently we are querying each collection one by one to retrieve the
> > > > numFound value and display them, but this can slow down the system
> > > > significantly when the number of collection grows. So we are thinking
> > > > of ways to improve the speed in this area.
> > > >
> > > > Any other methods which you can suggest that we can do to overcome
> > > > this speed problem?
> > > >
> > > > Regards,
> > > > Edwin
> > > > On 5 Jun 2015 00:16, "Erick Erickson" <er...@gmail.com> wrote:
> > > >
> > > >> Not in a single call that I know of. These are really orthogonal
> > > >> concepts. Getting the cluster status merely involves reading the
> > > >> Zookeeper clusterstate whereas getting the total number of docs for
> > > >> each would involve querying each collection, i.e. going to the Solr
> > > >> nodes themselves. I'd guess it's unlikely to be combined.
> > > >>
> > > >> Best,
> > > >> Erick
> > > >>
> > > >> On Thu, Jun 4, 2015 at 7:47 AM, Zheng Lin Edwin Yeo
> > > >> <ed...@gmail.com> wrote:
> > > >> > Hi,
> > > >> >
> > > >> > Would like to check, are we able to use the Collection API or any
> > > >> > other method to list all the collections in the cluster together
> > > >> > with the number of records in each of the collections in one
> > > >> > output?
> > > >> >
> > > >> > Currently, I only know of the List Collections
> > > >> > /admin/collections?action=LIST. However, this only list the names
> > > >> > of the collections that are in the cluster, but not the number of
> > > >> > records.
> > > >> >
> > > >> > Is there a way to show the number of records in each of the
> > > >> > collections as well?
> > > >> >
> > > >> > Regards,
> > > >> > Edwin
> > > >>
> > >
>

Re: List all Collections together with number of records

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
The query for *:* with rows=0 is only for the initial startup. When there's
a search query and filter, these need to be added to the query, as we
want to display the number of records in each of the collections
that match the query and filter.

Regards,
Edwin


On 5 June 2015 at 21:14, Shawn Heisey <ap...@elyograg.org> wrote:

> On 6/5/2015 7:00 AM, Upayavira wrote:
> > I'm not so sure this is as bad as it sounds. When your collection is
> > sharded, no single node knows about the documents in other shards/nodes,
> > so to find the total number, a query will need to go to every node.
> >
> > Trying to work out something to do a single request to every node,
> > combine their collection statistics and aggregate them into a single
> > result sounds very complicated, and likely overkill.
> >
> > Are you needing to collect this information often? Do you have a lot of
> > collections?
>
> A query for *:* with rows=0 is quite fast on any Solr version, unless
> RAM is too tight.  If your commits are infrequent, subsequent queries
> for that information will be even faster because they will be served from
> Solr caches.
>
> There's no reason to have user code talk to all the shards and aggregate
> the document count for the collection -- let SolrCloud handle it and
> just query the collection with q=*:*&rows=0.  The numFound value in the
> response will cover the entire collection, and Solr will optimize the
> query as much as it possibly can be optimized.
>
> Thanks,
> Shawn
>
>

Re: List all Collections together with number of records

Posted by Shawn Heisey <ap...@elyograg.org>.
On 6/5/2015 7:00 AM, Upayavira wrote:
> I'm not so sure this is as bad as it sounds. When your collection is
> sharded, no single node knows about the documents in other shards/nodes,
> so to find the total number, a query will need to go to every node.
> 
> Trying to work out something to do a single request to every node,
> combine their collection statistics and aggregate them into a single
> result sounds very complicated, and likely overkill.
> 
> Are you needing to collect this information often? Do you have a lot of
> collections?

A query for *:* with rows=0 is quite fast on any Solr version, unless
RAM is too tight.  If your commits are infrequent, subsequent queries
for that information will be even faster because they will be served from
Solr caches.

There's no reason to have user code talk to all the shards and aggregate
the document count for the collection -- let SolrCloud handle it and
just query the collection with q=*:*&rows=0.  The numFound value in the
response will cover the entire collection, and Solr will optimize the
query as much as it possibly can be optimized.
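
As a rough illustration of the approach above, the sketch below sends q=*:*&rows=0 to a collection and reads numFound from the JSON response. It is Python using only the standard library, not SolrJ. The function names, the injectable fetch argument and the example URL are assumptions made for this sketch.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def http_get(url):
    """Fetch a URL and return the body as text."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8")


def count_docs(base_url, collection, fetch=http_get):
    """Return numFound for the whole collection.

    The request goes to the collection, not to individual shards;
    SolrCloud distributes the query and sums the per-shard counts.
    """
    params = urlencode({"q": "*:*", "rows": 0, "wt": "json"})
    body = fetch(f"{base_url}/{collection}/select?{params}")
    return json.loads(body)["response"]["numFound"]


# Example call (needs a running Solr; URL and name are placeholders):
# print(count_docs("http://localhost:8983/solr", "collection1"))
```

The fetch parameter exists only so the function can be exercised without a live Solr instance.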

Thanks,
Shawn


Re: List all Collections together with number of records

Posted by Upayavira <uv...@odoko.co.uk>.
I'm not so sure this is as bad as it sounds. When your collection is
sharded, no single node knows about the documents in other shards/nodes,
so to find the total number, a query will need to go to every node.

Trying to work out something to do a single request to every node,
combine their collection statistics and aggregate them into a single
result sounds very complicated, and likely overkill.

Are you needing to collect this information often? Do you have a lot of
collections?

Upayavira


On Fri, Jun 5, 2015, at 06:29 AM, Zheng Lin Edwin Yeo wrote:
> I'm trying to write a SolrJ program in Java to read and consolidate all the
> information into a JSON file. The client will just need to call this SolrJ
> program and read this JSON file to get the details. But the problem is we
> are still querying Solr once for each collection, just that this time
> it is done in the SolrJ program in a for-loop, while previously it was done
> on the client side. I'm not sure whether this will lead to a performance
> improvement.
> 
> For your suggestion on spawning a bunch of threads, does it mean the same
> thing as I did?
> 
> Regards,
> Edwin
> 
> 
> On 5 June 2015 at 12:03, Erick Erickson <er...@gmail.com> wrote:
> 
> > Have you considered spawning a bunch of threads, one per collection
> > and having them all run in parallel?
> >
> > Best,
> > Erick
> >
> > On Thu, Jun 4, 2015 at 4:52 PM, Zheng Lin Edwin Yeo
> > <ed...@gmail.com> wrote:
> > > The reason we wanted to do a single call is to improve performance,
> > > as our application needs to list the total number of records in each of
> > > the collections, and the number of records that match the query in each
> > > of the collections.
> > >
> > > Currently we are querying each collection one by one to retrieve the
> > > numFound value and display it, but this can slow down the system
> > > significantly as the number of collections grows. So we are thinking of
> > > ways to improve the speed in this area.
> > >
> > > Are there any other methods you can suggest to overcome this
> > > speed problem?
> > >
> > > Regards,
> > > Edwin
> > > On 5 Jun 2015 00:16, "Erick Erickson" <er...@gmail.com> wrote:
> > >
> > >> Not in a single call that I know of. These are really orthogonal
> > >> concepts. Getting the cluster status merely involves reading the
> > >> Zookeeper clusterstate whereas getting the total number of docs for
> > >> each would involve querying each collection, i.e. going to the Solr
> > >> nodes themselves. I'd guess it's unlikely to be combined.
> > >>
> > >> Best,
> > >> Erick
> > >>
> > >> On Thu, Jun 4, 2015 at 7:47 AM, Zheng Lin Edwin Yeo
> > >> <ed...@gmail.com> wrote:
> > >> > Hi,
> > >> >
> > > >> > Would like to check, are we able to use the Collection API or any other
> > > >> > method to list all the collections in the cluster together with the number
> > > >> > of records in each of the collections in one output?
> > > >> >
> > > >> > Currently, I only know of the List Collections
> > > >> > /admin/collections?action=LIST. However, this only lists the names of the
> > > >> > collections that are in the cluster, but not the number of records.
> > > >> >
> > > >> > Is there a way to show the number of records in each of the collections as
> > > >> > well?
> > >> >
> > >> > Regards,
> > >> > Edwin
> > >>
> >

Re: List all Collections together with number of records

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
I'm trying to write a SolrJ program in Java to read and consolidate all the
information into a JSON file. The client will just need to call this SolrJ
program and read this JSON file to get the details. But the problem is we
are still querying Solr once for each collection, just that this time
it is done in the SolrJ program in a for-loop, while previously it was done
on the client side. I'm not sure whether this will lead to a performance
improvement.
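
The consolidation step described above could look roughly like the sketch below. It is Python rather than SolrJ, and it covers only the "write the per-collection counts to a JSON file" part; the function name write_counts and the file layout (one key per collection) are assumptions, not something specified in this thread.

```python
import json
import os
import tempfile


def write_counts(path, counts):
    """Write {collection: numFound} to path as JSON.

    The data goes to a temporary file in the same directory first and is
    then renamed over the target, so a client reading the file should not
    see a half-written document.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as out:
        json.dump(counts, out, indent=2, sort_keys=True)
    os.replace(tmp, path)


# Example with made-up numbers:
# write_counts("collection_counts.json", {"collection1": 1200, "collection2": 87})
```

Note that this only moves the work: the program producing the file still issues one query per collection, so any gain comes from how often the file is refreshed compared with how often clients read it.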

For your suggestion on spawning a bunch of threads, does it mean the same
thing as I did?

Regards,
Edwin


On 5 June 2015 at 12:03, Erick Erickson <er...@gmail.com> wrote:

> Have you considered spawning a bunch of threads, one per collection
> and having them all run in parallel?
>
> Best,
> Erick
>
> On Thu, Jun 4, 2015 at 4:52 PM, Zheng Lin Edwin Yeo
> <ed...@gmail.com> wrote:
> > The reason we wanted to do a single call is to improve performance,
> > as our application needs to list the total number of records in each of
> > the collections, and the number of records that match the query in each
> > of the collections.
> >
> > Currently we are querying each collection one by one to retrieve the
> > numFound value and display it, but this can slow down the system
> > significantly as the number of collections grows. So we are thinking of
> > ways to improve the speed in this area.
> >
> > Are there any other methods you can suggest to overcome this
> > speed problem?
> >
> > Regards,
> > Edwin
> > On 5 Jun 2015 00:16, "Erick Erickson" <er...@gmail.com> wrote:
> >
> >> Not in a single call that I know of. These are really orthogonal
> >> concepts. Getting the cluster status merely involves reading the
> >> Zookeeper clusterstate whereas getting the total number of docs for
> >> each would involve querying each collection, i.e. going to the Solr
> >> nodes themselves. I'd guess it's unlikely to be combined.
> >>
> >> Best,
> >> Erick
> >>
> >> On Thu, Jun 4, 2015 at 7:47 AM, Zheng Lin Edwin Yeo
> >> <ed...@gmail.com> wrote:
> >> > Hi,
> >> >
> >> > Would like to check, are we able to use the Collection API or any other
> >> > method to list all the collections in the cluster together with the number
> >> > of records in each of the collections in one output?
> >> >
> >> > Currently, I only know of the List Collections
> >> > /admin/collections?action=LIST. However, this only lists the names of the
> >> > collections that are in the cluster, but not the number of records.
> >> >
> >> > Is there a way to show the number of records in each of the collections as
> >> > well?
> >> >
> >> > Regards,
> >> > Edwin
> >>
>

Re: List all Collections together with number of records

Posted by Erick Erickson <er...@gmail.com>.
Have you considered spawning a bunch of threads, one per collection
and having them all run in parallel?
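
A minimal sketch of that idea in Python (not SolrJ), assuming a caller-supplied count_one(name) function that queries a single collection and returns its numFound. The names count_all and count_one and the max_workers cap are illustrative choices, not part of any Solr API.

```python
from concurrent.futures import ThreadPoolExecutor


def count_all(collections, count_one, max_workers=8):
    """Run count_one(name) for every collection in parallel threads."""
    names = list(collections)
    if not names:
        return {}
    # One thread per collection, capped so that a large cluster does not
    # trigger an unbounded number of simultaneous requests.
    workers = min(max_workers, len(names))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so zip pairs them correctly.
        return dict(zip(names, pool.map(count_one, names)))
```

With this shape the total wait is closer to the slowest single query than to the sum of all of them, although Solr still receives the same number of requests.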

Best,
Erick

On Thu, Jun 4, 2015 at 4:52 PM, Zheng Lin Edwin Yeo
<ed...@gmail.com> wrote:
> The reason we wanted to do a single call is to improve performance,
> as our application needs to list the total number of records in each of
> the collections, and the number of records that match the query in each
> of the collections.
>
> Currently we are querying each collection one by one to retrieve the
> numFound value and display it, but this can slow down the system
> significantly as the number of collections grows. So we are thinking of
> ways to improve the speed in this area.
>
> Are there any other methods you can suggest to overcome this
> speed problem?
>
> Regards,
> Edwin
> On 5 Jun 2015 00:16, "Erick Erickson" <er...@gmail.com> wrote:
>
>> Not in a single call that I know of. These are really orthogonal
>> concepts. Getting the cluster status merely involves reading the
>> Zookeeper clusterstate whereas getting the total number of docs for
>> each would involve querying each collection, i.e. going to the Solr
>> nodes themselves. I'd guess it's unlikely to be combined.
>>
>> Best,
>> Erick
>>
>> On Thu, Jun 4, 2015 at 7:47 AM, Zheng Lin Edwin Yeo
>> <ed...@gmail.com> wrote:
>> > Hi,
>> >
>> > Would like to check, are we able to use the Collection API or any other
>> > method to list all the collections in the cluster together with the number
>> > of records in each of the collections in one output?
>> >
>> > Currently, I only know of the List Collections
>> > /admin/collections?action=LIST. However, this only lists the names of the
>> > collections that are in the cluster, but not the number of records.
>> >
>> > Is there a way to show the number of records in each of the collections as
>> > well?
>> >
>> > Regards,
>> > Edwin
>>

Re: List all Collections together with number of records

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
The reason we wanted to do a single call is to improve performance,
as our application needs to list the total number of records in each of
the collections, and the number of records that match the query in each
of the collections.

Currently we are querying each collection one by one to retrieve the
numFound value and display it, but this can slow down the system
significantly as the number of collections grows. So we are thinking of
ways to improve the speed in this area.

Are there any other methods you can suggest to overcome this
speed problem?

Regards,
Edwin
On 5 Jun 2015 00:16, "Erick Erickson" <er...@gmail.com> wrote:

> Not in a single call that I know of. These are really orthogonal
> concepts. Getting the cluster status merely involves reading the
> Zookeeper clusterstate whereas getting the total number of docs for
> each would involve querying each collection, i.e. going to the Solr
> nodes themselves. I'd guess it's unlikely to be combined.
>
> Best,
> Erick
>
> On Thu, Jun 4, 2015 at 7:47 AM, Zheng Lin Edwin Yeo
> <ed...@gmail.com> wrote:
> > Hi,
> >
> > Would like to check, are we able to use the Collection API or any other
> > method to list all the collections in the cluster together with the number
> > of records in each of the collections in one output?
> >
> > Currently, I only know of the List Collections
> > /admin/collections?action=LIST. However, this only lists the names of the
> > collections that are in the cluster, but not the number of records.
> >
> > Is there a way to show the number of records in each of the collections as
> > well?
> >
> > Regards,
> > Edwin
>

Re: List all Collections together with number of records

Posted by Erick Erickson <er...@gmail.com>.
Not in a single call that I know of. These are really orthogonal
concepts. Getting the cluster status merely involves reading the
Zookeeper clusterstate whereas getting the total number of docs for
each would involve querying each collection, i.e. going to the Solr
nodes themselves. I'd guess it's unlikely to be combined.
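
Since the two pieces of information come from different places, a client has to combine them itself. The sketch below is one way to do that in Python with the standard library: one LIST call for the names, then one rows=0 query per collection. The function names, the injectable fetch argument and the base URL in the example are assumptions for illustration. The sketch also assumes the JSON form of the LIST response carries the names under a "collections" key.

```python
import json
from urllib.request import urlopen


def http_get(url):
    """Fetch a URL and return the body as text."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8")


def list_collections(base_url, fetch=http_get):
    """Collection names from the Collections API (cluster state only)."""
    body = fetch(f"{base_url}/admin/collections?action=LIST&wt=json")
    return json.loads(body)["collections"]


def collection_counts(base_url, fetch=http_get):
    """Map each collection name to its numFound, one query per collection."""
    counts = {}
    for name in list_collections(base_url, fetch):
        body = fetch(f"{base_url}/{name}/select?q=*:*&rows=0&wt=json")
        counts[name] = json.loads(body)["response"]["numFound"]
    return counts


# Example call (needs a running SolrCloud; the URL is a placeholder):
# print(collection_counts("http://localhost:8983/solr"))
```

The loop is sequential here to keep the two steps visible; it still costs one request per collection, which is the limitation described above.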

Best,
Erick

On Thu, Jun 4, 2015 at 7:47 AM, Zheng Lin Edwin Yeo
<ed...@gmail.com> wrote:
> Hi,
>
> Would like to check, are we able to use the Collection API or any other
> method to list all the collections in the cluster together with the number
> of records in each of the collections in one output?
>
> Currently, I only know of the List Collections
> /admin/collections?action=LIST. However, this only lists the names of the
> collections that are in the cluster, but not the number of records.
>
> Is there a way to show the number of records in each of the collections as
> well?
>
> Regards,
> Edwin