Posted to solr-user@lucene.apache.org by Srinivas Kashyap <sr...@bamboorose.com> on 2018/11/19 09:31:46 UTC

Sort index by size

Hello,

I have a Solr core with some 20 fields in it (all are stored and indexed). For one environment, the number of documents is around 0.29 million. When I run the full import through DIH, indexing completes successfully, but it occupies around 5 GB of disk space. Is there a way to check which documents are consuming the most space? Put another way, can I sort the index by document size?

Thanks and Regards,
Srinivas Kashyap


Re: Sort index by size

Posted by Gus Heck <gu...@gmail.com>.
Just as a sanity check: is this index getting replicated many times, or
otherwise scaled up? It sounds like about $3.50/mo of disk space on AWS,
and it should all fit in RAM on any decent-sized server (i.e. any server
that looks like half or a quarter of a decent laptop).

As a question it's interesting, but it doesn't yet sound like a problem
worth sweating.

On Mon, Nov 19, 2018, 3:29 PM Edward Ribeiro <edward.ribeiro@gmail.com> wrote:

> One more tidbit: are you really sure you need all 20 fields to be indexed
> and stored? Do you really need all those 20 fields?
>
> See this blog post, for example:
> https://www.garysieling.com/blog/tuning-solr-lucene-disk-usage
>
> On Mon, Nov 19, 2018 at 1:45 PM Walter Underwood <wu...@wunderwood.org> wrote:
> >
> > Worst case is 3X. That happens when there are no merges until the commit.
> >
> > With tlogs, the worst case is more than that. I’ve seen humongous tlogs
> > with a batch load and no hard commit until the end. If you do that
> > several times, then you have a few old humongous tlogs. Bleah.
> >
> > wunder
> > Walter Underwood
> > wunder@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> > > On Nov 19, 2018, at 7:40 AM, David Hastings <hastings.recursive@gmail.com> wrote:
> > >
> > > Also a full import, assuming the documents were already indexed, will
> > > just double your index size until a merge/optimize is run, since you
> > > are just marking a document as deleted, not taking back any space, and
> > > then adding another completely new document on top of it.
> > >
> > > On Mon, Nov 19, 2018 at 10:36 AM Shawn Heisey <ap...@elyograg.org> wrote:
> > >
> > >> On 11/19/2018 2:31 AM, Srinivas Kashyap wrote:
> > >>> I have a Solr core with some 20 fields in it (all are stored and
> > >>> indexed). For one environment, the number of documents is around
> > >>> 0.29 million. When I run the full import through DIH, indexing
> > >>> completes successfully, but it occupies around 5 GB of disk space.
> > >>> Is there a way to check which documents are consuming the most
> > >>> space? Put another way, can I sort the index by document size?
> > >>
> > >> I am not aware of any way to do that.  Might be one that I don't know
> > >> about, but if there were a way, seems like I would have come across it
> > >> before.
> > >>
> > >> It is not very likely that the large index size is due to a single
> > >> document or a handful of documents.  It is more likely that most
> > >> documents are relatively large.  I could be wrong about that, though.
> > >>
> > >> If you have 290000 documents (which is how I interpreted 0.29 million)
> > >> and the total index size is about 5 GB, then the average size per
> > >> document in the index is about 18 kilobytes. This is in my view pretty
> > >> large.  Typically I think that most documents are 1-2 kilobytes.
> > >>
> > >> Can we get your Solr version, a copy of your schema, and exactly what
> > >> Solr returns in search results for a typically sized document?  You'll
> > >> need to use a paste website or a file-sharing website ... if you try to
> > >> attach these things to a message, the mailing list will most likely eat
> > >> them, and we'll never see them. If you need to redact the information in
> > >> search results ... please do it in a way that we can still see the exact
> > >> size of the text -- don't just remove information, replace it with
> > >> information that's the same length.
> > >>
> > >> Thanks,
> > >> Shawn
> > >>
> > >>
>

Re: Sort index by size

Posted by Edward Ribeiro <ed...@gmail.com>.
One more tidbit: are you really sure you need all 20 fields to be indexed
and stored? Do you really need all those 20 fields?

See this blog post, for example:
https://www.garysieling.com/blog/tuning-solr-lucene-disk-usage
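
To make that concrete, the switch lives in schema.xml. A hypothetical field
definition (the field name is made up) that keeps a field searchable but
drops the stored copy, which is often the bulk of the on-disk size, would
look something like:

    <field name="long_description" type="text_general" indexed="true" stored="false" />

A display-only field can be the reverse (stored="true" indexed="false").
A full reindex is needed for changes like this to take effect.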

On Mon, Nov 19, 2018 at 1:45 PM Walter Underwood <wu...@wunderwood.org>
wrote:
>
> Worst case is 3X. That happens when there are no merges until the commit.
>
> With tlogs, the worst case is more than that. I’ve seen humongous tlogs
> with a batch load and no hard commit until the end. If you do that
> several times, then you have a few old humongous tlogs. Bleah.
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Nov 19, 2018, at 7:40 AM, David Hastings <hastings.recursive@gmail.com> wrote:
> >
> > Also a full import, assuming the documents were already indexed, will
> > just double your index size until a merge/optimize is run, since you
> > are just marking a document as deleted, not taking back any space, and
> > then adding another completely new document on top of it.
> >
> > On Mon, Nov 19, 2018 at 10:36 AM Shawn Heisey <ap...@elyograg.org> wrote:
> >
> > >> On 11/19/2018 2:31 AM, Srinivas Kashyap wrote:
> > >>> I have a Solr core with some 20 fields in it (all are stored and
> > >>> indexed). For one environment, the number of documents is around
> > >>> 0.29 million. When I run the full import through DIH, indexing
> > >>> completes successfully, but it occupies around 5 GB of disk space.
> > >>> Is there a way to check which documents are consuming the most
> > >>> space? Put another way, can I sort the index by document size?
> > >>
> > >> I am not aware of any way to do that.  Might be one that I don't know
> > >> about, but if there were a way, seems like I would have come across it
> > >> before.
> > >>
> > >> It is not very likely that the large index size is due to a single
> > >> document or a handful of documents.  It is more likely that most
> > >> documents are relatively large.  I could be wrong about that, though.
> > >>
> > >> If you have 290000 documents (which is how I interpreted 0.29 million)
> > >> and the total index size is about 5 GB, then the average size per
> > >> document in the index is about 18 kilobytes. This is in my view pretty
> > >> large.  Typically I think that most documents are 1-2 kilobytes.
> > >>
> > >> Can we get your Solr version, a copy of your schema, and exactly what
> > >> Solr returns in search results for a typically sized document?  You'll
> > >> need to use a paste website or a file-sharing website ... if you try to
> > >> attach these things to a message, the mailing list will most likely eat
> > >> them, and we'll never see them. If you need to redact the information in
> > >> search results ... please do it in a way that we can still see the exact
> > >> size of the text -- don't just remove information, replace it with
> > >> information that's the same length.
> > >>
> > >> Thanks,
> > >> Shawn
> > >>
> > >>

Re: Sort index by size

Posted by Walter Underwood <wu...@wunderwood.org>.
Worst case is 3X. That happens when there are no merges until the commit.

With tlogs, the worst case is more than that. I’ve seen humongous tlogs with a batch load and no hard commit until the end. If you do that several times, then you have a few old humongous tlogs. Bleah.
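
The usual guard (a sketch; tune the interval to your load) is an autoCommit
block in solrconfig.xml, so periodic hard commits roll the tlog without
exposing the new documents to searches:

    <updateHandler class="solr.DirectUpdateHandler2">
      <autoCommit>
        <!-- hard commit at most every 60 seconds; flushes segments and rolls the tlog -->
        <maxTime>60000</maxTime>
        <!-- don't open a new searcher on these commits -->
        <openSearcher>false</openSearcher>
      </autoCommit>
    </updateHandler>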

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Nov 19, 2018, at 7:40 AM, David Hastings <ha...@gmail.com> wrote:
> 
> Also a full import, assuming the documents were already indexed, will just
> double your index size until a merge/optimize is run, since you are just
> marking a document as deleted, not taking back any space, and then adding
> another completely new document on top of it.
> 
> On Mon, Nov 19, 2018 at 10:36 AM Shawn Heisey <ap...@elyograg.org> wrote:
> 
>> On 11/19/2018 2:31 AM, Srinivas Kashyap wrote:
>>> I have a Solr core with some 20 fields in it (all are stored and
>>> indexed). For one environment, the number of documents is around 0.29
>>> million. When I run the full import through DIH, indexing completes
>>> successfully, but it occupies around 5 GB of disk space. Is there a
>>> way to check which documents are consuming the most space? Put another
>>> way, can I sort the index by document size?
>> 
>> I am not aware of any way to do that.  Might be one that I don't know
>> about, but if there were a way, seems like I would have come across it
>> before.
>> 
>> It is not very likely that the large index size is due to a single document or
>> a handful of documents.  It is more likely that most documents are
>> relatively large.  I could be wrong about that, though.
>> 
>> If you have 290000 documents (which is how I interpreted 0.29 million)
>> and the total index size is about 5 GB, then the average size per
>> document in the index is about 18 kilobytes. This is in my view pretty
>> large.  Typically I think that most documents are 1-2 kilobytes.
>> 
>> Can we get your Solr version, a copy of your schema, and exactly what
>> Solr returns in search results for a typically sized document?  You'll
>> need to use a paste website or a file-sharing website ... if you try to
>> attach these things to a message, the mailing list will most likely eat
>> them, and we'll never see them. If you need to redact the information in
>> search results ... please do it in a way that we can still see the exact
>> size of the text -- don't just remove information, replace it with
>> information that's the same length.
>> 
>> Thanks,
>> Shawn
>> 
>> 


Re: Sort index by size

Posted by David Hastings <ha...@gmail.com>.
Also a full import, assuming the documents were already indexed, will just
double your index size until a merge/optimize is run, since you are just
marking a document as deleted, not taking back any space, and then adding
another completely new document on top of it.
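
You can watch that happen via the core's Luke handler, which reports maxDoc
(live plus deleted) alongside numDocs (live). A rough sketch in Python (the
core name "mycore" is an assumption):

    import requests  # third-party HTTP client

    # Ask the Luke request handler for top-level index statistics only.
    resp = requests.get(
        "http://localhost:8983/solr/mycore/admin/luke",
        params={"numTerms": 0, "wt": "json"},
    )
    index = resp.json()["index"]
    deleted = index["maxDoc"] - index["numDocs"]
    print(f"live docs: {index['numDocs']}, deleted but still on disk: {deleted}")

If the deleted count is large, natural merges (or an explicit optimize) will
reclaim the space.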

On Mon, Nov 19, 2018 at 10:36 AM Shawn Heisey <ap...@elyograg.org> wrote:

> On 11/19/2018 2:31 AM, Srinivas Kashyap wrote:
> > I have a Solr core with some 20 fields in it (all are stored and
> > indexed). For one environment, the number of documents is around 0.29
> > million. When I run the full import through DIH, indexing completes
> > successfully, but it occupies around 5 GB of disk space. Is there a
> > way to check which documents are consuming the most space? Put another
> > way, can I sort the index by document size?
>
> I am not aware of any way to do that.  Might be one that I don't know
> about, but if there were a way, seems like I would have come across it
> before.
>
> It is not very likely that the large index size is due to a single document or
> a handful of documents.  It is more likely that most documents are
> relatively large.  I could be wrong about that, though.
>
> If you have 290000 documents (which is how I interpreted 0.29 million)
> and the total index size is about 5 GB, then the average size per
> document in the index is about 18 kilobytes. This is in my view pretty
> large.  Typically I think that most documents are 1-2 kilobytes.
>
> Can we get your Solr version, a copy of your schema, and exactly what
> Solr returns in search results for a typically sized document?  You'll
> need to use a paste website or a file-sharing website ... if you try to
> attach these things to a message, the mailing list will most likely eat
> them, and we'll never see them. If you need to redact the information in
> search results ... please do it in a way that we can still see the exact
> size of the text -- don't just remove information, replace it with
> information that's the same length.
>
> Thanks,
> Shawn
>
>

FW: Sort index by size

Posted by Srinivas Kashyap <sr...@bamboorose.com>.
Hi Shawn and everyone who replied to the thread,

The Solr version is 5.2.1, and each document returns multi-valued fields for the majority of the fields defined in schema.xml. I'm in the process of pasting the contents of my files to a paste website and will update soon.

Thanks,
Srinivas


On 11/19/2018 2:31 AM, Srinivas Kashyap wrote:
> I have a Solr core with some 20 fields in it (all are stored and indexed). For one environment, the number of documents is around 0.29 million. When I run the full import through DIH, indexing completes successfully, but it occupies around 5 GB of disk space. Is there a way to check which documents are consuming the most space? Put another way, can I sort the index by document size?

I am not aware of any way to do that.  Might be one that I don't know about, but if there were a way, seems like I would have come across it before.

It is not very likely that the large index size is due to a single document or a handful of documents.  It is more likely that most documents are relatively large.  I could be wrong about that, though.

If you have 290000 documents (which is how I interpreted 0.29 million) and the total index size is about 5 GB, then the average size per document in the index is about 18 kilobytes. This is in my view pretty large.  Typically I think that most documents are 1-2 kilobytes.

Can we get your Solr version, a copy of your schema, and exactly what Solr returns in search results for a typically sized document?  You'll need to use a paste website or a file-sharing website ... if you try to attach these things to a message, the mailing list will most likely eat them, and we'll never see them. If you need to redact the information in search results ... please do it in a way that we can still see the exact size of the text -- don't just remove information, replace it with information that's the same length.

Thanks,
Shawn


Re: Sort index by size

Posted by Shawn Heisey <ap...@elyograg.org>.
On 11/19/2018 2:31 AM, Srinivas Kashyap wrote:
> I have a Solr core with some 20 fields in it (all are stored and indexed). For one environment, the number of documents is around 0.29 million. When I run the full import through DIH, indexing completes successfully, but it occupies around 5 GB of disk space. Is there a way to check which documents are consuming the most space? Put another way, can I sort the index by document size?

I am not aware of any way to do that.  Might be one that I don't know 
about, but if there were a way, seems like I would have come across it 
before.
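
The closest workaround I can think of (a crude sketch, not a real feature;
the core name and uniqueKey field are assumptions) is to measure each
document as Solr returns it with all stored fields and rank those sizes
client-side:

    import json

    import requests  # third-party HTTP client

    # Fetch every document with all stored fields. 290k docs is small enough
    # to pull in one request; use cursorMark paging for larger indexes.
    docs = requests.get(
        "http://localhost:8983/solr/mycore/select",
        params={"q": "*:*", "fl": "*", "rows": 290000, "wt": "json"},
    ).json()["response"]["docs"]

    # Rank documents by the size of their JSON serialization.
    # "id" assumes your uniqueKey field; adjust to your schema.
    sizes = sorted(
        ((len(json.dumps(d)), d.get("id")) for d in docs),
        key=lambda t: t[0],
        reverse=True,
    )
    for size, doc_id in sizes[:10]:
        print(f"{size:>8} bytes  id={doc_id}")

That only reflects stored-field size, not the inverted index, but it will
usually surface any outliers.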

It is not very likely that the large index size is due to a single document or 
a handful of documents.  It is more likely that most documents are 
relatively large.  I could be wrong about that, though.

If you have 290000 documents (which is how I interpreted 0.29 million) 
and the total index size is about 5 GB, then the average size per 
document in the index is about 18 kilobytes. This is in my view pretty 
large.  Typically I think that most documents are 1-2 kilobytes.
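
For the record, that figure is just the division:

    # 5 GiB spread across 290,000 documents, expressed in kilobytes:
    print(5 * 1024**3 / 290_000 / 1024)  # ~18.1

so "about 18 kilobytes" is the right ballpark.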

Can we get your Solr version, a copy of your schema, and exactly what 
Solr returns in search results for a typically sized document?  You'll 
need to use a paste website or a file-sharing website ... if you try to 
attach these things to a message, the mailing list will most likely eat 
them, and we'll never see them. If you need to redact the information in 
search results ... please do it in a way that we can still see the exact 
size of the text -- don't just remove information, replace it with 
information that's the same length.

Thanks,
Shawn