Posted to solr-user@lucene.apache.org by Manisha Rahatadkar <ma...@AnjuSoftware.com> on 2020/10/01 12:57:10 UTC

RE: Solr 7.7 - Few Questions

I apologize for sending this email again; I don't mean to spam the mailbox, but I am looking for urgent help.

We are using Apache Solr 7.7 on the Windows platform. The data is synced to Solr using a Solr.Net commit, and it is synced to Solr in batches. The document size is huge (~0.5GB on average) and Solr indexing is taking a long time. The total document size is ~200GB. As the Solr commit is done as part of an API call, the API calls are failing because document indexing is not completed.


  1.  What is your advice on syncing such a large volume of data to Solr KB.
  2.  Because of the search requirements, almost 8 fields are defined as Text fields.
  3.  Currently Solr_JAVA_MEM is set to 2GB. Is that enough for such a large volume of data?
  4.  How to set up Solr in production on Windows? Currently it's set up as a standalone engine and the client is asked to take a backup of the drive. Is there a better way to do this? How to set up for disaster recovery?
  5.  How to benchmark the system requirements for such a huge volume of data?

Thanks in advance.

Regards
Manisha Rahatadkar



Re: Solr 7.7 - Few Questions

Posted by Rahul Goswami <ra...@gmail.com>.
1. What tool they use to run Solr as a service on Windows.
>> Look into procrun. After all, Solr runs inside Jetty, so you should have
a way to invoke Jetty’s Main class with the required parameters and bundle that
as a procrun service.

2. How to set up disaster recovery?
>> You can back up your indexes at regular intervals. This can be done by
taking snapshots and backing them up, and then using the appropriate
snapshot names to restore a certain commit point. For more details, please
refer to this link:
https://lucene.apache.org/solr/guide/7_7/making-and-restoring-backups.html
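
As a rough illustration of those snapshot calls for a standalone core (the core name,
host and backup location below are made up; adjust them for your install):

  # take a named snapshot of the "documents" core into E:\solr-backups
  curl "http://localhost:8983/solr/documents/replication?command=backup&location=E:/solr-backups&name=nightly"

  # later, restore that named snapshot into the same core
  curl "http://localhost:8983/solr/documents/replication?command=restore&location=E:/solr-backups&name=nightly"

  # poll the restore progress
  curl "http://localhost:8983/solr/documents/replication?command=restorestatus"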

3. How to scale up the servers for better performance?
>> This is too open-ended a question and depends on a lot of factors
specific to your environment and use case :)

- Rahul

RE: Solr 7.7 - Few Questions

Posted by "Hanjan, Harinderdeep S." <Ha...@calgary.ca>.
1. What tool they use to run Solr as a service on Windows.
We use NSSM. https://nssm.cc/
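
For what it's worth, a minimal NSSM setup looks roughly like the commands below (the
service name and install path are only examples, not a recommendation):

  nssm install Solr7 "C:\solr-7.7.0\bin\solr.cmd" start -f -p 8983
  nssm set Solr7 AppDirectory "C:\solr-7.7.0"
  nssm start Solr7

The -f flag keeps Solr in the foreground so NSSM can supervise the process and restart
it if it dies.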



RE: Solr 7.7 - Few Questions

Posted by Manisha Rahatadkar <ma...@AnjuSoftware.com>.
Hi All

First of all, thanks to Shawn, Rahul and Charlie for taking the time to reply to my questions and for the valuable information.

I was very concerned about the size of each document, and on several follow-ups learned that the documents of ~0.5GB size are mp4 files, and these are not synced to Solr.

@Shawn Heisey recommended NOT to use Windows because of the Windows license cost and because service installer testing is done on Linux.
I agree with him. We are using the NSSM tool to run Solr as a service.

Are there any members here using Solr on Windows? I look forward to hearing from them on:

1. What tool they use to run Solr as a service on Windows.
2. How to set up disaster recovery?
3. How to scale up the servers for better performance?

Thanks in advance; I look forward to hearing about your experiences scaling up Solr.

Regards,
Manisha Rahatadkar


Re: Solr 7.7 - Few Questions

Posted by Charlie Hull <ch...@flax.co.uk>.
Nested docs would be one approach; result grouping might be another.
Regarding JOINs, the only way you're going to know is by doing some
representative testing.
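
As a rough sketch of what those two approaches look like at query time (the field names
are invented; they assume each Solr document carries a doc_type field and a shared
thread_id):

  # nested documents: match child PDFs, return the parent email
  q={!parent which="doc_type:email"}content:invoice

  # result grouping: search everything, collapse results by the shared id
  q=content:invoice&group=true&group.field=thread_id&group.limit=10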

Charlie

On 05/10/2020 05:49, Rahul Goswami wrote:
> Charlie,
> Thanks for providing an alternate approach to doing this. It would be
> interesting to know how one  could go about organizing the docs in this
> case? (Nested documents?) How would join queries perform on a large
> index(200 million+ docs)?
>
> Thanks,
> Rahul

-- 
Charlie Hull
OpenSource Connections, previously Flax

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.o19s.com


Re: Solr 7.7 - Few Questions

Posted by Rahul Goswami <ra...@gmail.com>.
Charlie,
Thanks for providing an alternate approach to doing this. It would be
interesting to know how one could go about organizing the docs in this
case (nested documents, perhaps?). How would join queries perform on a large
index (200 million+ docs)?

Thanks,
Rahul




Re: Solr 7.7 - Few Questions

Posted by Charlie Hull <ch...@flax.co.uk>.
Hi Rahul,

In addition to the wise advice below: remember in Solr, a 'document' is 
just the name for the thing that would appear as one of the results when 
you search (analogous to a database record). It's not the same
conceptually as a 'Word document' or a 'PDF document'. If your source 
documents are so big, consider how they might be broken into parts, or 
whether you really need to index all of them for retrieval purposes, or 
what parts of them need to be extracted as text. Thus, the Solr 
documents don't necessarily need to be as large as your source documents.

Consider an email of size 20KB with ten PDF attachments, each 20MB. You
probably shouldn't push all this data into a single Solr document, but
you *could* index them as 11 separate Solr documents, each with metadata
to indicate whether it is the email or one of the PDFs, and a shared ID of
some kind to indicate they're related. Then at query time there are
various ways for you to group these together, so for example if the
query hits one of the PDFs you could show the user the original email,
plus the 9 other attachments, using the shared ID as a key.
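
As a loose sketch of that modelling (field names are invented for illustration), the 11
Solr documents might look something like this, with the shared key repeated on each one:

  { "id": "msg-42",        "doc_type": "email",      "thread_id": "msg-42", "body": "..." }
  { "id": "msg-42-att-1",  "doc_type": "attachment", "thread_id": "msg-42", "body": "..." }
  ...
  { "id": "msg-42-att-10", "doc_type": "attachment", "thread_id": "msg-42", "body": "..." }

A query-time grouping or collapse on thread_id then brings the email and its attachments
back together as one result.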

HTH,

Charlie


-- 
Charlie Hull
OpenSource Connections, previously Flax

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.o19s.com


Re: Solr 7.7 - Few Questions

Posted by Rahul Goswami <ra...@gmail.com>.
Manisha,
In addition to what Shawn has mentioned above, I would also like you to
reevaluate your use case. Do you *need to* index the whole document? E.g.:
If it's an email, the body of the email *might* be more important than any
attachments, in which case you could choose to only index the email body
and ignore (or only partially index) the text from attachments. If you
could afford to index the documents partially, you could consider Solr's
"Limit token count filter": See the link below.

https://lucene.apache.org/solr/guide/7_7/filter-descriptions.html#limit-token-count-filter

You'll need to configure it in the schema for the "index" analyzer for the
data type of the field with large text.
Indexing documents on the order of half a GB will definitely come back to hurt
your operations, if not now then later (think OOM, extremely slow atomic
updates, long-running merges, etc.).
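
A minimal sketch of that schema change (the type name and maxTokenCount below are made
up for illustration; pick a limit that fits your search needs):

  <fieldType name="text_limited" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="100000" consumeAllTokens="false"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

With consumeAllTokens="false" the indexer simply stops reading the field after the first
100,000 tokens instead of analyzing the entire half-GB of text.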

- Rahul




Re: Solr 7.7 - Few Questions

Posted by Shawn Heisey <ap...@elyograg.org>.
On 10/1/2020 6:57 AM, Manisha Rahatadkar wrote:
> We are using Apache Solr 7.7 on Windows platform. The data is synced to Solr using Solr.Net commit. The data is being synced to SOLR in batches. The document size is very huge (~0.5GB average) and solr indexing is taking long time. Total document size is ~200GB. As the solr commit is done as a part of API, the API calls are failing as document indexing is not completed.

A single document is five hundred megabytes?  What kind of documents do 
you have?  You can't even index something that big without tweaking 
configuration parameters that most people don't even know about. 
Assuming you can even get it working, there's no way that indexing a 
document like that is going to be fast.
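
(The parameters aren't named here, but as one example of the kind of setting involved,
the request parser limits in solrconfig.xml cap how large an update request Solr will
accept; the form-data limit in particular is only 2MB out of the box:

  <requestParsers enableRemoteStreaming="false"
                  multipartUploadLimitInKB="2048000"
                  formdataUploadLimitInKB="2048"/>

The values shown are the stock 7.x defaults, for orientation only, not a tuned
recommendation.)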

>    1.  What is your advise on syncing such a large volume of data to Solr KB.

What is "KB"?  I have never heard of this in relation to Solr.

>    2.  Because of the search requirements, almost 8 fields are defined as Text fields.

I can't figure out what you are trying to say with this statement.

>    3.  Currently Solr_JAVA_MEM is set to 2gb. Is that enough for such a large volume of data?

If just one of the documents you're sending to Solr really is five 
hundred megabytes, then 2 gigabytes would probably be just barely enough 
to index one document into an empty index ... and it would probably be 
doing garbage collection so frequently that it would make things REALLY 
slow.  I have no way to predict how much heap you will need.  That will 
require experimentation.  I can tell you that 2GB is definitely not enough.
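
(For reference, on Windows the heap is set via bin\solr.in.cmd; the 8g figure below is
just a placeholder to start experimenting from, not a sizing recommendation:

  REM in bin\solr.in.cmd
  set SOLR_JAVA_MEM=-Xms8g -Xmx8g

The effective value shows up under the JVM args on the Admin UI dashboard after a
restart.)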

>    4.  How to set up Solr in production on Windows? Currently it's set up as a standalone engine and client is requested to take the backup of the drive. Is there any other better way to do? How to set up for the disaster recovery?

I would suggest NOT doing it on Windows.  My reasons for that come down 
to costs -- a Windows Server license isn't cheap.

That said, there's nothing wrong with running on Windows, but you're on 
your own as far as running it as a service.  We only have a service 
installer for UNIX-type systems.  Most of the testing for that is done 
on Linux.

>    5.  How to benchmark the system requirements for such a huge data

I do not know what all your needs are, so I have no way to answer this. 
You're going to know a lot more about it than any of us are.

Thanks,
Shawn