Posted to solr-user@lucene.apache.org by Matheo Software Info <in...@matheo-software.com> on 2019/06/21 16:32:02 UTC

Is Solr can do that ?

Dear Solr User,

 

My question is very simple ☺ I would like to know if Solr can process around
30To of data (PDF, Text, Word, etc.)?

 

What is the best way to index this huge volume of data? Several servers? Several
shards? Something else?

 

Many thanks for your information,

 

 

Best Regards

Bruno Mannina

www.matheo-software.com

www.patent-pulse.com

Tel. +33 0 970 738 743

Mob. +33 0 634 421 817

Facebook: https://www.facebook.com/PatentPulse
Twitter: https://twitter.com/matheosoftware
LinkedIn: https://www.linkedin.com/company/matheo-software
YouTube: https://www.youtube.com/user/MatheoSoftware

 




RE: Is Solr can do that ?

Posted by Bruno Mannina <bm...@free.fr>.
Hi Toke,

Thanks for sharing this experience; it's very useful for me to have a first overview of what I will need.
To summarize, I will:
- Learn about Tika
- Ask a lot of questions, such as how often the Solr data will be added to or updated
- Determine the number of users
- Determine the CPU/RAM/HDD requirements
- Run a first test with a representative sample

And of course, build up good expertise :)

Thanks,
Bruno


-----Original Message-----
From: Toke Eskildsen [mailto:toes@kb.dk]
Sent: Saturday, June 22, 2019 11:36
To: solr_user lucene_apache
Subject: Re: Is Solr can do that ?

Matheo Software Info <in...@matheo-software.com> wrote:
> My question is very simple ☺ I would like to know if Solr can process 
> around 30To of data (Pdf, Text, Word, etc…) ?

Simple answer: Yes. Assuming 30To means 30 terabytes.

> What is the best way to index this huge data ? several servers ?
> several shards ? other ?

As other participants have mentioned, it is hard to give numbers. What we can do is share experience.

We are doing web archive indexing, and I guess there would be quite an overlap with your content, as we also use Tika. One difference is that the images in a web archive are quite cheap to index, so you'll probably need (relatively) more hardware than we use. Very roughly, we used 40 CPU-years to index 600 (700? I forget) TB of data in one of our runs. Scaling to your 30TB, this suggests something like 2 CPU-years, or a couple of months for a 16-core machine.

This is just to get a ballpark figure: You will do yourself a huge favor by building a test setup and processing 1 TB or so of your data to get _your_ numbers before you design your indexing setup. It is our experience that the analysis part (Tika) takes much more power than the Solr indexing part: On our last run we had 30-40 CPU cores doing Tika (and related analysis) feeding into a Solr server running on a 4-core machine with spinning drives.


As for the Solr setup for search, you need to describe in detail what your requirements are before we can give you suggestions. Is the index updated all the time, in batches, or one-off? How many concurrent users? Are the searches interactive or batch jobs? What kind of aggregations do you need?

In our setup we build separate collections that are merged down to single segments and never updated. Our use varies between very few interactive users and a lot of batch jobs. Scaling this specialized setup to your corpus size would require about 3TB of SSD, 64GB of RAM, and 4 CPU cores, divided among 4 shards. You are likely to need quite a lot more than that, so this is just to say that at this scale, how the index is used matters _a lot_.

- Toke Eskildsen




Re: Is Solr can do that ?

Posted by Toke Eskildsen <to...@kb.dk>.
Matheo Software Info <in...@matheo-software.com> wrote:
> My question is very simple ☺ I would like to know if Solr can process
> around 30To of data (Pdf, Text, Word, etc…) ?

Simple answer: Yes. Assuming 30To means 30 terabytes.

> What is the best way to index this huge data ? several servers ?
> several shards ? other ?

As other participants have mentioned, it is hard to give numbers. What we can do is share experience.

We are doing web archive indexing, and I guess there would be quite an overlap with your content, as we also use Tika. One difference is that the images in a web archive are quite cheap to index, so you'll probably need (relatively) more hardware than we use. Very roughly, we used 40 CPU-years to index 600 (700? I forget) TB of data in one of our runs. Scaling to your 30TB, this suggests something like 2 CPU-years, or a couple of months for a 16-core machine.

This is just to get a ballpark figure: You will do yourself a huge favor by building a test setup and processing 1 TB or so of your data to get _your_ numbers before you design your indexing setup. It is our experience that the analysis part (Tika) takes much more power than the Solr indexing part: On our last run we had 30-40 CPU cores doing Tika (and related analysis) feeding into a Solr server running on a 4-core machine with spinning drives.
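
As an illustration of that split, here is a minimal sketch of a multi-threaded extraction pipeline feeding Solr. The URL, field names, thread counts, and error handling are all placeholder assumptions, not a tested configuration; a real run would also need batching, retries, and monitoring:

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.stream.Stream;

    import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class ParallelIndexer {
        public static void main(String[] args) throws Exception {
            // Most cores go to Tika, since extraction dominates the cost.
            ExecutorService tikaPool = Executors.newFixedThreadPool(32);

            // ConcurrentUpdateSolrClient queues documents and streams them
            // to Solr from a few background threads.
            ConcurrentUpdateSolrClient solr = new ConcurrentUpdateSolrClient.Builder(
                    "http://localhost:8983/solr/mycollection") // placeholder URL
                    .withQueueSize(1000)
                    .withThreadCount(4)
                    .build();

            try (Stream<Path> files = Files.walk(Paths.get(args[0]))) {
                files.filter(Files::isRegularFile).forEach(path -> tikaPool.submit(() -> {
                    try (InputStream in = Files.newInputStream(path)) {
                        // -1 removes Tika's default 100,000 character write limit.
                        BodyContentHandler text = new BodyContentHandler(-1);
                        new AutoDetectParser().parse(in, text, new Metadata());

                        SolrInputDocument doc = new SolrInputDocument();
                        doc.addField("id", path.toString()); // field names are schema-dependent
                        doc.addField("text", text.toString());
                        solr.add(doc);
                    } catch (Exception e) {
                        // One corrupt PDF should not kill a weeks-long run.
                        System.err.println("Failed on " + path + ": " + e);
                    }
                }));
            }

            tikaPool.shutdown();
            tikaPool.awaitTermination(30, TimeUnit.DAYS);
            solr.blockUntilFinished();
            solr.commit();
            solr.close();
        }
    }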


As for the Solr setup for search, you need to describe in detail what your requirements are before we can give you suggestions. Is the index updated all the time, in batches, or one-off? How many concurrent users? Are the searches interactive or batch jobs? What kind of aggregations do you need?

In our setup we build separate collections that are merged down to single segments and never updated. Our use varies between very few interactive users and a lot of batch jobs. Scaling this specialized setup to your corpus size would require about 3TB of SSD, 64GB of RAM, and 4 CPU cores, divided among 4 shards. You are likely to need quite a lot more than that, so this is just to say that at this scale, how the index is used matters _a lot_.
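
For reference, merging a finished collection down to a single segment can be done from SolrJ roughly like this. A minimal sketch: the base URL and collection name are placeholders, and a force-merge like this only makes sense for an index that will never be updated again, since the merge is expensive:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class MergeToOneSegment {
        public static void main(String[] args) throws Exception {
            try (SolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr").build()) {
                // waitFlush=true, waitSearcher=true, maxSegments=1
                solr.optimize("archive2019", true, true, 1);
            }
        }
    }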

- Toke Eskildsen

RE: Is Solr can do that ?

Posted by Bruno Mannina <bm...@free.fr>.
Hello Sam,

First, thanks for your answer.

I don't know the number of documents yet; I just know that it will be Text, PDF, Word, XLS, etc...
I will try to get more info about the number of documents.

I don't know Tika; I will investigate it.

Thanks,
Bruno


-----Original Message-----
From: Samuel Kasimalla [mailto:skasimalla@gmail.com]
Sent: Friday, June 21, 2019 18:56
To: solr-user@lucene.apache.org
Subject: Re: Is Solr can do that ?

Hi Bruno,

Assuming you meant 30TB, the first step is to use the Tika parser to convert the rich documents into plain text.

We need the number of documents; the unofficial word on the street is about
50 million documents per shard. Of course, a lot of parameters are involved, so it's a simple question, but the answer is not so simple :).

Hope this helps.

Thanks
Sam
https://www.linkedin.com/in/skasimalla/

On Fri, Jun 21, 2019 at 12:49 PM Matheo Software Info <info@matheo-software.com> wrote:

> Dear Solr User,
>
>
>
> My question is very simple ☺ I would like to know if Solr can process
> around 30To of data (Pdf, Text, Word, etc…) ?
>
>
>
> What is the best way to index this huge data ? several servers ? 
> several shards ? other ?
>
>
>
> Many thanks for your information,
>
>
>
>
>
> Best Regards
>
> Bruno Mannina
>
> www.matheo-software.com
>
> www.patent-pulse.com
>
> Tel. +33 0 970 738 743
>
> Mob. +33 0 634 421 817
>
> Facebook: https://www.facebook.com/PatentPulse
> Twitter: https://twitter.com/matheosoftware
> LinkedIn: https://www.linkedin.com/company/matheo-software
> YouTube: https://www.youtube.com/user/MatheoSoftware
>
>
>
>
>




RE: Is Solr can do that ?

Posted by Bruno Mannina <bm...@free.fr>.
Hello Erick,

Well, I do not know Tika; I will of course study it.

Thanks for the info concerning SolrJ and Tika.

Bruno

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com]
Sent: Friday, June 21, 2019 19:10
To: solr-user@lucene.apache.org
Subject: Re: Is Solr can do that ?

What Sam said. 

Here’s something to get you started on how and why it’s better to run Tika on the client side rather than shipping the docs to Solr and having ExtractingRequestHandler do the extraction there: https://lucidworks.com/2012/02/14/indexing-with-solrj/

Best,
Erick

> On Jun 21, 2019, at 9:56 AM, Samuel Kasimalla <sk...@gmail.com> wrote:
> 
> Hi Bruno,
> 
> Assuming you meant 30TB, the first step is to use the Tika parser to
> convert the rich documents into plain text.
> 
> We need the number of documents; the unofficial word on the street is
> about 50 million documents per shard. Of course, a lot of parameters are
> involved, so it's a simple question, but the answer is not so simple :).
> 
> Hope this helps.
> 
> Thanks
> Sam
> https://www.linkedin.com/in/skasimalla/
> 
> On Fri, Jun 21, 2019 at 12:49 PM Matheo Software Info < 
> info@matheo-software.com> wrote:
> 
>> Dear Solr User,
>> 
>> 
>> 
>> My question is very simple ☺ I would like to know if Solr can process
>> around 30To of data (Pdf, Text, Word, etc…) ?
>> 
>> 
>> 
>> What is the best way to index this huge data ? several servers ? 
>> several shards ? other ?
>> 
>> 
>> 
>> Many thanks for your information,
>> 
>> 
>> 
>> 
>> 
>> Best Regards
>> 
>> Bruno Mannina
>> 
>> www.matheo-software.com
>> 
>> www.patent-pulse.com
>> 
>> Tel. +33 0 970 738 743
>> 
>> Mob. +33 0 634 421 817
>> 
>> Facebook: https://www.facebook.com/PatentPulse
>> Twitter: https://twitter.com/matheosoftware
>> LinkedIn: https://www.linkedin.com/company/matheo-software
>> YouTube: https://www.youtube.com/user/MatheoSoftware
>> 
>> 
>> 
>> 




Re: Is Solr can do that ?

Posted by Erick Erickson <er...@gmail.com>.
What Sam said. 

Here’s something to get you started on how and why it’s better to run Tika on the client side rather than shipping the docs to Solr and having ExtractingRequestHandler do the extraction there: https://lucidworks.com/2012/02/14/indexing-with-solrj/
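
As a rough sketch of the SolrJ side of that approach (the URL and field names are placeholders, and it assumes Tika has already reduced the rich document to plain text):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexExtractedText {
        public static void main(String[] args) throws Exception {
            // Assume Tika has already produced this on the client side.
            String extractedText = "...plain text from Tika...";

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-0001");      // field names depend on your schema
            doc.addField("text", extractedText);

            // Placeholder core/collection URL.
            try (SolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/mycollection").build()) {
                solr.add(doc);
                solr.commit();
            }
        }
    }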

Best,
Erick

> On Jun 21, 2019, at 9:56 AM, Samuel Kasimalla <sk...@gmail.com> wrote:
> 
> Hi Bruno,
> 
> Assuming you meant 30TB, the first step is to use the Tika parser to convert
> the rich documents into plain text.
> 
> We need the number of documents; the unofficial word on the street is about
> 50 million documents per shard. Of course, a lot of parameters are involved,
> so it's a simple question, but the answer is not so simple :).
> 
> Hope this helps.
> 
> Thanks
> Sam
> https://www.linkedin.com/in/skasimalla/
> 
> On Fri, Jun 21, 2019 at 12:49 PM Matheo Software Info <
> info@matheo-software.com> wrote:
> 
>> Dear Solr User,
>> 
>> 
>> 
>> My question is very simple ☺ I would like to know if Solr can process
>> around 30To of data (Pdf, Text, Word, etc…) ?
>> 
>> 
>> 
>> What is the best way to index this huge data ? several servers ? several
>> shards ? other ?
>> 
>> 
>> 
>> Many thanks for your information,
>> 
>> 
>> 
>> 
>> 
>> Best Regards
>> 
>> Bruno Mannina
>> 
>> www.matheo-software.com
>> 
>> www.patent-pulse.com
>> 
>> Tel. +33 0 970 738 743
>> 
>> Mob. +33 0 634 421 817
>> 
>> Facebook: https://www.facebook.com/PatentPulse
>> Twitter: https://twitter.com/matheosoftware
>> LinkedIn: https://www.linkedin.com/company/matheo-software
>> YouTube: https://www.youtube.com/user/MatheoSoftware
>> 
>> 
>> 
>> 


Re: Is Solr can do that ?

Posted by Samuel Kasimalla <sk...@gmail.com>.
Hi Bruno,

Assuming you meant 30TB, the first step is to use the Tika parser to convert
the rich documents into plain text.

We need the number of documents; the unofficial word on the street is about
50 million documents per shard. Of course, a lot of parameters are involved,
so it's a simple question, but the answer is not so simple :).
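
To make that first Tika step concrete, a minimal extraction sketch (it assumes the tika-core and tika-parsers jars are on the classpath; the -1 write limit is a choice, not a requirement):

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class TikaToText {
        public static void main(String[] args) throws Exception {
            // -1 removes the default 100,000 character write limit.
            BodyContentHandler text = new BodyContentHandler(-1);
            Metadata metadata = new Metadata();
            try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
                // AutoDetectParser picks the right parser from the bytes.
                new AutoDetectParser().parse(in, text, metadata);
            }
            System.out.println(metadata.get("Content-Type")); // e.g. application/pdf
            System.out.println(text.toString());              // the plain text to index
        }
    }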

Hope this helps.

Thanks
Sam
https://www.linkedin.com/in/skasimalla/

On Fri, Jun 21, 2019 at 12:49 PM Matheo Software Info <
info@matheo-software.com> wrote:

> Dear Solr User,
>
>
>
> My question is very simple ☺ I would like to know if Solr can process
> around 30To of data (Pdf, Text, Word, etc…) ?
>
>
>
> What is the best way to index this huge data ? several servers ? several
> shards ? other ?
>
>
>
> Many thanks for your information,
>
>
>
>
>
> Best Regards
>
> Bruno Mannina
>
> www.matheo-software.com
>
> www.patent-pulse.com
>
> Tel. +33 0 970 738 743
>
> Mob. +33 0 634 421 817
>
> Facebook: https://www.facebook.com/PatentPulse
> Twitter: https://twitter.com/matheosoftware
> LinkedIn: https://www.linkedin.com/company/matheo-software
> YouTube: https://www.youtube.com/user/MatheoSoftware
>
>
>
>
>

RE: Is Solr can do that ?

Posted by Bruno Mannina <bm...@free.fr>.
Hello Shawn,

It's good news that Solr can do that.

I know that with 30TB of data, hardware will be the first thing to sort out.
Concerning expertise, that's the real problem for me.

First, I think I will run several tests to see how Solr works with
non-XML documents (I only have experience with XML documents).

Thanks,
Bruno 

On 6/21/2019 10:32 AM, Matheo Software Info wrote:
> My question is very simple ☺ I would like to know if Solr can process
> around 30To of data (Pdf, Text, Word, etc.) ?
> 
> What is the best way to index this huge data ? several servers ? 
> several shards ? other ?

Sure, Solr can do that.  Whether you have enough resources or expertise
available to accomplish it is an entirely different question.

Handling that much data is likely going to require a LOT of expensive
hardware.  The index will almost certainly need to be sharded.  Knowing
exactly what numbers are involved is impossible with the information
available ... and even with more information, it will most likely require
experimentation with your actual data to find an optimal solution.

https://lucidworks.com/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Thanks,
Shawn




Re: Is Solr can do that ?

Posted by Shawn Heisey <ap...@elyograg.org>.
On 6/21/2019 10:32 AM, Matheo Software Info wrote:
> My question is very simple ☺ I would like to know if Solr can process
> around 30To of data (Pdf, Text, Word, etc…) ?
> 
> What is the best way to index this huge data ? several servers ? several 
> shards ? other ?

Sure, Solr can do that.  Whether you have enough resources or expertise 
available to accomplish it is an entirely different question.

Handling that much data is likely going to require a LOT of expensive 
hardware.  The index will almost certainly need to be sharded.  Knowing 
exactly what numbers are involved is impossible with the information 
available ... and even with more information, it will most likely 
require experimentation with your actual data to find an optimal solution.

https://lucidworks.com/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
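
For what it's worth, once that experimentation settles on shard and replica counts, creating the sharded SolrCloud collection from SolrJ looks roughly like this (the ZooKeeper address, collection name, and the 8x2 layout are placeholders for illustration, not a recommendation):

    import java.util.Collections;
    import java.util.Optional;

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class CreateShardedCollection {
        public static void main(String[] args) throws Exception {
            // Placeholder ZooKeeper ensemble address, no chroot.
            try (CloudSolrClient solr = new CloudSolrClient.Builder(
                    Collections.singletonList("zk1:2181"), Optional.empty()).build()) {
                // 8 shards x 2 replicas; the real numbers come from testing.
                CollectionAdminRequest
                        .createCollection("bigdocs", "_default", 8, 2)
                        .process(solr);
            }
        }
    }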

Thanks,
Shawn