Posted to solr-user@lucene.apache.org by mustafozbek <mu...@gmail.com> on 2012/01/13 13:08:00 UTC

Can Apache Solr Handle TeraByte Large Data

I have been an Apache Solr user for about a year. I have used Solr for simple search
tools, but now I want to use it with 5TB of data. I expect that the 5TB of data will
grow to about 7TB once Solr indexes it, given the filters I use. I will then add
nearly 50MB of data per hour to the same index.
1-  Are there any problems with using a single Solr server for 5TB of data (without
shards)?
   a-  Can a Solr server answer queries in an acceptable time?
   b-  What is the expected time for committing 50MB of data on a 7TB index?
   c-  Is there an upper limit on index size?
2-  What suggestions do you have?
   a-  How many shards should I use?
   b-  Should I use Solr cores?
   c-  What commit frequency would you recommend? (Is 1 hour OK?)
3-  Are there any test results for this kind of large data?

The 5TB of data is not available yet; I just want to estimate what the result will
be.
Note: you can assume that hardware resources are not a problem.


--
View this message in context: http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p3656484.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Can Apache Solr Handle TeraByte Large Data

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Could indexing the English Wikipedia dump over and over get you there?

Otis 
----
Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html



>________________________________
> From: Memory Makers <me...@gmail.com>
>To: solr-user@lucene.apache.org 
>Sent: Tuesday, January 17, 2012 12:15 AM
>Subject: Re: Can Apache Solr Handle TeraByte Large Data
> 
>I've been toying with the idea of setting up an experiment to index a large
>document set 1+ TB -- any thoughts on an open data set that one could use
>for this purpose?
>
>Thanks.
>
>On Mon, Jan 16, 2012 at 5:00 PM, Burton-West, Tom <tb...@umich.edu>wrote:
>
>> Hello ,
>>
>> Searching real-time sounds difficult with that amount of data. With large
>> documents, 3 million documents, and 5TB of data the index will be very
>> large. With indexes that large your performance will probably be I/O bound.
>>
>> Do you plan on allowing phrase or proximity searches? If so, your
>> performance will be even more I/O bound as documents that large will have
>> huge positions indexes that will need to be read into memory for processing
>> phrase queries. To reduce I/O you need as much of the index in memory
>> (Lucene/Solr caches, and operating system disk cache).  Every commit
>> invalidates the Solr/Lucene caches (unless the newer nrt code has solved
>> this for Solr).
>>
>> If you index and serve on the same server, you are also going to get
>> terrible response time whenever your commits trigger a large merge.
>>
>> If you need to service 10-100 qps or more, you may need to look at putting
>> your index on SSDs or spreading it over enough machines so it can stay in
>> memory.
>>
>> What kind of response times are you looking for and what query rate?
>>
>> We have somewhat smaller documents. We have 10 million documents and about
>> 6-8TB of data in HathiTrust and have spread the index over 12 shards on 4
>> machines (i.e. 3 shards per machine).   We get an average of around
>> 200-300ms response time but our 95th percentile times are about 800ms and
>> 99th percentile are around 2 seconds.  This is with an average load of less
>> than 1 query/second.
>>
>> As Otis suggested, you may want to implement a strategy that allows users
>> to search within the large documents by breaking the documents up into
>> smaller units. What we do is have two Solr indexes.  The first indexes
>> complete documents.  When the user clicks on a result, we index the entire
>> document on a page level in a small Solr index on-the-fly.  That way they
>> can search within the document and get page level results.
>>
>> More details about our setup:
>> http://www.hathitrust.org/blogs/large-scale-search
>>
>> Tom Burton-West
>> University of Michigan Library
>> www.hathitrust.org
>> -----Original Message-----
>>
>>
>
>
>

Re: Can Apache Solr Handle TeraByte Large Data

Posted by Memory Makers <me...@gmail.com>.
I've been toying with the idea of setting up an experiment to index a large
(1+ TB) document set -- any thoughts on an open data set that one could use
for this purpose?

Thanks.

On Mon, Jan 16, 2012 at 5:00 PM, Burton-West, Tom <tb...@umich.edu>wrote:

> Hello ,
>
> Searching real-time sounds difficult with that amount of data. With large
> documents, 3 million documents, and 5TB of data the index will be very
> large. With indexes that large your performance will probably be I/O bound.
>
> Do you plan on allowing phrase or proximity searches? If so, your
> performance will be even more I/O bound as documents that large will have
> huge positions indexes that will need to be read into memory for processing
> phrase queries. To reduce I/O you need as much of the index in memory
> (Lucene/Solr caches, and operating system disk cache).  Every commit
> invalidates the Solr/Lucene caches (unless the newer nrt code has solved
> this for Solr).
>
> If you index and serve on the same server, you are also going to get
> terrible response time whenever your commits trigger a large merge.
>
> If you need to service 10-100 qps or more, you may need to look at putting
> your index on SSDs or spreading it over enough machines so it can stay in
> memory.
>
> What kind of response times are you looking for and what query rate?
>
> We have somewhat smaller documents. We have 10 million documents and about
> 6-8TB of data in HathiTrust and have spread the index over 12 shards on 4
> machines (i.e. 3 shards per machine).   We get an average of around
> 200-300ms response time but our 95th percentile times are about 800ms and
> 99th percentile are around 2 seconds.  This is with an average load of less
> than 1 query/second.
>
> As Otis suggested, you may want to implement a strategy that allows users
> to search within the large documents by breaking the documents up into
> smaller units. What we do is have two Solr indexes.  The first indexes
> complete documents.  When the user clicks on a result, we index the entire
> document on a page level in a small Solr index on-the-fly.  That way they
> can search within the document and get page level results.
>
> More details about our setup:
> http://www.hathitrust.org/blogs/large-scale-search
>
> Tom Burton-West
> University of Michigan Library
> www.hathitrust.org
> -----Original Message-----
>
>

RE: Can Apache Solr Handle TeraByte Large Data

Posted by "Burton-West, Tom" <tb...@umich.edu>.
Hello,

Searching real-time sounds difficult with that amount of data. With large documents, 3 million documents, and 5TB of data the index will be very large. With indexes that large your performance will probably be I/O bound.  

Do you plan on allowing phrase or proximity searches? If so, your performance will be even more I/O bound as documents that large will have huge positions indexes that will need to be read into memory for processing phrase queries. To reduce I/O you need as much of the index in memory (Lucene/Solr caches, and operating system disk cache).  Every commit invalidates the Solr/Lucene caches (unless the newer nrt code has solved this for Solr).  

If you index and serve on the same server, you are also going to get terrible response time whenever your commits trigger a large merge.

If you need to service 10-100 qps or more, you may need to look at putting your index on SSDs or spreading it over enough machines so it can stay in memory.

What kind of response times are you looking for and what query rate?

We have somewhat smaller documents. We have 10 million documents and about 6-8TB of data in HathiTrust and have spread the index over 12 shards on 4 machines (i.e. 3 shards per machine).   We get an average of around 200-300ms response time but our 95th percentile times are about 800ms and 99th percentile are around 2 seconds.  This is with an average load of less than 1 query/second.

As Otis suggested, you may want to implement a strategy that allows users to search within the large documents by breaking the documents up into smaller units. What we do is have two Solr indexes.  The first indexes complete documents.  When the user clicks on a result, we index the entire document on a page level in a small Solr index on-the-fly.  That way they can search within the document and get page level results.
 
More details about our setup: http://www.hathitrust.org/blogs/large-scale-search

Tom Burton-West
University of Michigan Library
www.hathitrust.org
-----Original Message-----


Re: Can Apache Solr Handle TeraByte Large Data

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hello,

>________________________________
> From: mustafozbek <mu...@gmail.com>
> 
>All documents that we use are rich text documents and we parse them with
>tika. we need to search real time. 

Because of the real-time requirement, you'll need to use an unreleased/dev version of Solr.

>Robert Stewart wrote
>> Any idea how many documents your 5TB data contains? 
>There are about 3millions document. You see the problem is that we have
>documents large in size and small in numbers. Is that fine?


That's fine.  But you may want to think about breaking up large docs into smaller Solr docs: finding a match in a very large doc can make it hard for users to jump to the match or matches unless you highlight matches in the document and allow the user to jump from match to match.
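
For illustration, a minimal hedged SolrJ sketch of enabling highlighting on such queries (the core URL and the "content" field are hypothetical placeholders, not from this thread):

    // Hedged sketch: turn on highlighting so users can jump to matches within docs.
    // Assumes a recent SolrJ on the classpath; URL and field name are hypothetical.
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class HighlightExample {
        public static void main(String[] args) throws Exception {
            SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/docs").build();
            SolrQuery query = new SolrQuery("sun shines");
            query.setHighlight(true);              // hl=true
            query.addHighlightField("content");    // hl.fl=content
            query.setHighlightSnippets(3);         // up to 3 snippets per doc
            QueryResponse response = client.query(query);
            System.out.println(response.getHighlighting());
            client.close();
        }
    }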

Otis
----
Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html

Re: Can Apache Solr Handle TeraByte Large Data

Posted by mustafozbek <mu...@gmail.com>.
All of the documents we use are rich-text documents, and we parse them with
Tika. We need to search in real time.

Robert Stewart wrote
> Any idea how many documents your 5TB data contains? 
There are about 3 million documents. You see, the problem is that we have
documents that are large in size but small in number. Is that fine?


--
View this message in context: http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p3662567.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Can Apache Solr Handle TeraByte Large Data

Posted by Robert Stewart <bs...@gmail.com>.
Any idea how many documents your 5TB of data contains?  Certain features such as faceting depend more on the total number of documents than on the actual size of the data.

I have tested approx. 1 TB (100 million documents) running on a single machine (40 cores, 128 GB RAM), using distributed search across 10 shards (10 million docs each).  So running 10 SOLR processes.  Search performance is good (under 1 second avg. including faceting).

So based on that, for 5TB (assuming 500 million docs) you could probably shard across a few such machines and get decent performance with distributed search.

The indexes were sharded by time.  New documents go into a single index (the "current" index), and once that index reaches 10 million docs, a new index is created to become the "current" index.  Then the oldest index is dropped from search (so total remains 10 shards).  It is news data, so older data is less important.
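
For illustration, a hedged sketch of how such a manually managed, time-sharded setup is typically queried with SolrJ's legacy (non-SolrCloud) distributed search; the host names, core names, and facet field below are hypothetical:

    // Hedged sketch: distributed search across manually managed shards via the
    // "shards" parameter. Assumes a recent SolrJ; hosts/cores are hypothetical.
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class DistributedSearchExample {
        public static void main(String[] args) throws Exception {
            SolrClient client = new HttpSolrClient.Builder("http://host1:8983/solr/news_current").build();
            SolrQuery query = new SolrQuery("economy");
            // The shards parameter fans the query out to every listed core and merges the results.
            query.set("shards",
                "host1:8983/solr/news_2015_07,host1:8983/solr/news_2015_08,host2:8983/solr/news_current");
            query.setFacet(true);
            query.addFacetField("source_s");   // faceting cost grows with total doc count
            QueryResponse response = client.query(query);
            System.out.println(response.getResults().getNumFound() + " hits");
            client.close();
        }
    }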



On Jan 13, 2012, at 10:00 AM, <da...@ontrenet.com> <da...@ontrenet.com> wrote:

> 
> Maybe also have a look at these links.
> 
> http://www.hathitrust.org/blogs/large-scale-search/performance-5-million-volumes
> http://www.hathitrust.org/blogs/large-scale-search
> 
> On Fri, 13 Jan 2012 15:49:06 +0100, Daniel Brügge <da...@bruegge.eu>
> wrote:
>> Hi,
>> 
>> it's definitely a problem to store 5TB in Solr without using sharding. I
>> try to split data over solr instances,
>> so that the index will fit in my memory on the server.
>> 
>> I ran into trouble with a Solr using 50G index. 
>> 
>> Daniel
>> 
>> On Jan 13, 2012, at 1:08 PM, mustafozbek wrote:
>> 
>>> I am an apache solr user about a year. I used solr for simple search
>>> tools
>>> but now I want to use solr with 5TB of data. I assume that 5TB data
> will
>>> be
>>> 7TB when solr index it according to filter that I use. And then I will
>>> add
>>> nearly 50MB of data per hour to the same index.
>>> 1-	Are there any problem using single solr server with 5TB data.
> (without
>>> shards)
>>>  a-	Can solr server answers the queries in an acceptable time
>>>  b-	what is the expected time for commiting of 50MB data on 7TB index.
>>>  c-	Is there an upper limit for index size.
>>> 2-	what are the suggestions that you offer
>>>  a-	How many shards should I use
>>>  b-	Should I use solr cores
>>>  c-	What is the committing frequency you offered. (is 1 hour OK)
>>> 3-	are there any test results for this kind of large data
>>> 
>>> There is no available 5TB data, I just want to estimate what will be
> the
>>> result.
>>> Note: You can assume that hardware resourses are not a problem.
>>> 
>>> 
>>> --
>>> View this message in context:
>>> 
> http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p3656484.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Can Apache Solr Handle TeraByte Large Data

Posted by da...@ontrenet.com.
Maybe also have a look at these links.

http://www.hathitrust.org/blogs/large-scale-search/performance-5-million-volumes
http://www.hathitrust.org/blogs/large-scale-search

On Fri, 13 Jan 2012 15:49:06 +0100, Daniel Brügge <da...@bruegge.eu>
wrote:
> Hi,
> 
> it's definitely a problem to store 5TB in Solr without using sharding. I
> try to split data over solr instances,
> so that the index will fit in my memory on the server.
> 
> I ran into trouble with a Solr using 50G index. 
> 
> Daniel
> 
> On Jan 13, 2012, at 1:08 PM, mustafozbek wrote:
> 
>> I am an apache solr user about a year. I used solr for simple search
>> tools
>> but now I want to use solr with 5TB of data. I assume that 5TB data
will
>> be
>> 7TB when solr index it according to filter that I use. And then I will
>> add
>> nearly 50MB of data per hour to the same index.
>> 1-	Are there any problem using single solr server with 5TB data.
(without
>> shards)
>>   a-	Can solr server answers the queries in an acceptable time
>>   b-	what is the expected time for commiting of 50MB data on 7TB index.
>>   c-	Is there an upper limit for index size.
>> 2-	what are the suggestions that you offer
>>   a-	How many shards should I use
>>   b-	Should I use solr cores
>>   c-	What is the committing frequency you offered. (is 1 hour OK)
>> 3-	are there any test results for this kind of large data
>> 
>> There is no available 5TB data, I just want to estimate what will be
the
>> result.
>> Note: You can assume that hardware resourses are not a problem.
>> 
>> 
>> --
>> View this message in context:
>>
http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p3656484.html
>> Sent from the Solr - User mailing list archive at Nabble.com.

Re: Can Apache Solr Handle TeraByte Large Data

Posted by Daniel Brügge <da...@bruegge.eu>.
Hi,

It's definitely a problem to store 5TB in Solr without using sharding. I
try to split data over Solr instances
so that the index will fit into memory on the server.

I already ran into trouble with a single Solr instance holding a 50GB index.

Daniel

On Jan 13, 2012, at 1:08 PM, mustafozbek wrote:

> I am an apache solr user about a year. I used solr for simple search tools
> but now I want to use solr with 5TB of data. I assume that 5TB data will be
> 7TB when solr index it according to filter that I use. And then I will add
> nearly 50MB of data per hour to the same index.
> 1-	Are there any problem using single solr server with 5TB data. (without
> shards)
>   a-	Can solr server answers the queries in an acceptable time
>   b-	what is the expected time for commiting of 50MB data on 7TB index.
>   c-	Is there an upper limit for index size.
> 2-	what are the suggestions that you offer
>   a-	How many shards should I use
>   b-	Should I use solr cores
>   c-	What is the committing frequency you offered. (is 1 hour OK)
> 3-	are there any test results for this kind of large data
> 
> There is no available 5TB data, I just want to estimate what will be the
> result.
> Note: You can assume that hardware resourses are not a problem.
> 
> 
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p3656484.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Can Apache Solr Handle TeraByte Large Data

Posted by Mugeesh Husain <mu...@gmail.com>.
Thank you, Upayavira.

I think I have done all of these things using SolrJ, which was useful before
starting development of the project.
I hope I will not run into any issues using SolrJ, and I have gotten a lot out of
using it.

Thanks
Mugeesh Husain



--
View this message in context: http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4221066.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Can Apache Solr Handle TeraByte Large Data

Posted by Upayavira <uv...@odoko.co.uk>.
Post your docs in sets of 1000. Create a:

 List<SolrInputDocument> docs

Then add 1000 docs to it, then client.add(docs);

Repeat until your 40m are indexed.
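
For illustration, a minimal sketch of that batching loop; the core URL, field names, and document source here are hypothetical placeholders:

    // Hedged sketch of the "add in sets of 1000" loop described above.
    // Assumes a recent SolrJ; URL and field names are hypothetical.
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchIndexer {
        public static void main(String[] args) throws Exception {
            SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/files").build();
            List<SolrInputDocument> docs = new ArrayList<>();
            for (long i = 0; i < 40_000_000L; i++) {            // 40m documents
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + i);
                doc.addField("filename_s", "file-" + i + ".pdf");
                docs.add(doc);
                if (docs.size() == 1000) {                      // send each batch of 1000
                    client.add(docs);
                    docs.clear();
                }
            }
            if (!docs.isEmpty()) client.add(docs);              // flush the final partial batch
            client.commit();                                    // or rely on autoCommit instead
            client.close();
        }
    }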

Upayavira

On Wed, Aug 5, 2015, at 05:07 PM, Mugeesh Husain wrote:
> filesystem are about 40 millions of document it will iterate 40 times how
> may
> solrJ could not handle 40m times loops(before  indexing i have to split
> values from filename and make some operation then index to Solr)
> 
> Is it will continuous indexing using 40m times or i have to sleep in
> between
> some interaval.
> 
> Does it will take same time in compare of HTTP or  bin/post ?
> 
> 
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4221060.html
> Sent from the Solr - User mailing list archive at Nabble.com.

Re: Can Apache Solr Handle TeraByte Large Data

Posted by Mugeesh Husain <mu...@gmail.com>.
The filesystem holds about 40 million documents, so the loop will iterate 40 million times.
Could SolrJ fail to handle a loop of 40 million iterations? (Before indexing I have to split
values out of each filename and do some processing, then index to Solr.)

Will it keep indexing continuously through all 40 million iterations, or do I have to sleep
for some interval in between?

Will it take about the same time as HTTP posting or bin/post?



--
View this message in context: http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4221060.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Can Apache Solr Handle TeraByte Large Data

Posted by Upayavira <uv...@odoko.co.uk>.
If you are using Java, you will likely find SolrJ the best way - it uses
serialised Java objects to communicate with Solr - you don't need to
worry about that. Just use code similar to that earlier in the thread.
No XML, no CSV, just simple Java code.

Upayavira


On Wed, Aug 5, 2015, at 04:50 PM, Mugeesh Husain wrote:
> @Upayavira
> 
> Thanks these thing are most useful for my understanding 
> I have thing about i will create XML or CVS file from my requirement
> using
> java
> Then Index it via HTTP post or  bin/post 
> 
> I am not using DIH because i did't get any of  link or idea how to split
> data and add to solr one by one.(As i mention onmy requirement) 
> 
> tell me Indexing XML file or CVS files which one is a better way ?
> 
> with csv i noticed that it didn't parse the data into the correct fields.
> So
> how do we ensure that the data is correctly stored in Solr ?
> 
> Or XML is a correct way to parse it
> 
> 
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4221051.html
> Sent from the Solr - User mailing list archive at Nabble.com.

Re: Can Apache Solr Handle TeraByte Large Data

Posted by Mugeesh Husain <mu...@gmail.com>.
@Upayavira

Thanks, these things are most useful for my understanding.
I have been thinking that I will create XML or CSV files for my requirement using
Java,
then index them via HTTP POST or bin/post.

I am not using DIH because I did not find any link or idea for how to split the
data and add it to Solr one record at a time (as I mentioned in my requirement).

Tell me, indexing XML files or CSV files: which one is the better way?

With CSV I noticed that it didn't parse the data into the correct fields. So
how do we ensure that the data is correctly stored in Solr?

Or is XML the correct way to parse it?



--
View this message in context: http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4221051.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Can Apache Solr Handle TeraByte Large Data

Posted by Upayavira <uv...@odoko.co.uk>.

On Tue, Aug 4, 2015, at 06:13 PM, Mugeesh Husain wrote:
> @Upayavira if i uses Solrj for indexing.  autocommit or Softautocommit
> will
> work in case of SolJ

There are two ways to get content into Solr:

 * push it in via an HTTP post.
   - this is what SolrJ uses, what bin/post uses, and everything else
   other than:
 * DIH: this runs inside Solr and pulls content into the index via
 configurations

Personally, I'm not a fan of the DIH. It works for simple scenarios, but
as soon as your needs get a little complex, it seems to struggle, and it
seems your needs have become sufficiently complex already.

Solr itself does the autocommit, so it will work with anything that you
use to push content into Solr, SolrJ, bin/post, DIH or anything else.
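
For illustration, a hedged SolrJ sketch of the push route that leaves committing to Solr by using a per-request commitWithin instead of an explicit commit; the core URL and field names are hypothetical (the sample filename is the one from earlier in the thread):

    // Hedged sketch: push a document over HTTP via SolrJ and let Solr commit it
    // within a deadline (commitWithin) instead of calling commit() manually.
    // Assumes a recent SolrJ; the URL and field names are hypothetical.
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class CommitWithinExample {
        public static void main(String[] args) throws Exception {
            SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/files").build();
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "abc123");
            doc.addField("filename_s", "ARIA_SSN10_0007_LOCATION_0000129.pdf");
            client.add(doc, 60_000);   // ask Solr to make this visible within 60 seconds
            client.close();            // no explicit client.commit() needed
        }
    }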

Upayavira

Re: Can Apache Solr Handle TeraByte Large Data

Posted by Mugeesh Husain <mu...@gmail.com>.
@Upayavira: if I use SolrJ for indexing, will autoCommit or autoSoftCommit
work in the case of SolrJ?



--
View this message in context: http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220796.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Can Apache Solr Handle TeraByte Large Data

Posted by Upayavira <uv...@odoko.co.uk>.
Yes, you are right. If you are
doing a one-off indexing run, then a manual commit may well be the best
option, but generally autocommit is the better way.

Upayavira

On Mon, Aug 3, 2015, at 11:15 PM, Konstantin Gribov wrote:
> Upayavira, manual commit isn't a good advice, especially with small bulks
> or single document, is it? I see recommendations on using
> autoCommit+autoSoftCommit instead of manual commit mostly.
> 
> > Tue, 4 Aug 2015 at 1:00, Upayavira <uv...@odoko.co.uk>:
> 
> > SolrJ is just a "SolrClient". In pseudocode, you say:
> >
> > SolrClient client = new
> > SolrClient("http://localhost:8983/solr/whatever");
> >
> > List<SolrInputDocument> docs = new ArrayList<>();
> > SolrInputDocument doc = new SolrInputDocument();
> > doc.addField("id", "abc123");
> > doc.addField("some-text-field", "I like it when the sun shines");
> > docs.add(doc);
> > client.add(docs);
> > client.commit();
> >
> > (warning, the above is typed from memory)
> >
> > So, the question is simply how many documents do you add to docs before
> > you do client.add(docs);
> >
> > And how often (if at all) do you call client.commit().
> >
> > So when you are told "Use SolrJ", really, you are being told to write
> > some Java code that happens to use the SolrJ client library for Solr.
> >
> > Upayavira
> >
> >
> > On Mon, Aug 3, 2015, at 10:01 PM, Alexandre Rafalovitch wrote:
> > > Well,
> > >
> > > If it is just file names, I'd probably use SolrJ client, maybe with
> > > Java 8. Read file names, split the name into parts with regular
> > > expressions, stuff parts into different field names and send to Solr.
> > > Java 8 has FileSystem walkers, etc to make it easier.
> > >
> > > You could do it with DIH, but it would be with nested entities and the
> > > inner entity would probably try to parse the file. So, a lot of wasted
> > > effort if you just care about the file names.
> > >
> > > Or, I would just do a directory listing in the operating system and
> > > use regular expressions to split it into CSV file, which I would then
> > > import into Solr directly.
> > >
> > > In all of these cases, the question would be which field is the ID of
> > > the record to ensure no duplicates.
> > >
> > > Regards,
> > >    Alex.
> > >
> > > ----
> > > Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> > > http://www.solr-start.com/
> > >
> > >
> > > On 3 August 2015 at 15:34, Mugeesh Husain <mu...@gmail.com> wrote:
> > > > @Alexandre  No i dont need a content of a file. i am repeating my
> > requirement
> > > >
> > > > I have a 40 millions of files which is stored in a file systems,
> > > > the filename saved as ARIA_SSN10_0007_LOCATION_0000129.pdf
> > > >
> > > > I just  split all Value from a filename only,these values i have to
> > index.
> > > >
> > > > I am interested to index value to solr not file contains.
> > > >
> > > > I have tested the DIH from a file system its work fine but i dont know
> > how
> > > > can i implement my code in DIH
> > > > if my code get some value than how i can i index it using DIH.
> > > >
> > > > If i will use DIH then How i will make split operation and get value
> > from
> > > > it.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > --
> > > > View this message in context:
> > http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220552.html
> > > > Sent from the Solr - User mailing list archive at Nabble.com.
> >
> -- 
> Best regards,
> Konstantin Gribov

Re: Can Apache Solr Handle TeraByte Large Data

Posted by Konstantin Gribov <gr...@gmail.com>.
Upayavira, manual commit isn't good advice, especially with small batches
or single documents, is it? I mostly see recommendations on using
autoCommit+autoSoftCommit instead of manual commits.

Tue, 4 Aug 2015 at 1:00, Upayavira <uv...@odoko.co.uk>:

> SolrJ is just a "SolrClient". In pseudocode, you say:
>
> SolrClient client = new
> SolrClient("http://localhost:8983/solr/whatever");
>
> List<SolrInputDocument> docs = new ArrayList<>();
> SolrInputDocument doc = new SolrInputDocument();
> doc.addField("id", "abc123");
> doc.addField("some-text-field", "I like it when the sun shines");
> docs.add(doc);
> client.add(docs);
> client.commit();
>
> (warning, the above is typed from memory)
>
> So, the question is simply how many documents do you add to docs before
> you do client.add(docs);
>
> And how often (if at all) do you call client.commit().
>
> So when you are told "Use SolrJ", really, you are being told to write
> some Java code that happens to use the SolrJ client library for Solr.
>
> Upayavira
>
>
> On Mon, Aug 3, 2015, at 10:01 PM, Alexandre Rafalovitch wrote:
> > Well,
> >
> > If it is just file names, I'd probably use SolrJ client, maybe with
> > Java 8. Read file names, split the name into parts with regular
> > expressions, stuff parts into different field names and send to Solr.
> > Java 8 has FileSystem walkers, etc to make it easier.
> >
> > You could do it with DIH, but it would be with nested entities and the
> > inner entity would probably try to parse the file. So, a lot of wasted
> > effort if you just care about the file names.
> >
> > Or, I would just do a directory listing in the operating system and
> > use regular expressions to split it into CSV file, which I would then
> > import into Solr directly.
> >
> > In all of these cases, the question would be which field is the ID of
> > the record to ensure no duplicates.
> >
> > Regards,
> >    Alex.
> >
> > ----
> > Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> > http://www.solr-start.com/
> >
> >
> > On 3 August 2015 at 15:34, Mugeesh Husain <mu...@gmail.com> wrote:
> > > @Alexandre  No i dont need a content of a file. i am repeating my
> requirement
> > >
> > > I have a 40 millions of files which is stored in a file systems,
> > > the filename saved as ARIA_SSN10_0007_LOCATION_0000129.pdf
> > >
> > > I just  split all Value from a filename only,these values i have to
> index.
> > >
> > > I am interested to index value to solr not file contains.
> > >
> > > I have tested the DIH from a file system its work fine but i dont know
> how
> > > can i implement my code in DIH
> > > if my code get some value than how i can i index it using DIH.
> > >
> > > If i will use DIH then How i will make split operation and get value
> from
> > > it.
> > >
> > >
> > >
> > >
> > >
> > > --
> > > View this message in context:
> http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220552.html
> > > Sent from the Solr - User mailing list archive at Nabble.com.
>
-- 
Best regards,
Konstantin Gribov

Re: Can Apache Solr Handle TeraByte Large Data

Posted by Upayavira <uv...@odoko.co.uk>.
SolrJ is just a "SolrClient". In pseudocode, you say:

SolrClient client = new
SolrClient("http://localhost:8983/solr/whatever");

List<SolrInputDocument> docs = new ArrayList<>();
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "abc123");
doc.addField("some-text-field", "I like it when the sun shines");
docs.add(doc);
client.add(docs);
client.commit();

(warning, the above is typed from memory)

So, the question is simply how many documents do you add to docs before
you do client.add(docs);

And how often (if at all) do you call client.commit().

So when you are told "Use SolrJ", really, you are being told to write
some Java code that happens to use the SolrJ client library for Solr.

Upayavira


On Mon, Aug 3, 2015, at 10:01 PM, Alexandre Rafalovitch wrote:
> Well,
> 
> If it is just file names, I'd probably use SolrJ client, maybe with
> Java 8. Read file names, split the name into parts with regular
> expressions, stuff parts into different field names and send to Solr.
> Java 8 has FileSystem walkers, etc to make it easier.
> 
> You could do it with DIH, but it would be with nested entities and the
> inner entity would probably try to parse the file. So, a lot of wasted
> effort if you just care about the file names.
> 
> Or, I would just do a directory listing in the operating system and
> use regular expressions to split it into CSV file, which I would then
> import into Solr directly.
> 
> In all of these cases, the question would be which field is the ID of
> the record to ensure no duplicates.
> 
> Regards,
>    Alex.
> 
> ----
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
> 
> 
> On 3 August 2015 at 15:34, Mugeesh Husain <mu...@gmail.com> wrote:
> > @Alexandre  No i dont need a content of a file. i am repeating my requirement
> >
> > I have a 40 millions of files which is stored in a file systems,
> > the filename saved as ARIA_SSN10_0007_LOCATION_0000129.pdf
> >
> > I just  split all Value from a filename only,these values i have to index.
> >
> > I am interested to index value to solr not file contains.
> >
> > I have tested the DIH from a file system its work fine but i dont know how
> > can i implement my code in DIH
> > if my code get some value than how i can i index it using DIH.
> >
> > If i will use DIH then How i will make split operation and get value from
> > it.
> >
> >
> >
> >
> >
> > --
> > View this message in context: http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220552.html
> > Sent from the Solr - User mailing list archive at Nabble.com.

Re: Can Apache Solr Handle TeraByte Large Data

Posted by Mugeesh Husain <mu...@gmail.com>.
@Mikhail: Using the data import handler, if I define my baseDir as
D:/work/folder, will it also work for sub-folders, sub-folders of sub-folders ...
etc.?



--
View this message in context: http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4221063.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Can Apache Solr Handle TeraByte Large Data

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
On Tue, Aug 4, 2015 at 8:10 PM, Mugeesh Husain <mu...@gmail.com> wrote:

> Thanks you Erik, I will preferred XML files instead of csv.
> On my requirement if i want to use DIH for indexing than how could i split
> these operation or include java clode to DIH..
>
Here is my favorite way to tweak data in DIH:
https://wiki.apache.org/solr/DataImportHandler#ScriptTransformer
You can even do it in Java (https://wiki.apache.org/solr/DIHCustomTransformer),
but personally I prefer JavaScript.

Note that, as a big fan of DIH, I have to say it's not an option in the case of
SolrCloud; I explained why here:
http://blog.griddynamics.com/2015/07/how-to-import-structured-data-into-solr.html

> I have googled but not get such type of requirement.
>
> provide my any of link for it or some suggestion to do it.
>
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220793.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
<mk...@griddynamics.com>

Re: Can Apache Solr Handle TeraByte Large Data

Posted by Mugeesh Husain <mu...@gmail.com>.
Thank you, Erik. I will prefer XML files instead of CSV.
For my requirement, if I want to use DIH for indexing, how could I split
these operations or include Java code in DIH?
I have googled but have not found this kind of requirement covered.
Please provide me a link for it or some suggestion on how to do it.





--
View this message in context: http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220793.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Can Apache Solr Handle TeraByte Large Data

Posted by Erik Hatcher <er...@gmail.com>.
If you have data that consists only of an id (the full filename) and a filename (indexed, tokenized), 40M of those will fit comfortably into a single shard, provided there is enough RAM to operate.

I know SolrJ is tossed out there a lot as a/the way to index - but if you’ve got a directory tree of files and want to index _just_ the file names then a shell script that generated a CSV could be easy and clean.  It’s trivial to `bin/post -c <your collection> data.csv`

—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com <http://www.lucidworks.com/>




> On Aug 4, 2015, at 5:51 AM, Mugeesh Husain <mu...@gmail.com> wrote:
> 
> Thank @Alexandre and  Erickson ,Hatcher.
> 
> I will generate ID of MD5  with help of filename using java.
> I can do it with help of SolrJ nicely because i am java developer apart from
> this 
> The question raised that data is too large i think it will break into
> multiple shards(core)
> Using multi core indexing how i can analysed duplicate ID while reindexing
> the whole.(Using Solrj) and
> How i will analysed one core contains such amount of data and other etc.
> 
> I have decide i will do it with SolrJ because i don't have good
> understanding with DIH for such type operation which i needed on my
> requirement. i'd google but unable to find such type of DIH Example which i
> can implement on my problem.
> 
> 
> 
> 
> 
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220673.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Can Apache Solr Handle TeraByte Large Data

Posted by Mugeesh Husain <mu...@gmail.com>.
Thanks @Alexandre, Erickson and Hatcher.

I will generate an MD5 ID from the filename using Java.
I can do that nicely with the help of SolrJ because I am a Java developer. Apart from
this,
the question that arises is that the data is too large; I think it will have to be broken into
multiple shards (cores).
With multi-core indexing, how can I detect duplicate IDs while reindexing
the whole data set (using SolrJ)? And
how can I tell how much data one core contains versus another, etc.?

I have decided I will do it with SolrJ because I don't have a good
understanding of DIH for the kind of operation I need for my
requirement. I have googled but was unable to find a DIH example of this type that I
can apply to my problem.





--
View this message in context: http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220673.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Can Apache Solr Handle TeraByte Large Data

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Well,

If it is just file names, I'd probably use SolrJ client, maybe with
Java 8. Read file names, split the name into parts with regular
expressions, stuff parts into different field names and send to Solr.
Java 8 has FileSystem walkers, etc to make it easier.

You could do it with DIH, but it would be with nested entities and the
inner entity would probably try to parse the file. So, a lot of wasted
effort if you just care about the file names.

Or, I would just do a directory listing in the operating system and
use regular expressions to split it into CSV file, which I would then
import into Solr directly.

In all of these cases, the question would be which field is the ID of
the record to ensure no duplicates.
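
For illustration, a hedged sketch of that first approach (a Java 8 Files.walk over the tree, a split on the filename, and SolrJ sending the parts); the root path, field names, and the choice of the filename as the unique ID are hypothetical:

    // Hedged sketch: walk the filesystem with Java 8, split each filename on
    // underscores, and send the parts to Solr as fields. Assumes a recent SolrJ;
    // the root path, field names, and filename-as-ID choice are hypothetical.
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.stream.Stream;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class FilenameIndexer {
        public static void main(String[] args) throws Exception {
            SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/files").build();
            try (Stream<Path> paths = Files.walk(Paths.get("/data/pdfs"))) {
                paths.filter(Files::isRegularFile).forEach(path -> {
                    String name = path.getFileName().toString();   // e.g. ARIA_SSN10_0007_LOCATION_0000129.pdf
                    String[] parts = name.replaceAll("\\.pdf$", "").split("_");
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", name);                      // filename as unique key avoids duplicates
                    for (int i = 0; i < parts.length; i++) {
                        doc.addField("part_" + i + "_s", parts[i]);
                    }
                    try {
                        client.add(doc, 60_000);                   // commitWithin 60s
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                });
            }
            client.close();
        }
    }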

Regards,
   Alex.

----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 3 August 2015 at 15:34, Mugeesh Husain <mu...@gmail.com> wrote:
> @Alexandre  No i dont need a content of a file. i am repeating my requirement
>
> I have a 40 millions of files which is stored in a file systems,
> the filename saved as ARIA_SSN10_0007_LOCATION_0000129.pdf
>
> I just  split all Value from a filename only,these values i have to index.
>
> I am interested to index value to solr not file contains.
>
> I have tested the DIH from a file system its work fine but i dont know how
> can i implement my code in DIH
> if my code get some value than how i can i index it using DIH.
>
> If i will use DIH then How i will make split operation and get value from
> it.
>
>
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220552.html
> Sent from the Solr - User mailing list archive at Nabble.com.

Re: Can Apache Solr Handle TeraByte Large Data

Posted by Mugeesh Husain <mu...@gmail.com>.
@Alexandre: No, I don't need the content of the files. I am repeating my requirement:

I have 40 million files stored in a file system;
the filenames are saved like ARIA_SSN10_0007_LOCATION_0000129.pdf.

I just split out all of the values from the filename only; these values are what I have to index.

I am interested in indexing those values in Solr, not the file contents.

I have tested DIH from a file system and it works fine, but I don't know how
I can implement my code in DIH:
if my code extracts some values, how can I index them using DIH?

If I use DIH, how will I perform the split operation and get the values from
it?





--
View this message in context: http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220552.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Can Apache Solr Handle TeraByte Large Data

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Just to reconfirm, are you indexing file content? Because if you are,
you need to be aware most of the PDF do not extract well, as they do
not have text flow preserved.

If you are indexing PDF files, I would run a sample through Tika
directly (that's what Solr uses under the covers anyway) and see what
the output looks like.

Apart from that, either SolrJ or DIH would work. If this is for a
production system, I'd use SolrJ with client-side Tika parsing. But
you could use DIH for a quick test run.

Regards,
   Alex.
----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 3 August 2015 at 13:56, Mugeesh Husain <mu...@gmail.com> wrote:
> Hi Alexandre,
> I have a 40 millions of files which is stored in a file systems,
> the filename saved as ARIA_SSN10_0007_LOCATION_0000129.pdf
> 1.)I have to split all underscore value from a filename and these value have
> to be index to the solr.
> 2.)Do Not need file contains(Text) to index.
>
> You Told me "The answer is Yes" i didn't get in which way you said Yes.
>
> Thanks
>
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220527.html
> Sent from the Solr - User mailing list archive at Nabble.com.

Re: Can Apache Solr Handle TeraByte Large Data

Posted by Erick Erickson <er...@gmail.com>.
I'd go with SolrJ personally. For a terabyte of data that (I'm inferring)
consists of PDF files and the like (aka "semi-structured documents"), you'll
need to have Tika parse out the data you need to index. And doing
that through posting or DIH puts all the analysis on the Solr servers,
which will work, but not optimally.

Here's something to get you started:

https://lucidworks.com/blog/indexing-with-solrj/
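
For illustration, a hedged sketch of that client-side pattern (Tika extraction feeding SolrJ), which keeps the parsing load off the Solr servers; the file path, core URL, and field names are hypothetical:

    // Hedged sketch: parse a file with Tika on the client and index the extracted
    // text with SolrJ. Assumes recent Tika and SolrJ jars on the classpath;
    // the path, URL, and field names are hypothetical.
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.metadata.TikaCoreProperties;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class TikaSolrJIndexer {
        public static void main(String[] args) throws Exception {
            SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/docs").build();
            AutoDetectParser parser = new AutoDetectParser();
            BodyContentHandler handler = new BodyContentHandler(-1);   // -1 = no write limit
            Metadata metadata = new Metadata();
            try (InputStream in = Files.newInputStream(Paths.get("/data/pdfs/sample.pdf"))) {
                parser.parse(in, handler, metadata, new ParseContext());
            }
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "sample.pdf");
            String title = metadata.get(TikaCoreProperties.TITLE);
            if (title != null) doc.addField("title_s", title);
            doc.addField("content_txt", handler.toString());
            client.add(doc);
            client.commit();
            client.close();
        }
    }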

Best,
Erick

On Mon, Aug 3, 2015 at 1:56 PM, Mugeesh Husain <mu...@gmail.com> wrote:
> Hi Alexandre,
> I have a 40 millions of files which is stored in a file systems,
> the filename saved as ARIA_SSN10_0007_LOCATION_0000129.pdf
> 1.)I have to split all underscore value from a filename and these value have
> to be index to the solr.
> 2.)Do Not need file contains(Text) to index.
>
> You Told me "The answer is Yes" i didn't get in which way you said Yes.
>
> Thanks
>
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220527.html
> Sent from the Solr - User mailing list archive at Nabble.com.

Re: Can Apache Solr Handle TeraByte Large Data

Posted by Mugeesh Husain <mu...@gmail.com>.
@Erik Hatcher: You mean I have to use SolrJ for indexing, right?

Can SolrJ handle the large amount of data I mentioned in my previous post?
If I use DIH, how will I split the values from the filenames, etc.?

I want to start my development in the right direction; that is why I am a little
confused about which way to start on my
requirement.

Please tell me: when you guys told me yes, is the yes for SolrJ or for DIH?



--
View this message in context: http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220550.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Can Apache Solr Handle TeraByte Large Data

Posted by Erick Erickson <er...@gmail.com>.
Ahhh, listen to Hatcher if you're not indexing the _contents_ of the
files, just the filenames....

Erick

On Mon, Aug 3, 2015 at 2:22 PM, Erik Hatcher <er...@gmail.com> wrote:
> Most definitely yes given your criteria below.  If you don’t care for the text to be parsed and indexed within the files, a simple file system crawler that just got the directory listings and posted the file names split as you’d like to Solr would suffice it sounds like.
> —
> Erik Hatcher, Senior Solutions Architect
> http://www.lucidworks.com <http://www.lucidworks.com/>
>
>
>
>
>> On Aug 3, 2015, at 1:56 PM, Mugeesh Husain <mu...@gmail.com> wrote:
>>
>> Hi Alexandre,
>> I have a 40 millions of files which is stored in a file systems,
>> the filename saved as ARIA_SSN10_0007_LOCATION_0000129.pdf
>> 1.)I have to split all underscore value from a filename and these value have
>> to be index to the solr.
>> 2.)Do Not need file contains(Text) to index.
>>
>> You Told me "The answer is Yes" i didn't get in which way you said Yes.
>>
>> Thanks
>>
>>
>>
>>
>> --
>> View this message in context: http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220527.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: Can Apache Solr Handle TeraByte Large Data

Posted by Erik Hatcher <er...@gmail.com>.
Most definitely yes, given your criteria below.  If you don’t need the text within the files to be parsed and indexed, a simple file system crawler that just gets the directory listings and posts the file names, split as you’d like, to Solr would suffice, it sounds like.
—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com <http://www.lucidworks.com/>




> On Aug 3, 2015, at 1:56 PM, Mugeesh Husain <mu...@gmail.com> wrote:
> 
> Hi Alexandre,
> I have a 40 millions of files which is stored in a file systems,
> the filename saved as ARIA_SSN10_0007_LOCATION_0000129.pdf
> 1.)I have to split all underscore value from a filename and these value have
> to be index to the solr.
> 2.)Do Not need file contains(Text) to index.
> 
> You Told me "The answer is Yes" i didn't get in which way you said Yes.
> 
> Thanks
> 
> 
> 
> 
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220527.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Can Apache Solr Handle TeraByte Large Data

Posted by Mugeesh Husain <mu...@gmail.com>.
Hi Alexandre,
I have 40 million files stored in a file system;
the filenames are saved like ARIA_SSN10_0007_LOCATION_0000129.pdf.
1.) I have to split all of the underscore-separated values out of each filename, and these values have
to be indexed in Solr.
2.) I do not need the file contents (text) to be indexed.

You told me "The answer is Yes"; I didn't get in which way you meant Yes.

Thanks




--
View this message in context: http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220527.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Can Apache Solr Handle TeraByte Large Data

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
That's still a VERY open question. The answer is Yes, but the details
depend on the shape and source of your data. And the search you are
anticipating.

Is this a lot of entries with a small number of fields, or a
relatively small number of entries with huge field counts? Do you
need to store/return all of those fields, or just search them?

Is the content coming as one huge file (in which format?) or from an
external source such as a database?

And so on.

Regards,
   Alex.
----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 3 August 2015 at 11:42, Mugeesh Husain <mu...@gmail.com> wrote:
> Hi,
> I am new in solr development and have a same requirement and I have already
> got some knowledge such as how many shard have to created such amount of
> data at all. with help of googling.
>
> I want to take Some suggestion there are so many method to do indexing such
> as DIH,solr,Solrj.
>
> Please suggest me in which way i have to do it.
> 1.) Should i  use Solrj
> 1.) Should i  use DIH
> 1.) Should i  use post method(in terminal)
>
> or Is there any other way for indexing such amount of data.
>
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220469.html
> Sent from the Solr - User mailing list archive at Nabble.com.

Re: Can Apache Solr Handle TeraByte Large Data

Posted by Mugeesh Husain <mu...@gmail.com>.
Hi,
I am new to Solr development and have the same requirement. With the help of
googling I have already gained some knowledge, such as how many shards have to be
created for this amount of data.

I want to ask for some suggestions: there are so many methods to do indexing, such
as DIH, Solr's post tool, and SolrJ.

Please suggest which way I should do it:
1.) Should I use SolrJ?
2.) Should I use DIH?
3.) Should I use the post method (in the terminal)?

Or is there any other way to index this amount of data?




--
View this message in context: http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220469.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Can Apache Solr Handle TeraByte Large Data

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hello,
 
Inline

----- Original Message -----
> From: mustafozbek <mu...@gmail.com>
> 
> I am an apache solr user about a year. I used solr for simple search tools
> but now I want to use solr with 5TB of data. I assume that 5TB data will be
> 7TB when solr index it according to filter that I use. And then I will add
> nearly 50MB of data per hour to the same index.
> 1-    Are there any problem using single solr server with 5TB data. (without
> shards)
>    a-    Can solr server answers the queries in an acceptable time

Not likely, unless the diversity of queries is very small and the OS can keep the relevant parts of the index cached and the Solr caches get hit a lot.

>    b-    what is the expected time for commiting of 50MB data on 7TB index.

Depends on settings like ramBufferSizeMB and how you add the data (e.g. via DIH, via SolrJ, via CSV import...).

>    c-    Is there an upper limit for index size.

Yes, Lucene's internal doc IDs put a limit on index size, but you will hit hardware limits before you hit that one.

> 2-    what are the suggestions that you offer
>    a-    How many shards should I use

Depends primarily on the number of servers available and their capacity.

>    b-    Should I use solr cores

Sounds like you should really start by using SolrCloud.

>    c-    What is the committing frequency you offered. (is 1 hour OK)

Depends on how often you want to see new data show up in search results.  Some people need that to be immediate, or within 1 second or 1 hour, while some are OK with 24h.


> 3-    are there any test results for this kind of large data


Nothing official, but it's been done.  For example, we've done large-scale stuff like this with Solr for our clients at Sematext, but we can't publish technical details.

> There is no available 5TB data, I just want to estimate what will be the
> result.
> Note: You can assume that hardware resourses are not a problem.


Otis
----
Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html